Re: [gentoo-dev] sources.gentoo.org instability

2011-12-08 Thread Alec Warner
2011/12/5 Chí-Thanh Christopher Nguyễn :
> Alec Warner schrieb:
>>> Seriously, what do we gain from crawlers accessing
>>> sources.gentoo.org? I can't really remember seeing it once in a
>>> Google query result...
>>
>> We want the site searchable.
>
>>>> The majority of the expensive requests are package.mask and
>>>> use.local.desc queries by crawlers, for example crawling the entire
>>>> 13000-revision history of package.mask (or similar).
>
> Would it be feasible to use mod_rewrite to direct the most expensive
> requests to a static copy, which is re-generated every
> ${REASONABLE_TIMEFRAME}?

For now, user agents that look like bots get sent to
sources2.gentoo.org (via an HTTP 302, not a permanent redirect), while
humans stay on sources.gentoo.org. Assuming the crawlers and indexing
systems follow the spec, our search results should not get rewritten
to sources2.gentoo.org (that would surprise me greatly... wait, no it
wouldn't ;p)
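
As a rough illustration of the routing described above, here is a
minimal Python sketch. The bot pattern, the http scheme, and the
function names are assumptions made for the example; the actual rules
used on sources.gentoo.org are not given in this thread.

```python
import re

# Hypothetical pattern: the real matching rules are not stated in the thread.
BOT_PATTERN = re.compile(r"bot|crawl|spider|slurp", re.IGNORECASE)
BOT_HOST = "sources2.gentoo.org"


def looks_like_bot(user_agent):
    """Return True if the User-Agent string resembles a crawler."""
    return bool(BOT_PATTERN.search(user_agent or ""))


def route_request(user_agent, path):
    """Return (302, location) for bot-like clients, or None to serve locally.

    A 302 is a temporary redirect, so a well-behaved indexer should keep
    the original URL in its index rather than the redirect target.
    """
    if looks_like_bot(user_agent):
        return 302, "http://%s%s" % (BOT_HOST, path)
    return None
```

Matching on the User-Agent header is only a heuristic: a crawler that
sends a browser-like string would still land on the main host.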

Robin added a caching layer for some segments of the application; I am
looking at cProfile dumps and discussing pain points with upstream.
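
For readers unfamiliar with cProfile, a minimal sketch of collecting
and summarizing a profile with the Python standard library follows.
The render_history function is a hypothetical stand-in for an
expensive view; it is not viewvc code.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


def render_history(revisions):
    """Hypothetical stand-in for an expensive page render."""
    return sum(len(str(r)) for r in range(revisions))
```

Calling profile_call(render_history, 13000) returns the function's
result together with a report string; Profile.dump_stats(filename)
would instead write a dump file that pstats can load later.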

-A

>
>
> Best regards,
> Chí-Thanh Christopher Nguyễn
>



Re: [gentoo-dev] sources.gentoo.org instability

2011-12-05 Thread Chí-Thanh Christopher Nguyễn
Alec Warner schrieb:
>> Seriously, what do we gain from crawlers accessing sources.gentoo.org?
>> I can't really remember seeing it once in a Google query result...
> 
> We want the site searchable.

>>> The majority of the expensive requests are package.mask and
>>> use.local.desc queries by crawlers, for example crawling the entire
>>> 13000-revision history of package.mask (or similar).

Would it be feasible to use mod_rewrite to direct the most expensive
requests to a static copy, which is re-generated every
${REASONABLE_TIMEFRAME}?
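
The "static copy, regenerated periodically" idea can be sketched
independently of mod_rewrite. The following Python sketch is only an
illustration: the class name, the render callback, and the max_age
parameter are hypothetical, and it keeps copies in memory rather than
on disk.

```python
import time


class StaticCopyCache:
    """Serve a stored copy of an expensive page, regenerating it at most
    once per max_age seconds."""

    def __init__(self, render, max_age, clock=time.monotonic):
        self._render = render    # callable: path -> page body
        self._max_age = max_age  # seconds a stored copy stays valid
        self._clock = clock      # injectable for testing
        self._copies = {}        # path -> (generated_at, body)

    def get(self, path):
        now = self._clock()
        cached = self._copies.get(path)
        if cached is not None and now - cached[0] < self._max_age:
            return cached[1]     # still fresh: skip the expensive render
        body = self._render(path)
        self._copies[path] = (now, body)
        return body
```

With max_age set to one hour, any number of crawler hits on the same
page within that hour would cost a single render.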


Best regards,
Chí-Thanh Christopher Nguyễn



Re: [gentoo-dev] sources.gentoo.org instability

2011-12-05 Thread Alec Warner
On Mon, Dec 5, 2011 at 3:48 AM, Andreas K. Huettel  wrote:
>
> Seriously, what do we gain from crawlers accessing sources.gentoo.org?
> I can't really remember seeing it once in a Google query result...

We want the site searchable.

>
> Possibly it would not even be required to deny all requests, but just deny
> everything related to ancient history...
>
>> Hello,
>>
>> For a while, sources.gentoo.org has been puttering along and its
>> health has slowly declined. We migrated it to some shiny newer
>> hardware in an attempt to mitigate the problem, but that did not pan
>> out. 90% (or more) of sources.gentoo.org traffic comes from crawler
>> bots, not actual humans. That said, if we cannot serve requests to
>> the bots within our timeouts, we serve 500s instead, which is never
>> really what we want (particularly when we have spent 20s of CPU
>> calculating 80% of the response only to see the client time out :/).
>>
>> The majority of the expensive requests are package.mask and
>> use.local.desc queries by crawlers, for example crawling the entire
>> 13000-revision history of package.mask (or similar).
>>
>> While it is likely we will monkey-patch viewvc to be less wasteful,
>> in the meantime I have removed use.local.desc from sources.gentoo.org
>> (and also from anoncvs, because they share the same repo). I hope
>> this is a short-term (order of weeks) hack.
>>
>> -A
>
> --
> Andreas K. Huettel
> Gentoo Linux developer
> kde, sci, arm, tex, printing
>
>



Re: [gentoo-dev] sources.gentoo.org instability

2011-12-05 Thread Andreas K. Huettel

Seriously, what do we gain from crawlers accessing sources.gentoo.org?
I can't really remember seeing it once in a Google query result...

Possibly it would not even be required to deny all requests, but just deny 
everything related to ancient history...
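
One way to read "deny everything related to ancient history" is a
check on the revision a crawler asks for. The Python sketch below is
an assumption-heavy illustration: the "revision" query parameter, the
CVS-style "1.N" numbering, and the keep_recent threshold are all
hypothetical choices for the example, not a description of the actual
viewvc setup.

```python
from urllib.parse import parse_qs, urlsplit


def is_ancient_request(url, head_revision, keep_recent=100):
    """Return True if url asks for a revision older than the newest
    keep_recent revisions (assumed CVS-style numbers such as "1.42")."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("revision")
    if not values:
        return False  # no specific revision requested
    try:
        requested = int(values[0].rsplit(".", 1)[-1])
    except ValueError:
        return False  # unrecognized format: do not block
    return requested < head_revision - keep_recent
```

A front end could answer such requests from crawlers with a 403 while
still serving recent revisions to them.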

> Hello,
> 
> For a while, sources.gentoo.org has been puttering along and its
> health has slowly declined. We migrated it to some shiny newer
> hardware in an attempt to mitigate the problem, but that did not pan
> out. 90% (or more) of sources.gentoo.org traffic comes from crawler
> bots, not actual humans. That said, if we cannot serve requests to
> the bots within our timeouts, we serve 500s instead, which is never
> really what we want (particularly when we have spent 20s of CPU
> calculating 80% of the response only to see the client time out :/).
>
> The majority of the expensive requests are package.mask and
> use.local.desc queries by crawlers, for example crawling the entire
> 13000-revision history of package.mask (or similar).
>
> While it is likely we will monkey-patch viewvc to be less wasteful,
> in the meantime I have removed use.local.desc from sources.gentoo.org
> (and also from anoncvs, because they share the same repo). I hope
> this is a short-term (order of weeks) hack.
> 
> -A

-- 
Andreas K. Huettel
Gentoo Linux developer
kde, sci, arm, tex, printing




[gentoo-dev] sources.gentoo.org instability

2011-12-04 Thread Alec Warner
Hello,

For a while, sources.gentoo.org has been puttering along and its
health has slowly declined. We migrated it to some shiny newer
hardware in an attempt to mitigate the problem, but that did not pan
out. 90% (or more) of sources.gentoo.org traffic comes from crawler
bots, not actual humans. That said, if we cannot serve requests to the
bots within our timeouts, we serve 500s instead, which is never really
what we want (particularly when we have spent 20s of CPU calculating
80% of the response only to see the client time out :/).

The majority of the expensive requests are package.mask and
use.local.desc queries by crawlers, for example crawling the entire
13000-revision history of package.mask (or similar).

While it is likely we will monkey-patch viewvc to be less wasteful, in
the meantime I have removed use.local.desc from sources.gentoo.org
(and also from anoncvs, because they share the same repo). I hope this
is a short-term (order of weeks) hack.

-A