Re: [gentoo-dev] sources.gentoo.org instability
2011/12/5 Chí-Thanh Christopher Nguyễn:
> Alec Warner wrote:
>>> Seriously, what do we gain from crawlers accessing
>>> sources.gentoo.org? I can't really remember seeing it once in a
>>> Google query result...
>>
>> We want the site searchable.
>
>>>> The majority of the expensive requests are related to package.mask
>>>> and use.local.desc queries by crawlers. Like crawling the entire
>>>> 13000 rev history for package.mask (or similar.)
>
> Would it be feasible to use mod_rewrite to direct the most expensive
> requests to a static copy, which is re-generated every
> ${REASONABLE_TIMEFRAME}?

For now, user-agents that look like a bot get sent to sources2.gentoo.org (via HTTP 302, not a permanent redirect), and humans stay on sources.gentoo.org. Assuming the crawlers and indexing systems follow the spec, hopefully our search results do not all get rewritten to sources2.gentoo.org (that would surprise me greatly...wait no it wouldn't ;p)

Robin added a caching layer for some segments of the application; I am looking at cProfile dumps and discussing pain points with upstream.

-A

> Best regards,
> Chí-Thanh Christopher Nguyễn
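The bot split described in this message can be sketched roughly as follows. This is only an illustration, not the rule set actually deployed: the pattern list in BOT_PATTERN and the helper names looks_like_bot and route_request are assumptions, and the thread does not say how the real matching is done. Only the target host and the use of a 302 come from the message.

```python
import re

# Hypothetical token list; the real matching rules are not given in the thread.
BOT_PATTERN = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)

# Host named in the message as the destination for crawler traffic.
CRAWLER_HOST = "sources2.gentoo.org"


def looks_like_bot(user_agent):
    """Return True when the User-Agent header contains a crawler-like token."""
    return bool(BOT_PATTERN.search(user_agent or ""))


def route_request(user_agent, path):
    """Return (status, headers) for a request arriving at sources.gentoo.org.

    Bots get a temporary redirect (302) to the crawler host. Everyone else
    is served locally, which this sketch signals with a (200, {}) placeholder.
    """
    if looks_like_bot(user_agent):
        return 302, {"Location": f"http://{CRAWLER_HOST}{path}"}
    return 200, {}
```

A 302 is a temporary redirect, so a crawler that follows the spec should keep indexing the original sources.gentoo.org URL rather than the redirect target.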
Re: [gentoo-dev] sources.gentoo.org instability
Alec Warner wrote:
>> Seriously, what do we gain from crawlers accessing
>> sources.gentoo.org? I can't really remember seeing it once in a
>> Google query result...
>
> We want the site searchable.

>>> The majority of the expensive requests are related to package.mask
>>> and use.local.desc queries by crawlers. Like crawling the entire
>>> 13000 rev history for package.mask (or similar.)

Would it be feasible to use mod_rewrite to direct the most expensive requests to a static copy, which is re-generated every ${REASONABLE_TIMEFRAME}?

Best regards,
Chí-Thanh Christopher Nguyễn
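The "static copy, regenerated every ${REASONABLE_TIMEFRAME}" idea can be illustrated with a small time-based cache. Note that this is an in-process Python sketch, not a mod_rewrite configuration, and nothing here is taken from the actual setup: the class name StaticCopyCache, the render callback and the ttl_seconds parameter are all hypothetical.

```python
import time


class StaticCopyCache:
    """Keep one rendered copy per path and regenerate it after ttl_seconds.

    `render` stands in for the expensive page generation (an assumption for
    this sketch); `clock` is injectable so the expiry logic can be tested.
    """

    def __init__(self, render, ttl_seconds, clock=time.monotonic):
        self._render = render
        self._ttl = ttl_seconds
        self._clock = clock
        self._copies = {}  # path -> (time the copy was generated, body)

    def get(self, path):
        now = self._clock()
        entry = self._copies.get(path)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # copy is still fresh, skip the expensive render
        body = self._render(path)
        self._copies[path] = (now, body)
        return body
```

With this shape, however many crawler requests hit the same path, the expensive render runs at most once per ttl_seconds for that path.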
Re: [gentoo-dev] sources.gentoo.org instability
On Mon, Dec 5, 2011 at 3:48 AM, Andreas K. Huettel wrote:
>
> Seriously, what do we gain from crawlers accessing sources.gentoo.org?
> I can't really remember seeing it once in a Google query result...

We want the site searchable.

>
> Possibly it would not even be required to deny all requests, but just
> deny everything related to ancient history...
>
>> Hello,
>>
>> For a while sources.gentoo.org has been puttering along and its health
>> has slowly declined. We migrated it to some newer shiny hardware in an
>> attempt to mitigate the problem but that did not pan out. 90% (or
>> more) of sources.gentoo.org traffic is crawler bots and not actual
>> humans. That being said, if we cannot serve requests to the bots
>> within our timeouts we serve 500s instead, which is never really what
>> we want (particularly when we spent 20s of CPU to calculate 80% of the
>> response only to see the client time out :/.)
>>
>> The majority of the expensive requests are related to package.mask and
>> use.local.desc queries by crawlers. Like crawling the entire 13000 rev
>> history for package.mask (or similar.)
>>
>> While it is likely we will monkey patch viewvc to be less wasteful, in
>> the meantime I have removed use.local.desc from sources.gentoo.org
>> (and also anoncvs, because they share the same repo.) I hope this is a
>> short term (order of weeks) hack.
>>
>> -A
>
> --
> Andreas K. Huettel
> Gentoo Linux developer
> kde, sci, arm, tex, printing
Re: [gentoo-dev] sources.gentoo.org instability
Seriously, what do we gain from crawlers accessing sources.gentoo.org? I can't really remember seeing it once in a Google query result...

Possibly it would not even be required to deny all requests, but just deny everything related to ancient history...

> Hello,
>
> For a while sources.gentoo.org has been puttering along and its health
> has slowly declined. We migrated it to some newer shiny hardware in an
> attempt to mitigate the problem but that did not pan out. 90% (or
> more) of sources.gentoo.org traffic is crawler bots and not actual
> humans. That being said, if we cannot serve requests to the bots
> within our timeouts we serve 500s instead, which is never really what
> we want (particularly when we spent 20s of CPU to calculate 80% of the
> response only to see the client time out :/.)
>
> The majority of the expensive requests are related to package.mask and
> use.local.desc queries by crawlers. Like crawling the entire 13000 rev
> history for package.mask (or similar.)
>
> While it is likely we will monkey patch viewvc to be less wasteful, in
> the meantime I have removed use.local.desc from sources.gentoo.org
> (and also anoncvs, because they share the same repo.) I hope this is a
> short term (order of weeks) hack.
>
> -A

--
Andreas K. Huettel
Gentoo Linux developer
kde, sci, arm, tex, printing
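One possible reading of "deny everything related to ancient history" is a check on the revision a request asks for. The sketch below is a guess at how that could look and rests on several assumptions that the thread does not confirm: that the revision arrives as a `revision` query parameter holding a CVS-style number such as 1.123, that the current head revision is known, and that "ancient" means more than keep_last revisions behind head. The function name is_ancient_history and both parameters are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def is_ancient_history(url, head_minor, keep_last=50):
    """Return True when the URL asks for a revision far behind the head.

    Assumes a `revision=1.N` query parameter (an assumption of this sketch).
    Requests without a parsable revision are treated as current.
    """
    query = parse_qs(urlsplit(url).query)
    values = query.get("revision")
    if not values:
        return False
    try:
        # Take the last dotted component, e.g. "1.123" -> 123.
        minor = int(values[0].split(".")[-1])
    except ValueError:
        return False
    return minor <= head_minor - keep_last
```

A front end could pair such a check with the bot detection already in place, so that only crawler requests for old revisions are refused while humans can still browse the full history.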
[gentoo-dev] sources.gentoo.org instability
Hello,

For a while sources.gentoo.org has been puttering along and its health has slowly declined. We migrated it to some newer shiny hardware in an attempt to mitigate the problem but that did not pan out. 90% (or more) of sources.gentoo.org traffic is crawler bots and not actual humans. That being said, if we cannot serve requests to the bots within our timeouts we serve 500s instead, which is never really what we want (particularly when we spent 20s of CPU to calculate 80% of the response only to see the client time out :/.)

The majority of the expensive requests are related to package.mask and use.local.desc queries by crawlers. Like crawling the entire 13000 rev history for package.mask (or similar.)

While it is likely we will monkey patch viewvc to be less wasteful, in the meantime I have removed use.local.desc from sources.gentoo.org (and also anoncvs, because they share the same repo.) I hope this is a short term (order of weeks) hack.

-A