I munged the directory names intentionally, and unnecessarily :)

Basically it's a weird issue (our problem, but more likely caused by
naughty search spiders).

Our site has been structured like this:
http://www.pubcrawler.com/Template/index.cfm

/Template is our root directory

For some reason yet to be determined, we are getting a high number of
requests for pages like:
http://www.pubcrawler.com/template/index.cfm

(We found these invalid requests during an infrequent cleanup / bug
tracking session on our application server: the admin GUI kept showing
recurring live 404s for /template requests.)

We have scripts like:
http://www.pubcrawler.com/Template/ReviewWC.cfm/flat/BREWERID=107345

Which dummy spider(s) are munging into:
http://www.pubcrawler.com/Template/reviewwc.cfm/flat/BREWERID=107345

and sometimes:
http://www.pubcrawler.com/template/reviewwc.cfm/flat/BREWERID=107345

Most requests seem to be lowercased URLs.

I haven't checked on the originator of the requests yet, nor found the
source of the invalid URLs.

Fortunately, we only have maybe a dozen or two scripts with varied
case names and only one directory.

So working around this in the interim is a finite job :)
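
As a rough illustration of why the list is small enough to handle by hand,
here is a minimal Python sketch of the mapping idea. It is not what runs on
our server and not the regex from this thread: the CANONICAL_PATHS list and
the canonicalize() helper are made up for the example, and a real list would
be built from the actual files on disk.

```python
# Hypothetical list of correctly cased paths (one directory, a few scripts).
CANONICAL_PATHS = [
    "/Template",
    "/Template/index.cfm",
    "/Template/ReviewWC.cfm",
]

# Lookup table keyed by the lowercased form of each known path.
_BY_LOWER = {p.lower(): p for p in CANONICAL_PATHS}


def canonicalize(path):
    """Return path with its known prefix restored to the proper case.

    Anything after the matched prefix (e.g. /flat/BREWERID=107345) is
    left untouched. Unknown paths are returned unchanged.
    """
    lowered = path.lower()
    # Check the longest known prefix first so a script match wins over
    # the bare directory match.
    for key in sorted(_BY_LOWER, key=len, reverse=True):
        if lowered == key or lowered.startswith(key + "/"):
            return _BY_LOWER[key] + path[len(key):]
    return path


if __name__ == "__main__":
    print(canonicalize("/template/reviewwc.cfm/flat/BREWERID=107345"))
```

The same table could just as well be expressed as a handful of rewrite
rules in the web server; the point is only that a dozen or two entries
cover every mixed-case name we have.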

The big problem really is the sheer number of these requests that have
been 404'd since whoever it is began sending these malformed requests :)
We have tens of millions of pages of content, so it could be significant.

Interesting mystery --- and something that may be more interesting
when I track down the requester(s).

Jędrzej's regex worked 100% for me.  So part of the issue is solved.

It turned out that Varnish (our cache server) up front was caching these
as 404s (it shouldn't be), which made perfecting and testing the regex
manually impossible. Waiting patiently for Cherokee caching
functionality for balancer content :)  I love Varnish's speed, but
find it a pain to configure and regularly have to make the config file
more complicated.



On Sun, Feb 20, 2011 at 11:47 AM, Alvaro Lopez Ortega
<[email protected]> wrote:
> Hello there,
>
> On 20/02/2011, at 16:54, pub crawler wrote:
>
>> We have a high traffic problem.
>>
>> Have a directory:
>>
>> http://www.website.com/Directory/whatever.php
>>
>> Search spiders are going nuts requesting this (1000's of these wrong
>> requests a day)
>> http://www.website.com/directory/whatever.php
>>
>> How do I simply handle this transparently to internally redirect to
>> the proper /Directory subdirectory instead of the wrong /directory
>> subdirectory?
>>
>> (I am at a loss on the regex stuff - anyone with a useful
>> tool/builder/reference please recommend).
>
> I don't think I'm understanding what the problem is.  Neither of the previous 
> URLs is working, actually.
>
> Could you please clarify what the problem is? And, besides, how many of those 
> directories do you have?
>
> --
> Octality
> http://www.octality.com/
>
>
_______________________________________________
Cherokee mailing list
[email protected]
http://lists.octality.com/listinfo/cherokee
