> C.H. - cable is always static. Besides google spiders by website, not
> IP, so shouldn't matter if it was dynamic. All the .pdf are static
> content, so playing with expiry times doesn't seem to be a good idea to
> me? I haven't set any expiry anywhere.
My experience with using configured expiry came from the google spider
going nuts on the 1000+ small jpegs on one of my sites.. I put the expiry
parameters in an .htaccess file and it fixed the problem.. Same
situation, static files on a fixed IP with fixed DNS. ie: It fixed it,
but I don't know why it was broken. (I had bandwidth theft issues so I
abstracted the images, so I don't need the expiry config any more as it
happens.) I've put a rough sketch of that sort of config at the end of
this mail.

> [20060928T051227+1200] -|[EMAIL PROTECTED] 461240 200 - 201
> "/linux/presentation/Photo-Linux-part2.pdf" - <--- blah blah cropped --->

Links in the pdf being spidered? If you've got long wrapped urls in your
pdfs (like you have with that pdf) the google spider can get confused by
the line break and interpret the second line as a relative link, causing
it to re-spider the file. In fact the urls are a little munged when
viewing with adobe acrobat 7.0.5 as well: on page 7 the link to
www.drycreekphoto.com has no mouseover/link on its second line, and on
page 8 the second line of the sphoto.com link points at xrite.com
instead. Did you do the urls manually, or does Latex do them? (not a
latex user myself..)

> Unless apache isn't logging properly? Log the size of the file, not the
> number of bytes actually transferred?

What's the LogFormat statement you're using there? It doesn't look like
common log format. ie, what's the second 3-digit number after the HTTP
return code? (For comparison I've put the stock common log format at the
end of this mail too.)

Cheers,

Chris H.
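PS: The expiry config I mentioned was roughly along these lines - this is
from memory rather than the actual file (I've since removed it), and the
types and times here are just illustrative, so adjust to suit. It assumes
mod_expires is loaded:

    <IfModule mod_expires.c>
        # Tell clients (and the spider) the static files can be cached
        ExpiresActive On
        ExpiresByType image/jpeg "access plus 1 month"
        ExpiresByType image/gif "access plus 1 month"
        ExpiresByType application/pdf "access plus 1 week"
        ExpiresDefault "access plus 1 day"
    </IfModule>

That went into the .htaccess of the directory holding the files; if I
remember right you need AllowOverride Indexes (or All) for the Expires
directives to be accepted there.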
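PPS: For comparison, stock common log format looks like the first line
below. The second is only a sketch of how you could also log the bytes
actually sent on the wire - it assumes mod_logio is available, and the
"common_io" nickname is made up:

    # Common log format: %>s is the final status, %b the response body
    # size in bytes (excluding headers)
    LogFormat "%h %l %u %t \"%r\" %>s %b" common

    # With mod_logio, %O is the total bytes actually sent on the
    # connection, headers included
    LogFormat "%h %l %u %t \"%r\" %>s %b %O" common_io
    CustomLog logs/access_log common_io

In common log format the only field after the status is the byte count,
which is why that second 3-digit number in your log line has me curious.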
