Mark,

You have a web crawler (a.k.a. spider or bot) hitting your site; judging by
the "Java/1.6.0" user-agent, this one happens to be written in Java.

Why? It could simply be human or programming error. It could be an email
harvester of some sort (but since the requested paths look NPR-related, I
don't think that's it). Perhaps someone is trying to download a site for
offline viewing and has the wrong domain. There are hundreds of
possibilities, I suppose, some of them nefarious.

I'm pretty sure you can block bots with your .htaccess file, for example by
matching on the user-agent. You can also use a robots.txt file, but a lot of
bots don't follow the rules. I'd probably start by getting a list of all the
open-source Java spiders. I've never had to do this myself, so Google is
your friend.
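
Before blocking anything, it might help to measure the pattern first. Here is
a rough Python sketch that counts 404s from "Java/..." user-agents per IP and
per path. Assumptions on my part: your server writes the Apache "combined"
log format, and the function name and regex are my own, not from any
existing tool.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format. Adjust the pattern if yours differs.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?:\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def summarize_java_404s(lines):
    """Count 404 responses to Java/* user-agents, grouped by IP and by path.

    Hypothetical helper for illustration; lines that do not match the
    assumed log format are skipped silently.
    """
    by_ip, by_path = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue
        if match.group("status") == "404" and match.group("agent").startswith("Java/"):
            by_ip[match.group("ip")] += 1
            by_path[match.group("path")] += 1
    return by_ip, by_path
```

You could feed it an open log file (the path depends on your host) and look
at by_ip.most_common() and by_path.most_common(). If the same handful of
paths shows up across dozens of unrelated IPs, a user-agent rule is probably
a better fit than blocking individual addresses.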

Regards,

Kaffeen


On Thu, Jun 3, 2010 at 5:29 PM, Mark Phillip <[email protected]> wrote:

> Evening folks,
>
> I have pretty high expectations for the Refresh Austin list whenever I have
> a tough question, but I might have found one stump-worthy.
>
> A couple months ago I started seeing requests in my web server access log
> for "/ombudsman".  I don't have an Ombudsman page, so it returned a 404.
> Digging a little deeper, the same IP was repeatedly searching for the same
> set of non-existent pages on my site:
>
> /about/privacypolicy.html
> /about/termsofuse.html
> /audiohelp/progstream.html
> /blogs
> /corrections
> /email
> /help
> /help/communityfaq.html
> /music
> /ombudsman
> /podcast
>
> After a bit more digging, I realized that it wasn't coming from just one IP
> address.  Turns out there are dozens of IP addresses all requesting the same
> non-existent URLs.  Each IP is scattered across the globe without any common
> thread.  The only user-agent listed in each request is a member of the
> "Java/1.6.0" family.
>
> I am 100% stumped on this one.  All Googling for community-sourced
> Java-based search spiders comes up completely empty.
>
>
> Any thoughts?  Solve this and I'll buy you a beer on Tuesday.
>
>
>
>
> Thanks,
> Mark
> http://markphillip.com
>
>  --
> Our Web site: http://www.RefreshAustin.org/
>
> You received this message because you are subscribed to the Google Groups
> "Refresh Austin" group.
>
> [ Posting ]
> To post to this group, send email to [email protected]
> Job-related postings should follow http://tr.im/refreshaustinjobspolicy
> We do not accept job posts from recruiters.
>
> [ Unsubscribe ]
> To unsubscribe from this group, send email to
> [email protected]
>
> [ More Info ]
> For more options, visit this group at
> http://groups.google.com/group/Refresh-Austin
>



-- 
If you understand, things are just as they are. If you do not understand,
things are just as they are.

