Re: [Robots] Googlebot complaint (anyone from Google reading?)
Fred Atkinson wrote:
> After reading this, I did a search on Google using the advanced section,
> listing all sites on one of my domains. They have listed URLs that I have
> explicitly blocked in my robots.txt file. It appears that Googlebot is not
> a well-behaved crawler.

It would be helpful if you included some examples. Just guessing: Google does include pages in search results that it has not actually crawled but has identified from links on other sites. You can recognize these by the fact that they do not show details, such as an extract from the page.

--
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
Re: [Robots] Testing a Web Crawler
White, Norman E. wrote:
> For example, how do I know if I am pulling in all the pages that I should?
> How do I know if I am correctly extracting all the links from each page?
> (Besides links on HTML pages, there are links in MS Word documents and
> other types of pages, some in somewhat different formats.)

I suggest testing with a small, well-defined set of test cases, ideally on an isolated local network (a single machine can easily simulate multiple hosts) so you don't risk hammering production Web sites during testing. Start with what you consider a representative selection of actual Web pages and store them locally, adapt them as needed to your environment, and see whether all the pages you have manually identified as ones that should be pulled in are also found by your robot. Since your code may reflect how you think about extracting links, ask someone else to do the manual analysis, or use other robots for comparison (link checkers such as linklint or Watchfire's Linkbot can be very useful). A small sketch of such a test case follows below.

> How do I know if my random selection of sites algorithm is working
> correctly?

How do you define correctness, that is, along which axes should the selection algorithm randomize?

--
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
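For example, a minimal sketch of one such test case, using HTML::LinkExtor to extract links from a locally stored page and comparing them against a hand-made list (the file name, base URL and expected links here are placeholders):

    use HTML::LinkExtor;

    # Parse one locally stored test page; with a base URL supplied,
    # extracted links are returned as absolute URIs.
    my $base = 'http://testhost.local/';
    my $p    = HTML::LinkExtor->new(undef, $base);
    $p->parse_file('testdata/page1.html');

    # Collect every link found, regardless of tag (a, img, link, ...).
    my %found;
    for my $link ($p->links) {
        my ($tag, %attr) = @$link;
        $found{$_} = 1 for values %attr;
    }

    # Compare against the links identified manually.
    my @expected = ("${base}about.html", "${base}images/logo.gif");
    for my $url (@expected) {
        print exists $found{$url} ? "found: $url\n" : "MISSING: $url\n";
    }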
Re: [Robots] links with blanks ?
In [EMAIL PROTECTED], Matthias Jaekle [EMAIL PROTECTED] writes:
> http://www.abc.de/los angeles/
> These links are coded: http://www.abc.de/los%20angeles/
> How do robots normally handle links like this? Do pages with blanks become
> indexed? What do Google and other big crawlers do?

As long as the spaces are correctly encoded, either as plus signs (in query strings) or as %20, the URLs are valid and should work with browsers and crawlers alike. URLs with spaces that are not encoded are not valid and only work in some browsers; crawlers most probably don't index those pages either.

--
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
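For illustration, a tiny sketch of encoding such a path segment with URI::Escape (the host and path are just examples):

    use URI::Escape qw(uri_escape);

    # Encode only the path segment; a space becomes %20.
    my $segment = uri_escape('los angeles');   # "los%20angeles"
    my $url     = "http://www.abc.de/$segment/";
    print "$url\n";                            # http://www.abc.de/los%20angeles/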
[Robots] Re: leading whitespace in robots.txt files
In [EMAIL PROTECTED], Sean M. Burke [EMAIL PROTECTED] writes:
>    User-agent: *
>    Disallow: /cgi-bin/
>    Disallow: /~mojojojo/misc/
>
> So I've changed it to this, and was about to submit it as a patch for the
> next LWP release:
>
>    /^\s*Disallow:\s*(.*)/i   # Silently forgive leading whitespace.
>
> But first, I thought I'd ask the list here: does anyone think this'd break
> anything?

The change should not break anything; files using leading whitespace for comments or some other obscure purpose do not comply with the specification anyway and will see varying results. However, since the standard is sufficiently clear on the correct format, I would rather opt not to support a non-standard format with leading whitespace, since developers will start relying on this feature and will complain that other, standards-compliant robots libraries don't support it (the infamous "my page works in Internet Explorer, so it cannot be broken" attitude). Rather than modifying the library, I would suggest that any application that wants to handle this content error gracefully strip leading whitespace prior to calling parse().

--
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
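A minimal sketch of that approach, assuming WWW::RobotRules as the parser (the robot name and URLs are placeholders):

    use LWP::Simple qw(get);
    use WWW::RobotRules;

    my $robots_url = 'http://www.example.com/robots.txt';
    my $content    = get($robots_url) || '';

    # Forgive the content error ourselves before handing the file to the
    # standards-compliant parser.
    $content =~ s/^[ \t]+//mg;

    my $rules = WWW::RobotRules->new('MyRobot/1.0');
    $rules->parse($robots_url, $content);

    print $rules->allowed('http://www.example.com/cgi-bin/test')
        ? "allowed\n" : "disallowed\n";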
[Robots] Re: Perl and LWP robots
In [EMAIL PROTECTED], Sean M. Burke [EMAIL PROTECTED] writes:
> Aside from basic concepts (don't hammer the server; always obey the
> robots.txt; don't span hosts unless you are really sure that you want to),
> are there any particular bits of wisdom that list members would want me to
> pass on to my readers?

Some thoughts (a short sketch illustrating a couple of these points follows below):

* Implement specifications fully, or at least recognize when your implementation reaches something it doesn't support.
  Example: some spiders cannot handle protocol-relative links like <a href="//www.foo.com/something">...</a>, which is a perfectly valid link that should preserve the current protocol, and instead access http://currentbase//www.foo.com/

* Identify yourself, set appropriate headers.
  Spiders should include a unique name and version number (for robots.txt) and contact information for the author (a _working_ web site or email address) in the user agent string. Sending valid Referer headers is helpful for understanding what a robot is doing, too; sending the author's homepage as the referrer usually is not.

* Don't make assumptions about the meaning of URLs.
  Example: http://www.foo.com/something and http://www.foo.com/something/ are not necessarily the same, nor is the former required to redirect to the latter. http://www.foo.com/ can return different things depending on parameters of the request, or other conditions (time of day, temperature, mood of the server) -- depending on the application, the spider should take variants of the same URL into account.

* Cache server responses when cacheable.
  At least locally during a run (I dislike spiders requesting 2000 copies of clear.gif), but preferably between runs, too (HTTP/1.1 cache control, Expires, ETag).

* Recognize loops (MD5 signatures are your friend, but recognize loops even when the content changes slightly).
  Example: appending /something or ?something to a URL often does not make any difference to what a web server returns; all it takes is a relative link on that page to construct an infinite URL chain, like
  http://www.foo.com/page.html/
  http://www.foo.com/page.html/otherpage/
  http://www.foo.com/page.html/otherpage/otherpage/
  http://www.foo.com/page.html/otherpage/otherpage/otherpage/
  http://www.foo.com/page.html/otherpage/otherpage/otherpage/otherpage/

* Expect and handle errors (expect the unexpected :-)).
  Badly coded content and links are common; expect that the code passed to the spider will not be perfect.

* Beware of suspicious links.
  Check URLs carefully before following a link, check for fully qualified hostnames, etc. Of course spiders are always run off perfectly managed and secured machines -- not.
  Example: http://localhost/cgi-bin/phf?... http://localhost/default.ida?... http://proxy/

--
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
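A rough sketch of a couple of these points (identifying the robot, resolving protocol-relative links, MD5-based duplicate detection); the names, addresses and URLs are placeholders, not a complete spider:

    use LWP::RobotUA;
    use URI;
    use Digest::MD5 qw(md5_hex);

    # Identify yourself: unique name/version plus a working contact address.
    my $ua = LWP::RobotUA->new(
        'ExampleBot/0.1 (http://www.example.com/bot.html)',
        'bot-admin@example.com');
    $ua->delay(1/60);    # at most one request per second -- don't hammer servers

    # Resolve links relative to the page they were found on; this also
    # handles protocol-relative links such as href="//www.foo.com/something".
    my $base = URI->new('http://www.example.com/dir/page.html');
    my $link = URI->new_abs('//www.foo.com/something', $base);
    print "$link\n";    # http://www.foo.com/something

    # Recognize duplicates and loops by content signature.
    my %seen;
    my $response = $ua->get('http://www.example.com/');
    if ($response->is_success) {
        my $sig = md5_hex($response->content);
        print "already seen this content\n" if $seen{$sig}++;
    }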
[Robots] Re: Anti-thesaurus proposal
Solon Edmunds wrote:
> So has anyone seen/done anything like
>
>   <div id="robots-txt-noindex-follow" class="robots"> {headers/footer/sidebars} </div>
>   <div id="robots-txt-noindex-nofollow" class="robots"> {a banner area} </div>
>   <div id="robots-txt-index-nofollow" class="robots"> { content for the index, but holds
>   looping links or dynamically generated links which are best navigated via the
>   statedataless sitemaps links. } </div>

The id attribute is defined as an ID and, as the name implies, must be unique within the document, so it cannot be used to mark multiple areas of the page as indexable.

--
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
[Robots] Re: Correct URL, slash at the end?
In [EMAIL PROTECTED], Matthias Jaekle [EMAIL PROTECTED] writes:
> I read about adding a slash at the end of URLs if there is no absolute path
> present. But what about paths ending in subdirectories (xyz)? A link to
> http://www.abc.de/xyz/ might be more correct than the link to
> http://www.abc.de/xyz
> But is there a way to find out whether somebody who wrote
> http://www.abc.de/xyz meant http://www.abc.de/xyz/ ?

Usually the server will send a redirect (301 or 302) for the first URL, unless the documents are equivalent. E.g. http://www.abc.de/xyz.html and http://www.abc.de/xyz.html/sometrackingcode could be the same page, but there is no way to tell whether these really refer to the same document (if the MD5 signature matches, chances are that you have found a duplicate, but otherwise two different URLs can always refer to two different documents, and a robot should not make any assumptions about how a URL is interpreted by the server).

--
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
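As a heuristic, a robot can check for the redirect explicitly and fall back to comparing content signatures; a sketch, with placeholder URLs:

    use LWP::UserAgent;
    use HTTP::Request;
    use Digest::MD5 qw(md5_hex);

    my $ua = LWP::UserAgent->new;
    $ua->agent('ExampleBot/0.1');

    # simple_request() does not follow redirects, so the 301/302 is visible.
    my $response = $ua->simple_request(
        HTTP::Request->new(GET => 'http://www.abc.de/xyz'));
    if ($response->is_redirect) {
        print "redirected to ", $response->header('Location'), "\n";
    }

    # Fallback: identical MD5 signatures suggest (but do not prove) a duplicate.
    my $a = $ua->get('http://www.abc.de/xyz');
    my $b = $ua->get('http://www.abc.de/xyz/');
    if ($a->is_success && $b->is_success
        && md5_hex($a->content) eq md5_hex($b->content)) {
        print "probably the same document\n";
    }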
Re: Redirect commands
In [EMAIL PROTECTED], Zwack, Melanie [EMAIL PROTECTED] writes:
> I would like to know how to create a robots.txt file to redirect the robots
> (since our home page has been deleted/moved). The current redirect page on
> the server has two redirect links, one for parents and one for kids.
> Apparently, I have heard there is a way to make a robots.txt file redirect
> from this sort of page.

There is no redirect option in robots.txt. Many robots will honor HTTP redirects (that is, status codes 301 and 302) and the ROBOTS meta tag (in your case probably NOINDEX,FOLLOW).

--
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
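For example (assuming an Apache server and placeholder file names), the redirect itself is configured on the server rather than in robots.txt, and the meta tag goes into the HTML head of the old page if it still exists:

    # Apache (mod_alias), e.g. in .htaccess: permanent redirect for the old home page
    Redirect 301 /oldhome.html http://www.example.com/newhome.html

    <!-- In the old page's HTML <head>: don't index this page, but follow its links -->
    <meta name="ROBOTS" content="NOINDEX,FOLLOW">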
Re: FW: Stemming and Wildcards in robots.txt files
In [EMAIL PROTECTED], Jonathan Knoll [EMAIL PROTECTED] writes:
> OK, then is there a way to create an internal wildcard?
>
>   User-agent: *
>   Disallow: /*/97
>   Disallow: /*/98

No, you will need to use <meta name="ROBOTS" content="NOINDEX"> in the HTML source for those robots that honor the meta tag; robots.txt does not provide a mechanism for this.

--
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
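A quick way to see that the "*" is taken literally by a standard parser, assuming WWW::RobotRules' plain prefix matching (the URLs are placeholders):

    use WWW::RobotRules;

    my $robots_txt = join "\n",
        'User-agent: *',
        'Disallow: /*/97',
        'Disallow: /*/98',
        '';

    my $rules = WWW::RobotRules->new('ExampleBot/0.1');
    $rules->parse('http://www.example.com/robots.txt', $robots_txt);

    # Prints "allowed": the rule only matches URLs whose path literally
    # starts with "/*/97", so the intended wildcard has no effect.
    print $rules->allowed('http://www.example.com/archive/97/index.html')
        ? "allowed\n" : "disallowed\n";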