Re: [Robots] Googlebot complaint (anyone from Google reading?)

2005-08-27 Thread Klaus Johannes Rusch

Fred Atkinson wrote:


 After reading this, I did a search on Google using the advanced
search, listing all pages on one of my domains.  They have listed URLs
that I have explicitly blocked in my robots.txt file.

   It appears that Googlebot is not a well-behaved crawler.
 


It would be helpful if you included some examples.

Just guessing, but Google does include pages in search results that it has 
not actually crawled but has identified based on links from other sites.
You can identify these by the fact that they do not show details, such 
as an extract from the page.


--
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/

___
Robots mailing list
Robots@mccmedia.com
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Testing a Web Crawler

2004-05-26 Thread Klaus Johannes Rusch
White, Norman E. wrote:
For example, how do I know if I am pulling in all the pages that I should?

How do I know if I am correctly extracting all the links from each page?
(Besides links in HTML pages, there are links in MS Word documents and other
types of pages, some in somewhat different formats.)
 

I suggest testing with a small, well-defined set of test cases, ideally 
on an isolated local network (a single machine can easily simulate 
multiple hosts) so you don't risk hammering production Web sites during 
testing.
Start with what you consider a representative selection of actual Web 
pages and store them locally, adapt them as needed to your environment, 
and check that all the pages you have manually identified as candidates 
are also found by your robot.
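
For illustration, a minimal sketch of such a local test setup in Perl,
assuming HTTP::Daemon (part of libwww-perl) is available; the port and
directory are placeholders:

    use strict;
    use warnings;
    use HTTP::Daemon;
    use HTTP::Status;

    # Serve local copies of the test pages on 127.0.0.1 so the crawler can
    # be exercised without touching any production site.
    my $d = HTTP::Daemon->new(LocalAddr => '127.0.0.1', LocalPort => 8080)
        or die "Cannot start test server: $!";
    print "Test server running at ", $d->url, "\n";

    while (my $c = $d->accept) {
        while (my $r = $c->get_request) {
            if ($r->method eq 'GET') {
                # Map the request path onto the local copy of the test pages.
                $c->send_file_response('./testpages' . $r->uri->path);
            } else {
                $c->send_error(RC_FORBIDDEN);
            }
        }
        $c->close;
        undef $c;
    }

Running several instances on different ports (or addresses) is an easy way
to simulate multiple hosts.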

Since your code may reflect how you think about extracting links, ask 
someone else to do the manual analysis, or use other robots for 
comparison (link checkers such as linklint or Watchfire's Linkbot can be 
very useful).

How do I know if my random selection of sites algorithm is working
correctly?

How do you define correctness? That is, along which axes should the
selection algorithm randomize?

--
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] links with blanks ?

2003-03-22 Thread Klaus Johannes Rusch
In [EMAIL PROTECTED], Matthias Jaekle [EMAIL PROTECTED] writes:
 http://www.abc.de/los angeles/
 These links are coded as: http://www.abc.de/los%20angeles/

 How do robots normally handle links like this? Do pages with blanks
 become indexed?
 What do Google and other big crawlers do?

As long as the spaces are correctly encoded, either as plus signs (in query
strings) or as %20, the URLs are valid and should work with browsers and
crawlers alike.

URLs with spaces that are not encoded are not valid and only work in some 
browsers. Crawlers most probably won't index those pages either.
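
For illustration, a minimal Perl sketch of encoding such a path segment,
assuming URI::Escape; the host and path are the ones from the question:

    use strict;
    use warnings;
    use URI::Escape qw(uri_escape uri_unescape);

    my $segment = 'los angeles';              # path segment containing a space
    my $encoded = uri_escape($segment);       # 'los%20angeles'
    my $url     = "http://www.abc.de/$encoded/";

    print "$url\n";                           # http://www.abc.de/los%20angeles/
    print uri_unescape($encoded), "\n";       # back to 'los angeles'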

-- 
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


[Robots] Re: leading whitespace in robots.txt files

2002-03-25 Thread Klaus Johannes Rusch


In [EMAIL PROTECTED], Sean M. Burke 
[EMAIL PROTECTED] writes:
 User-agent: *
  Disallow: /cgi-bin/
  Disallow: /~mojojojo/misc/
 
 So I've changed it to this, and was about to submit it as a patch for the
 next LWP release:
/^\s*Disallow:\s*(.*)/i
# Silently forgive leading whitespace.
 
 But first, I thought I'd ask the list here: does anyone think this'd break
 anything? 

The change should not break anything; files using leading whitespace for 
comments or some other obscure purpose do not comply with the specification 
anyway and will see varying results.

However, since the standard is sufficiently clear on the correct format, I 
would rather opt not to support a non-standard format with leading whitespace, 
since developers will start relying on this feature and will complain that 
other, standards-compliant robots libraries don't support it (the infamous 
"my page works in Internet Explorer so I cannot be broken" attitude).

Rather than modifying the library, I would suggest that any application that 
wants to handle this content error gracefully strip leading whitespace prior 
to calling parse().
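
For illustration, a minimal sketch of that workaround in Perl, assuming
WWW::RobotRules as the parser; the robot name and URLs are placeholders:

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use WWW::RobotRules;

    my $robots_url = 'http://www.example.com/robots.txt';
    my $robots_txt = get($robots_url) || '';

    # Silently forgive leading whitespace before handing the file to the
    # standard, unpatched parser.
    $robots_txt =~ s/^[ \t]+//mg;

    my $rules = WWW::RobotRules->new('ExampleBot/1.0');
    $rules->parse($robots_url, $robots_txt);

    print $rules->allowed('http://www.example.com/cgi-bin/test')
        ? "allowed\n" : "disallowed\n";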

--
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/




[Robots] Re: Perl and LWP robots

2002-03-07 Thread Klaus Johannes Rusch


In [EMAIL PROTECTED], Sean M. Burke 
[EMAIL PROTECTED] writes:
 Aside from basic concepts (don't hammer the server; always obey the
 robots.txt; don't span hosts unless you are really sure that you want to),
 are there any particular bits of wisdom that list members would want me to
 pass on to my readers?

Some thoughts:


* Implement specifications fully, or at least recognize when your 
implementation reaches something it doesn't support

Example:

Some spiders cannot handle protocol-preserving links like

    <a href="//www.foo.com/something">...</a>

which is a perfectly valid link that should preserve the current protocol, and 
instead access http://currentbase//www.foo.com/
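
For illustration, a minimal Perl sketch of resolving such a network-path
reference against the page it was found on, assuming the URI module; the base
URL is a placeholder:

    use strict;
    use warnings;
    use URI;

    my $base = 'http://www.example.org/dir/page.html';  # page the link was found on
    my $link = '//www.foo.com/something';                # protocol-preserving link

    my $abs = URI->new_abs($link, $base);
    print "$abs\n";    # prints http://www.foo.com/something (current scheme kept)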


* Identify yourself, set appropriate headers

Spiders should include a unique name and version number (for robots.txt), and 
contact information for the author (a _working_ web site or email address) in 
the user agent string.

Sending valid Referer headers is helpful for understanding what a robot is 
doing, too; sending the author's homepage as the referrer usually is not.
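
For illustration, a minimal Perl sketch with LWP::UserAgent; the robot name,
contact address and URLs are placeholders (LWP::RobotUA would additionally
honor robots.txt and a per-host request delay):

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    $ua->agent('ExampleBot/1.2 (+http://www.example.com/bot.html)');
    $ua->from('robot-admin@example.com');    # sent as the From: request header

    # Pass the page the link was found on as the Referer.
    my $response = $ua->get('http://www.example.com/page.html',
                            Referer => 'http://www.example.com/index.html');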


* Don't make assumptions about the meaning of URLs

Example:

http://www.foo.com/something and http://www.foo.com/something/ are not 
necessarily the same, nor is the former required to redirect to the latter.

http://www.foo.com/ can return different things depending on parameters of the 
request or other conditions (time of day, temperature, mood of the server); 
depending on the spider's application, take variants of the same URL into 
account.


* Cache server responses when cacheable

At least locally during a run (I dislike spiders requesting 2000 copies of
clear.gif), but preferably between runs, too (HTTP/1.1 cache control, Expires, 
ETag).
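
For illustration, a minimal Perl sketch of conditional requests between runs;
the in-memory hash stands in for whatever persistent store the robot uses:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my %cache;    # url => { etag => ..., body => ... }
    my $ua = LWP::UserAgent->new(agent => 'ExampleBot/1.2');

    sub fetch {
        my ($url) = @_;
        my @conditional;
        push @conditional, ('If-None-Match' => $cache{$url}{etag})
            if $cache{$url} && $cache{$url}{etag};

        my $response = $ua->get($url, @conditional);

        # 304 Not Modified: reuse the copy from the previous run.
        return $cache{$url}{body} if $response->code == 304;

        if ($response->is_success) {
            $cache{$url} = {
                etag => scalar $response->header('ETag'),
                body => $response->content,
            };
            return $cache{$url}{body};
        }
        return undef;
    }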


* Recognize loops (MD5 signatures are your friend, but recognize loops even 
when the content changes slightly)

Example:

Appending /something or ?something to a URL often does not make any difference
to what a web server returns; all it takes is a relative link on that page to 
construct an infinite URL chain, like

http://www.foo.com/page.html/
http://www.foo.com/page.html/otherpage/
http://www.foo.com/page.html/otherpage/otherpage/
http://www.foo.com/page.html/otherpage/otherpage/otherpage/
http://www.foo.com/page.html/otherpage/otherpage/otherpage/otherpage/
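
For illustration, a minimal Perl sketch of the digest check, assuming
Digest::MD5:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    my %seen_digest;    # md5 of page content => first URL it was seen at

    # Returns true if this content has already been fetched under another URL.
    sub is_duplicate {
        my ($url, $content) = @_;
        my $digest = md5_hex($content);
        return 1 if exists $seen_digest{$digest};
        $seen_digest{$digest} = $url;
        return 0;
    }

An exact digest only catches byte-identical pages; content that changes
slightly (a timestamp or hit counter) needs a fuzzier comparison, as noted
above.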


* Expect and handle errors (expect the unexpected :-))

Badly coded content and links are common; expect that the markup passed to the 
spider will not be perfect.


* Beware of suspicious links

Check URLs carefully before following a link; check for fully qualified 
hostnames, etc.  Of course, spiders are always run off perfectly managed and 
secured machines -- not.

Example:

http://localhost/cgi-bin/phf?...
http://localhost/default.ida?...
http://proxy/
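
For illustration, a minimal Perl sketch of such a check, assuming the URI
module; the accept/reject policy is only illustrative:

    use strict;
    use warnings;
    use URI;

    # Reject link targets that do not look like fully qualified public hosts.
    sub looks_safe {
        my ($url) = @_;
        my $uri = URI->new($url);
        return 0 unless $uri->scheme && $uri->scheme =~ /^https?$/;
        my $host = $uri->host or return 0;
        return 0 if lc($host) eq 'localhost' || $host eq '127.0.0.1';
        return 0 unless $host =~ /\./;    # single-label names like "proxy"
        return 1;
    }

    print looks_safe($_) ? "fetch: $_\n" : "skip:  $_\n"
        for ('http://localhost/cgi-bin/phf?x',
             'http://proxy/',
             'http://www.foo.com/');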




-- 
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/

--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send help in the body of a message 
to [EMAIL PROTECTED].



[Robots] Re: Anti-thesaurus proposal

2001-11-29 Thread Klaus Johannes Rusch


Solon Edmunds wrote:
 
 So has anyone seen/done anything like
 
 <div id="robots-txt-noindex-follow" class="robots">
 {headers/footers/sidebars} </div>
 
 <div id="robots-txt-noindex-nofollow" class="robots">
 {a banner area}
 </div>
 
 <div id="robots-txt-index-nofollow" class="robots">
 {content for the index, but holds looping links or dynamically generated
 links which are best navigated via the statedataless sitemaps links.}
 </div>

The id attribute is defined as an ID and, as the name implies, must be unique
within a document, so it cannot be used to mark multiple areas of the page as
indexable.

-- 
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/

--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send help in the body of a message 
to [EMAIL PROTECTED].




[Robots] Re: Correct URL, shlash at the end ?

2001-11-21 Thread Klaus Johannes Rusch


In [EMAIL PROTECTED], Matthias Jaekle [EMAIL PROTECTED] writes:
 I read about adding a slash at the end of URLs if there is no
 absolute path present.
 
 But what about paths ending in subdirectories (xyz)?
 A link to http://www.abc.de/xyz/ might be more correct than the link
 to http://www.abc.de/xyz
 
 But is there a possibility to find out whether somebody who wrote
 http://www.abc.de/xyz meant http://www.abc.de/xyz/ ?

Usually the server will send a 301 or 302 redirect for the first URL, unless 
the documents are equivalent.

E.g. http://www.abc.de/xyz.html and http://www.abc.de/xyz.html/sometrackingcode
could be the same page, but there is no way to tell if these are really 
referring to the same document (if the MD5 signature matches, chances are that 
you have found a duplicate, but otherwise two different URLs can always refer 
to two different documents, and a robot should not make any assumptions about 
how a URL is interpreted by the server).
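
For illustration, a minimal Perl sketch of checking how the server itself
answers the slash-less URL, assuming LWP::UserAgent; the URL is the one from
the question:

    use strict;
    use warnings;
    use LWP::UserAgent;

    # Request the URL without following redirects, so we can see whether the
    # server maps it onto the slashed form itself.
    my $ua = LWP::UserAgent->new(max_redirect => 0);
    my $response = $ua->get('http://www.abc.de/xyz');

    if ($response->is_redirect) {
        print "server redirects to ", $response->header('Location'), "\n";
    } else {
        print "no redirect; treat the two URLs as distinct\n";
    }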

-- 
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/

--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send help in the body of a message 
to [EMAIL PROTECTED].




Re: Redirect commands

2000-04-19 Thread Klaus Johannes Rusch

In [EMAIL PROTECTED], Zwack, Melanie 
[EMAIL PROTECTED] writes:
 I would like to know how to create a robots.txt file to redirect the robots
 (since our home page has been deleted/moved). The current redirect page that
 is on the server has two redirect links: one for parents and one for kids.
 Apparently, I have heard there is a way to make a robots.txt file redirect
 from this sort of page.

There is no redirect option in robots.txt.

Many robots will honor HTTP redirects (that is, status codes 301 and 302) and
the ROBOTS meta tag (in your case probably NOINDEX,FOLLOW).

Klaus Johannes Rusch
--
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/




Re: FW: Stemming and Wildcards in robots.txt files

2000-03-15 Thread Klaus Johannes Rusch

In [EMAIL PROTECTED], Jonathan Knoll 
[EMAIL PROTECTED] writes:
 OK, then is there a way to create an internal wildcard?

 User-agent: *
 Disallow: /*/97
 Disallow: /*/98

No, you will need to use

<meta name="ROBOTS" content="NOINDEX">

in the HTML source for those robots that honor the meta tag. robots.txt does
not provide a mechanism for this.

Klaus Johannes Rusch
--
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/