RE: [Robots] Post
Sounds interesting. I'd love to see some screenshots of some community graphs and the main characters in them. Possible?

Otis

--- Nick Arnett [EMAIL PROTECTED] wrote:

As long as we're kicking around what's new, here's mine. I've been working on a system that finds topical Internet discussions (web forums, Usenet, mailing lists) and does some analysis of who's who, looking for the people who connect communities together, lead discussions, etc. At the moment it's focusing on Java developers. It's been quite interesting to see what it discovers in terms of how various subtopics are related and what other things Java developers tend to be interested in.

Regarding markup, etc., in the back of my mind I've had the notion of enhancing my spider to recognize how to parse and recurse forums and list archives, so that I don't have to write new code for every different forum or archiving format. But it's not something I'd be comfortable tossing out into the open, since it obviously would be a tool that spammers could use for address harvesting.

I'm essentially creating a toolbox with Python and MySQL, which I'm using to create custom information products for consulting clients. For the moment, those (obviously) are companies with a strong interest in Java.

Nick
--
Nick Arnett
Phone/fax: (408) 904-7198
[EMAIL PROTECTED]
Re: [Robots] Post
I think I remember those proposals, actually. I have never heard anyone mention them anywhere else, so I don't think anyone has implemented a crawler that looks for those new things in robots.txt.

Otis

--- Sean 'Captain Napalm' Conner [EMAIL PROTECTED] wrote:

Well, I was surprised to recently find that O'Reilly has mentioned me in their book _HTTP: The Definitive Guide_; it seems they mentioned my proposed draft extensions to the Robot Exclusion Protocol [1], although I'm not sure what they said about it (a friend actually found the reference in the O'Reilly book; I haven't had a chance to check it out myself---page 230 if anyone has it).

Does anyone know if any robots out there actually implement any of the proposals? It'd be interesting to know.

-spc (Six years since that was proposed ...)
[Robots] Re: SV: matching and UserAgent: in robots.txt
LWP? It's very popular, and the Perl community is big.

Otis

--- Rasmus Mohr [EMAIL PROTECTED] wrote:

Any idea how widespread the use of this library is? We've observed some weird behavior from some of the major search engines' spiders (basically ignoring robots.txt sections) - maybe this is the explanation?

--
Rasmus T. Mohr          Direct : +45 36 910 122
Application Developer   Mobile : +45 28 731 827
Netpointers Intl. ApS   Phone  : +45 70 117 117
Vestergade 18 B         Fax    : +45 70 115 115
1456 Copenhagen K       Email  : mailto:[EMAIL PROTECTED]
Denmark                 Website: http://www.netpointers.com

Remember that there are no bugs, only undocumented features.

-----Original message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On behalf of Sean M. Burke
Sent: 14 March 2002 11:08
To: [EMAIL PROTECTED]
Subject: [Robots] matching and UserAgent: in robots.txt

I'm a bit perplexed over whether the current Perl library WWW::RobotRules implements a certain part of the Robots Exclusion Standard correctly. So forgive me if this seems a simple question, but my reading of the Robots Exclusion Standard hasn't really cleared it up in my mind yet.

Basically, the current WWW::RobotRules logic is this: as a WWW::RobotRules object is parsing the lines in the robots.txt file, if it sees a line that says "User-Agent: ...foo...", it extracts the "...foo...", and if the name of the current user-agent is a substring of "...foo...", then it considers this line as applying to it. So if the agent being modeled is called "Banjo", and the robots.txt line being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then the library says: OK, 'Banjo' is a substring of 'Thing, Woozle, Banjo, Stuff', so this rule is talking to me!

However, the substring matching currently goes only one way. So if the user-agent object is called "Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]" and the robots.txt line being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then the library says: 'Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]' is NOT a substring of 'Thing, Woozle, Banjo, Stuff', so this rule is NOT talking to me!

I have the feeling that that's not right -- notably because it means that every robot's full ID string would have to appear in toto on the User-Agent robots.txt line, which is clearly a bad thing. But before I submit a patch, I'm tempted to ask... what /is/ the proper behavior? Maybe shave the current user-agent's name at the first slash or space (getting just "Banjo"), and then see if /that/ is a substring of a given robots.txt User-Agent: line?

--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/
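A minimal sketch of the matching rule Sean proposes (shave the agent name at the first slash or space, then do a case-insensitive substring check), written in Python for illustration since WWW::RobotRules itself is Perl; the function name and example strings are made up:

# Sketch of the proposed matching rule: truncate the crawler's full
# User-Agent string at the first slash or space, then check whether that
# short name appears (case-insensitively) in the robots.txt User-Agent line.
# Illustrative only -- not the actual WWW::RobotRules code.
import re

def agent_matches(full_agent: str, robots_ua_line: str) -> bool:
    # "Banjo/1.1 [http://nowhere.int/banjo.html ...]" -> "Banjo"
    short_name = re.split(r"[/ ]", full_agent.strip(), maxsplit=1)[0]
    return short_name.lower() in robots_ua_line.lower()

# agent_matches("Banjo/1.1 [http://nowhere.int/banjo.html]",
#               "Thing, Woozle, Banjo, Stuff")   -> True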
[Robots] Re: better language for writing a Spider ?
> I am working on a robot development, in Java. We are developing a search
> engine; almost the complete engine is developed. We used Java for the
> development, but the performance of the Java API in fetching web pages is
> too low, so we basically developed our own URL connection, as some features
> like timeouts are not supported by the java.net.URLConnection API.

Look at Java 1.4, it addresses these issues (socket timeouts, non-blocking IO, etc.).

> Though there are better spiders in Java, like Mercator, we could not
> achieve better performance with our product.

I thought Mercator numbers were pretty good, no?

> Now, as the performance is low, we want to redevelop our spider in a
> language like C or Perl and use it with our existing product.

You could look at Python; Ultraseek was/is written in it, from what I remember. Also, obviously, Perl has been used for writing big crawlers, so you can use that, too.

> I will be thankful if anyone can help me choose a better language where I
> can get better performance.

Of course, the choice of a language is not a performance panacea.

Otis
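For what it's worth, here is a minimal Python sketch of the kind of per-request timeout the original poster was missing in java.net.URLConnection; the URL and timeout value are placeholders, not anything from the thread:

# Minimal sketch of a page fetch with an explicit timeout. Standard-library
# Python only; the URL below is just a placeholder.
import urllib.request
import urllib.error

def fetch(url: str, timeout_seconds: float = 10.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.read()
    except (urllib.error.URLError, TimeoutError) as exc:
        # Covers both connect errors and read timeouts.
        print(f"fetch failed for {url}: {exc}")
        return None

# fetch("http://example.com/") returns the page body, or None on error/timeout.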
[Robots] Re: Perl and LWP robots
Excellent. I have a copy of Wong's book at home and like that topic (i.e., I'm a potential customer :)). When will it be published?

I think lots of people do want to know about recursive spiders, and I bet some of the most frequent obstacles are issues like queueing, depth- vs. breadth-first crawling, (memory-)efficient storage of extracted and crawled links, etc. I think that if you covered those topics well, lots of people would be very grateful.

Thank you for asking, I hope this helps.

Otis

--- Sean M. Burke [EMAIL PROTECTED] wrote:

Hi all! My name is Sean Burke, and I'm writing a book for O'Reilly, which is basically to replace Clinton Wong's now out-of-print /Web Client Programming with Perl/. In my book draft so far, I haven't discussed actual recursive spiders (I've only discussed getting a given page, and then every page that it links to which is also on the same host), since I think that most readers who think they want a recursive spider really don't. But it has been suggested that I cover recursive spiders, just for the sake of completeness.

Aside from basic concepts (don't hammer the server; always obey robots.txt; don't span hosts unless you are really sure that you want to), are there any particular bits of wisdom that list members would want me to pass on to my readers?

--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/
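To illustrate the queueing point, here is a rough Python sketch of a breadth-first frontier with a visited set; fetch_links() is a hypothetical helper (download a page, return the absolute URLs it links to), not part of LWP or any library discussed here:

# Sketch of the queueing issues mentioned above: a breadth-first frontier
# (FIFO deque) plus a visited set so each URL is fetched and stored only once.
from collections import deque

def crawl(seed_url: str, max_pages: int = 100):
    frontier = deque([seed_url])   # FIFO -> breadth-first crawl order
    visited = set()                # keeps each URL from being fetched twice

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):      # hypothetical fetch/parse helper
            if link not in visited:
                frontier.append(link)
    return visited

A real crawler would also have to move the frontier and visited set to disk once they outgrow RAM, which is exactly the kind of obstacle mentioned above.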
[Robots] Re: Correct URL, slash at the end ?
> > The above is just for consideration if the robots.txt standard is ever
> > updated, so the robots could be informed of this little detail.
>
> There was a push in '96 or '97 to update the robots.txt standard and I
> wrote a proposal back then (http://www.conman.org/people/spc/robots2.html),
> and while I still get the occasional email about it, to my knowledge no
> robot has implemented it (some portions perhaps, but not everything). I
> only mention this because it was attempted before.

Yes, that's the problem. Robot writers don't care for robots.txt improvements (it would just slow their robots down), and the people in charge of maintaining robots.txt rarely/never hear about ideas discussed on this list. The push has to come from the latter group, I believe.

I'm writing a simple little thing for indexing my Opera bookmarks which includes a crawler component, so if I have time I'll try implementing as much of the stuff mentioned in your proposal as I can, just for fun (some fun).

Otis (where is my Turkey?) Gospodnetic
[Robots] Re: Data structures for crawlers?
Hello,

Yes, everything you said is fine. I just wanted to write 'custom data structures' and code to handle large amounts of data by flexibly keeping it either in RAM or on disk, instead of using a regular RDBMS for storing that data, as Webbase does.

Otis

--- Corey Schwartz [EMAIL PROTECTED] wrote:

Implementing a FIFO queue will certainly work for the crawler but is not friendly toward the websites being crawled. Using a FIFO queue, as you mentioned, means that you are doing a breadth-first search through the site. It is very likely that you will send hundreds of page requests to the same server in a very short amount of time. Depending on how you design your data structures, you should be able to record the time of the last request for a page from any particular server and pace the requests so that you don't request more than a page every few minutes.

It sounds like you are implementing this as a recursive call to a crawl function. It seems to me that you should parse the URL into a scheme/server/path/filename/port and store all of that information in a database of your choice, along with other important data such as the number of times you've visited a site, when the last visit was made, whether the site is still active, etc.

Corey
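A rough Python sketch of the per-host pacing Corey describes, assuming an in-memory frontier and a made-up two-minute delay; all names here are hypothetical, not from any particular crawler:

# Remember when each server was last hit and only dequeue a URL once its host
# has been idle long enough. Illustrative only; the threshold is arbitrary.
import time
from collections import deque
from urllib.parse import urlparse

MIN_DELAY = 120.0            # seconds between requests to the same host

frontier = deque()           # URLs waiting to be fetched
last_request = {}            # host -> time of the last request to that host

def next_polite_url():
    """Pop the first URL whose host we haven't hit within MIN_DELAY."""
    for _ in range(len(frontier)):
        url = frontier.popleft()
        host = urlparse(url).netloc
        if time.time() - last_request.get(host, 0.0) >= MIN_DELAY:
            last_request[host] = time.time()
            return url
        frontier.append(url)  # too soon for this host; push it to the back
    return None               # nothing is ready to be fetched yet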
[Robots] Re: Search Engine Spiders and Cookies
Hello,

Web 'spiders' act like regular web clients. Depending on the spider implementation, they may accept cookies, store them, and send them back to sites that set them, or they may just completely ignore them. There is no single answer.

If you do not want spiders to index your sites, there are a few 'standard' and proper ways of doing it. For information about the Robot Exclusion Standard, see www.robotstxt.org.

Otis

--- Dave Watson [EMAIL PROTECTED] wrote:

Hello there,

Does anybody have any information on the specific difficulties search engine spiders have with cookies and what methods can be used to circumvent the problems they face?

I am a marketing consultant working with a range of clients who use log file analysis tools such as WebTrends and RedSheriff. Both of these tools (and others) use (or at least have an option to use) cookies in order to track unique visitors to web sites. My understanding is that search engine spiders cannot accept cookies and consequently will not trawl a site that serves cookies to them. I understand that cloaking is a POSSIBLE solution - however, this in itself raises a few questions for me:

1) If cloaking is an acceptable method to get round the problem of sending away a search engine spider (and consequently not having pages within a site indexed), how should this technology be implemented and what pitfalls need to be considered?

2) Most search engines state clearly that they ban sites altogether from their indexes if they discover cloaking has been used. Are there any other methods that are preferable?

The manufacturers of both WebTrends and RedSheriff have not provided me with any information on this matter, so I would like to raise the issues highlighted in any responses so that they can respond formally. I have to say that although I have a technical background, I am not a programmer or software developer - but I have been learning a lot simply by subscribing to this list - keep up the great work. However, I am looking for a reasonably technical answer (or where I might go to find one, please) so that I can pass on specific suggestions to my client, WebTrends and RedSheriff.

Thanks for any help you can give.

Regards,
Dave Watson.
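To make the point above concrete, here is a small standard-library Python sketch of the two spider behaviors: a client that ignores cookies entirely versus one that stores and replays them; the URLs are placeholders:

# A spider can simply ignore Set-Cookie headers, or it can keep a cookie jar
# and send cookies back on later requests. Both clients below are sketches.
import urllib.request
from http.cookiejar import CookieJar

# 1) Cookie-ignoring client: the default opener stores nothing between requests.
plain_opener = urllib.request.build_opener()

# 2) Cookie-aware client: Set-Cookie headers from responses go into the jar and
#    are replayed automatically on subsequent requests to the same site.
jar = CookieJar()
cookie_opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar)
)

# cookie_opener.open("http://example.com/")       # first request receives cookies
# cookie_opener.open("http://example.com/page2")  # cookies are sent back here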
Re: Looking for a gatherer.
Add Larbin to that list.

--- Krishna N. Jha [EMAIL PROTECTED] wrote:

Look into webBase, pavuk, wget - there are some other similar free products out there. (I am not sure I fully understand/appreciate all your requirements, though; if you wish, you can clarify them to me.) We also have web crawlers which offer more flexibility - but are not free.

Hope that helps,
Krishna Jha

Mark Friedman wrote:

I am looking for a spider/gatherer with the following characteristics:

* Enables control of the crawling process by URL substring/regexp and the HTML context of the link.
* Enables control of the gathering (i.e. saving) process by URL substring/regexp, MIME type, other header information and, ideally, by some predicates on the HTML source.
* Some way to save page/document metadata, ideally in a database.
* Freeware, shareware or otherwise inexpensive would be nice.

Thanks in advance for any help.

-Mark