Re: [Robots] Post

2002-11-08 Thread Otis Gospodnetic
I think I remember those proposals, actually. I have never heard anyone mention them anywhere else, so I don't think anyone has implemented a crawler that looks for those new things in robots.txt. Otis --- Sean 'Captain Napalm' Conner <[EMAIL PROTECTED]> wrote: > Well, I was surprised to recen...

RE: [Robots] Post

2002-11-08 Thread Otis Gospodnetic
Sounds interesting. I'd love to see some screenshots of some community graphs and main characters in it... possible? Otis --- Nick Arnett <[EMAIL PROTECTED]> wrote: > As long as we're kicking around what's new, here's mine. I've been working on a system that finds topical Internet discussion...

[Robots] Re: better language for writing a Spider ?

2002-03-14 Thread Otis Gospodnetic
> I am working on robot development, in Java. We are developing a search engine... almost the complete engine is developed. We used Java for the development... but the performance of the Java API in fetching web pages is too low, so basically we developed our own URLConnection, as...
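
A rough idea of the kind of fix being described, as a sketch in modern Java (the class name, timeout values, and User-Agent string are made up for illustration, not from the thread). Setting explicit connect/read timeouts is one common cure for "slow" fetching with the stock URL API, whose defaults can block indefinitely on stalled servers:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class PageFetcher {
        // Fetch a page with explicit timeouts; unbounded blocking reads
        // are a frequent cause of poor fetch throughput.
        public static String fetch(String address) throws Exception {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(address).openConnection();
            conn.setConnectTimeout(5000);   // fail fast on dead hosts
            conn.setReadTimeout(10000);     // don't hang on stalled servers
            conn.setRequestProperty("User-Agent", "ExampleCrawler/0.1"); // identify the robot
            StringBuilder sb = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line).append('\n');
                }
            }
            return sb.toString();
        }
    }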

[Robots] Re: SV: matching and "UserAgent:" in robots.txt

2002-03-14 Thread Otis Gospodnetic
LWP? Very popular in the Perl community. --- Rasmus Mohr <[EMAIL PROTECTED]> wrote: > Any idea how widespread the use of this library is? We've observed some weird behaviors from some of the major search engines' spiders (basically ignoring robots.txt sections) - maybe this is the...
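
Since the thread is about spiders that ignore robots.txt sections, here is a deliberately small sketch, in Java, of the prefix-matching rule from the original robots.txt standard. The record handling is simplified (it treats all matching records cumulatively, which is conservative) and real files are messier, so this only illustrates the rule under discussion:

    import java.util.ArrayList;
    import java.util.List;

    public class RobotsCheck {
        // Returns false if any Disallow prefix from a record matching
        // `agent` (or "*") matches the start of `path`.
        public static boolean allowed(String robotsTxt, String agent, String path) {
            List<String> disallow = new ArrayList<>();
            boolean matching = false;
            for (String raw : robotsTxt.split("\n")) {
                String line = raw.split("#", 2)[0].trim();   // strip comments
                if (line.isEmpty()) { matching = false; continue; } // blank line ends a record
                int colon = line.indexOf(':');
                if (colon < 0) continue;                     // not a field line
                String field = line.substring(0, colon).trim().toLowerCase();
                String value = line.substring(colon + 1).trim();
                if (field.equals("user-agent")) {
                    matching = matching || value.equals("*")
                            || agent.toLowerCase().contains(value.toLowerCase());
                } else if (field.equals("disallow") && matching && !value.isEmpty()) {
                    disallow.add(value);   // empty Disallow means "allow all", so skip it
                }
            }
            for (String prefix : disallow) {
                if (path.startsWith(prefix)) return false;   // plain prefix match
            }
            return true;
        }
    }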

[Robots] Re: Perl and LWP robots

2002-03-07 Thread Otis Gospodnetic
Excellent. I have a copy of Wong's book at home and like that topic (i.e., I'm a potential customer :)). When will it be published? I think lots of people do want to know about recursive spiders, and I bet the most frequent obstacles are issues like queueing and depth- vs. breadth-first crawl...
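
For the queueing and depth- vs. breadth-first question, the classic trick is that one deque covers both orders: breadth-first if you enqueue at the tail, depth-first if you enqueue at the head. A minimal sketch in Java (class and method names are invented for illustration):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    public class Frontier {
        private final Deque<String> queue = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();  // dedupe: never revisit a URL
        private final boolean breadthFirst;

        public Frontier(boolean breadthFirst) { this.breadthFirst = breadthFirst; }

        public void add(String url) {
            if (seen.add(url)) {                        // only enqueue unseen URLs
                if (breadthFirst) queue.addLast(url);   // FIFO -> breadth-first
                else queue.addFirst(url);               // LIFO -> depth-first
            }
        }

        public String next() { return queue.pollFirst(); } // null when the frontier is empty
    }

A crawl loop then just alternates next(), fetch, extract links, add(link) until next() returns null.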

[Robots] Re: Correct URL, slash at the end?

2001-11-22 Thread Otis Gospodnetic
> > The above is just for consideration if the robots.txt is ever updated so the robots could be informed of this little detail. > There was a push in '96 or '97 to update the robots.txt standard and I wrote a proposal back then (http://www.conman.org/people/spc/robots2.html...

[Robots] Re: Correct URL, slash at the end?

2001-11-22 Thread Otis Gospodnetic
> - - - - - Crazy thought... This is where the robots.txt file could be used to hold that information for the robot agents that need to know the operational order of the "/" default names used on that service: User-agent: * Slash: default.htm, default.html, index.htm, in...
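
The Slash: field above was only a proposal floated on this list and never became part of any robots.txt standard, but if a crawler did want to honor it, the parsing would be trivial. A hypothetical sketch in Java:

    import java.util.Arrays;
    import java.util.List;

    public class SlashDirective {
        // Parse the proposed (non-standard) "Slash:" line into an ordered
        // list of default filenames to try for "/".
        public static List<String> parse(String line) {
            String value = line.substring(line.indexOf(':') + 1).trim();
            return Arrays.asList(value.split("\\s*,\\s*")); // split on commas, trimming spaces
        }
        // parse("Slash: default.htm, default.html, index.htm")
        //   -> [default.htm, default.html, index.htm]
    }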

[Robots] Re: Data structures for crawlers?

2001-06-27 Thread Otis Gospodnetic
Hello, Yes, everything you said is fine. I just wanted to write 'custom data structures' and code to handle large amounts of data by flexibly keeping it either in RAM or on disk, instead of using a regular RDBMS for storing that data, like Webbase does. Otis --- Corey Schwartz <[EMAIL PROTECTED...
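
One way to read "flexibly keeping it either in RAM or on disk" is a bounded in-memory queue that spills overflow to a flat file and replays it later, with no RDBMS involved. This is only a guess at the idea, not Otis's actual design; a toy sketch in Java, where ordering is only roughly FIFO:

    import java.io.*;
    import java.util.ArrayDeque;
    import java.util.Deque;

    public class SpillingQueue {
        private final Deque<String> ram = new ArrayDeque<>();
        private final int ramLimit;
        private final File spill;
        private final PrintWriter writer;
        private long consumed = 0;   // lines of the spill file already replayed

        public SpillingQueue(File spill, int ramLimit) throws IOException {
            this.spill = spill;
            this.ramLimit = ramLimit;
            this.writer = new PrintWriter(new FileWriter(spill, true)); // append mode
        }

        public void add(String url) {
            if (ram.size() < ramLimit) ram.addLast(url);
            else { writer.println(url); writer.flush(); }  // overflow goes to disk
        }

        public String next() throws IOException {
            if (ram.isEmpty()) refill();
            return ram.pollFirst();  // null once both RAM and disk are drained
        }

        private void refill() throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader(spill))) {
                for (long i = 0; i < consumed; i++) in.readLine(); // skip replayed lines
                String line;
                while (ram.size() < ramLimit && (line = in.readLine()) != null) {
                    ram.addLast(line);
                    consumed++;
                }
            }
        }
    }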

[Robots] Re: Search Engine Spiders and Cookies

2001-06-17 Thread Otis Gospodnetic
Hello, Web 'spiders' act like regular web clients do. Depending on the implementation, a spider may accept cookies, store them, and send them back to the sites that set them, or it may ignore them completely. There is no single answer. If you do not want spiders to index your sites, there ar...
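
Both behaviors described above are easy to demonstrate with the standard java.net cookie classes; this sketch is not tied to any particular spider, and the URL is a placeholder:

    import java.net.CookieHandler;
    import java.net.CookieManager;
    import java.net.CookiePolicy;
    import java.net.URL;
    import java.net.URLConnection;

    public class CookieDemo {
        public static void main(String[] args) throws Exception {
            // Accept-and-replay behavior: the JVM-wide CookieManager stores
            // Set-Cookie headers and sends them back on later requests.
            CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));

            // To behave like a cookie-ignoring spider, use ACCEPT_NONE instead:
            // CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_NONE));

            URLConnection first = new URL("http://example.com/").openConnection();
            first.getInputStream().close();   // response cookies are captured here
            URLConnection second = new URL("http://example.com/").openConnection();
            second.getInputStream().close();  // stored cookies are sent back here
        }
    }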

[Robots] Re: Data structures for crawlers?

2001-06-17 Thread Otis Gospodnetic
Yes, I read the Mercator paper multiple times :) I was hoping for more concrete suggestions, but I guess people don't want to share any knowledge :( Thanks, Otis P.S. Nick, the list maintainer: whenever I hit Reply All, I end up with multiple [EMAIL PROTECTED] addresses on the To line. It could b...

[Robots] Data structures for crawlers?

2001-06-13 Thread Otis Gospodnetic
Hello, Some members of this list have probably had to write something like this before. I am trying to create/pick a data structure that will allow me to store a fixed number of unique hostnames (call it a HostData object), each of which has a fixed-size list of unique URLs associated with it. For...
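
One plausible shape for this, sketched in Java: a bounded map from hostname to a bounded set of URLs, where set semantics handle uniqueness. The class name and limits are illustrative, not from the post:

    import java.util.LinkedHashMap;
    import java.util.LinkedHashSet;
    import java.util.Map;
    import java.util.Set;

    public class HostTable {
        private final int maxHosts;         // fixed number of hostnames
        private final int maxUrlsPerHost;   // fixed-size URL list per host
        private final Map<String, Set<String>> hosts = new LinkedHashMap<>();

        public HostTable(int maxHosts, int maxUrlsPerHost) {
            this.maxHosts = maxHosts;
            this.maxUrlsPerHost = maxUrlsPerHost;
        }

        // Returns true if the URL was stored; duplicates and overflow are refused.
        public boolean add(String hostname, String url) {
            Set<String> urls = hosts.get(hostname);
            if (urls == null) {
                if (hosts.size() >= maxHosts) return false;  // host table is full
                urls = new LinkedHashSet<>();
                hosts.put(hostname, urls);
            }
            if (urls.size() >= maxUrlsPerHost && !urls.contains(url)) return false;
            return urls.add(url);  // Set semantics give URL uniqueness for free
        }
    }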

[Robots] Re: Robots.txt (was: Hello)

2001-06-11 Thread Otis Gospodnetic
You may want to use something like this to make your life easier: http://www.innovation.ch/java/HTTPClient/ Otis --- srinivas mohan <[EMAIL PROTECTED]> wrote: > Hello Mr. Tim Bray, thank you for the suggestion, but according to my project specification the crawler should be made in...

Re: Looking for a gatherer.

2001-01-23 Thread Otis Gospodnetic
That URL is incorrect. This is the correct one: http://sourceforge.net/projects/jcrawler/ Otis P.S. Hitting "Reply All" put [EMAIL PROTECTED] on the 'To' line twice. Is this maybe a list setup problem? --- Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > I guess you co...

Re: Looking for a gatherer.

2001-01-22 Thread Otis Gospodnetic
I guess you could then try JCrawler (http://jcrawler.sourceforge.net/, I believe). JCrawler uses the WebSphinx package. Otis --- "Alexandr "Xenocid" Koloskov" <[EMAIL PROTECTED]> wrote: > You could try WebSphinx. It's a good spider with a rich set of features, and it's free. www.cs.cmu.e...

Re: Looking for a gatherer.

2001-01-10 Thread Otis Gospodnetic
Add Larbin to that list. --- "Krishna N. Jha" <[EMAIL PROTECTED]> wrote: > Look into webBase, pavuk, wget - there are some other similar free products out there. (I am not sure I fully understand/appreciate all your requirements, though; if you wish, you can clarify them to me.) We al...