RE: [Robots] Post

2002-11-08 Thread Otis Gospodnetic
Sounds interesting.
I'd love to see some screenshots of some community graphs and the main
characters in it... possible?

Otis

--- Nick Arnett [EMAIL PROTECTED] wrote:
 As long as we're kicking around what's new, here's mine.  I've been working
 on a system that finds topical Internet discussions (web forums, usenet,
 mailing lists) and does some analysis of who's who, looking for the people
 who connect communities together, lead discussions, etc.  At the moment,
 it's focusing on Java developers.  It's been quite interesting to see what
 it discovers in terms of how various subtopics are related and what other
 things Java developers tend to be interested in.
 
 Regarding markup, etc., in the back of my mind I've had the notion of
 enhancing my spider to recognize how to parse and recurse forums and list
 archives, so that I don't have to write new code for every different forum
 or archiving format.  But it's not something I'd be comfortable tossing out
 into the open, since it obviously would be a tool that spammers could use
 for address harvesting.
 
 I'm essentially creating a toolbox with Python and MySQL, which I'm using
 to create custom information products for consulting clients.  For the
 moment, those (obviously) are companies with a strong interest in Java.
 
 Nick
 
 --
 Nick Arnett
 Phone/fax: (408) 904-7198
 [EMAIL PROTECTED]
 





Re: [Robots] Post

2002-11-08 Thread Otis Gospodnetic
I think I remember those proposals, actually.
I have never heard anyone mention them anywhere else, so I don't think
anyone has implemented a crawler that looks for those new things in
robots.txt.

Otis

--- Sean 'Captain Napalm' Conner [EMAIL PROTECTED] wrote:
 
   Well, I was surprised to recently find that O'Reilly has mentioned me in
 their book _HTTP: The Definitive Guide_; it seems they mentioned my proposed
 draft extensions to the Robot Exclusion Protocol [1], although I'm not sure
 what they said about it (my friend actually found the reference in O'Reilly;
 I haven't had a chance to check it out myself---page 230 if anyone has it).
 
   Does anyone know if any robots out there actually implement any of the
 proposals?  It'd be interesting to know.
 
   -spc (Six years since that was proposed ... )
 





[Robots] Re: SV: matching and UserAgent: in robots.txt

2002-03-14 Thread Otis Gospodnetic


LWP?  Very popular in the big Perl community.

--- Rasmus Mohr [EMAIL PROTECTED] wrote:
 
 Any idea how widespread the use of this library is?  We've observed some
 weird behaviors from some of the major search engines' spiders (basically
 ignoring robots.txt sections) - maybe this is the explanation?
 
 --
 Rasmus T. Mohr        Direct  : +45 36 910 122
 Application Developer Mobile  : +45 28 731 827
 Netpointers Intl. ApS Phone   : +45 70 117 117
 Vestergade 18 B       Fax     : +45 70 115 115
 1456 Copenhagen K     Email   : mailto:[EMAIL PROTECTED]
 Denmark               Website : http://www.netpointers.com
 
 Remember that there are no bugs, only undocumented features.
 --
 
 -----Original Message-----
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On behalf of Sean M. Burke
 Sent: 14 March 2002 11:08
 To: [EMAIL PROTECTED]
 Subject: [Robots] matching and UserAgent: in robots.txt
 
 
 
 I'm a bit perplexed over whether the current Perl library WWW::RobotRules
 implements a certain part of the Robots Exclusion Standard correctly.  So
 forgive me if this seems a simple question, but my reading of the Robots
 Exclusion Standard hasn't really cleared it up in my mind yet.
 
 Basically the current WWW::RobotRules logic is this:
 As a WWW::RobotRules object is parsing the lines in the robots.txt file,
 if it sees a line that says "User-Agent: ...foo...", it extracts the "foo",
 and if the name of the current user-agent is a substring of "...foo...",
 then it considers this line as applying to it.
 
 So if the agent being modeled is called "Banjo", and the robots.txt line
 being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then the
 library says "OK, 'Banjo' is a substring in 'Thing, Woozle, Banjo, Stuff',
 so this rule is talking to me!"
 
 However, the substring matching currently goes only one way.  So if the
 user-agent object is called "Banjo/1.1 [http://nowhere.int/banjo.html
 [EMAIL PROTECTED]]" and the robots.txt line being parsed says "User-Agent:
 Thing, Woozle, Banjo, Stuff", then the library says "'Banjo/1.1
 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]' is NOT a substring of
 'Thing, Woozle, Banjo, Stuff', so this rule is NOT talking to me!"
 
 I have the feeling that that's not right -- notably because that means that
 every robot ID string has to appear in toto on the "User-Agent" robots.txt
 line, which is clearly a bad thing.
 But before I submit a patch, I'm tempted to ask... what /is/ the proper
 behavior?
 
 Maybe shave the current user-agent's name at the first slash or space
 (getting just "Banjo"), and then see if /that/ is a substring of a given
 robots.txt "User-Agent:" line?
 
 --
 Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/
 
 
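For readers trying to picture the behavior Sean proposes at the end -- shave the
robot's own name at the first slash or space, then check whether that short name
appears in the robots.txt "User-Agent:" value -- here is a rough sketch in Python.
It is only an illustration of the idea; the function names are invented for this
example, and it is not the WWW::RobotRules code.

    # Sketch of the proposed matching rule, not the WWW::RobotRules implementation.
    import re

    def short_agent_name(full_agent):
        # "Banjo/1.1 [http://nowhere.int/banjo.html ...]" -> "banjo"
        return re.split(r'[/\s]', full_agent.strip(), maxsplit=1)[0].lower()

    def rule_applies(full_agent, user_agent_value):
        # A robots.txt "User-Agent: *" record applies to every robot.
        value = user_agent_value.strip().lower()
        if value == '*':
            return True
        return short_agent_name(full_agent) in value

    agent = 'Banjo/1.1 [http://nowhere.int/banjo.html]'
    print(rule_applies(agent, 'Thing, Woozle, Banjo, Stuff'))  # True
    print(rule_applies(agent, 'Thing, Woozle, Stuff'))         # False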





[Robots] Re: better language for writing a Spider ?

2002-03-14 Thread Otis Gospodnetic


 I am working on robot development, in Java.
 We are developing a search engine... almost the
 complete engine is developed.
 We used Java for the development... but the performance
 of the Java API in fetching web pages is too low, so
 basically we developed our own URL connection, as
 some features like timeouts are not
 supported by the java.net.URLConnection API.

Look at Java 1.4; it addresses these issues (socket timeouts,
non-blocking IO, etc.).

 Though there are better spiders in Java, like
 Mercator, we could not achieve better performance
 with our product...

I thought Mercator numbers were pretty good, no?

 Now, as the performance is low, we wanted to redevelop
 our spider in a language like C or Perl, and use
 it with our existing product.

You could look at Python; Ultraseek was/is written in it, from what I
remember.
Also, obviously Perl has been used for writing big crawlers, so you can
use that, too.

 I will be thankful if anyone can help me choose
 a better language, where I can get better performance.

Of course, the choice of a language is not a performance panacea.

Otis
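Whatever language the spider ends up in, the timeout problem from the original
message is worth illustrating: every fetch needs an upper bound, or one slow
server stalls the whole crawl.  As a rough sketch of how little code that takes
outside of Java, Python's standard library accepts a per-request timeout (the
URL below is just a placeholder):

    # Per-request timeout with Python's standard library; the Java 1.4 socket
    # timeouts mentioned above address the same problem on the Java side.
    import urllib.request

    try:
        with urllib.request.urlopen("http://example.com/", timeout=10) as resp:
            body = resp.read()
    except OSError as err:   # socket timeouts and connection errors both land here
        body = None
        print("fetch failed:", err)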





[Robots] Re: Perl and LWP robots

2002-03-07 Thread Otis Gospodnetic


Excellent.  I have a copy of Wong's book at home and like that topic
(i.e. I'm a potential customer :))  When will it be published?
I think lots of people do want to know about recursive spiders, and I
bet some of the most frequent obstacles are issues like queueing, depth-
vs. breadth-first crawling, (memory-)efficient storage of extracted and
crawled links, etc.
I think that if you covered those topics well, lots of people would be
very grateful.

Thank you for asking, I hope this helps.
Otis

--- Sean M. Burke [EMAIL PROTECTED] wrote:
 
 Hi all!
 My name is Sean Burke, and I'm writing a book for O'Reilly, which is
 basically to replace Clinton Wong's now out-of-print /Web Client
 Programming with Perl/.  In my book draft so far, I haven't discussed
 actual recursive spiders (I've only discussed getting a given page, and
 then every page that it links to which is also on the same host), since
 I think that most readers who think they want a recursive spider really
 don't.
 But it has been suggested that I cover recursive spiders, just for the
 sake of completeness.
 
 Aside from the basic concepts (don't hammer the server; always obey
 robots.txt; don't span hosts unless you are really sure that you want to),
 are there any particular bits of wisdom that list members would want me
 to pass on to my readers?
 
 --
 Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/
 
 
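As a concrete picture of the queueing and depth- vs. breadth-first issues Otis
lists above, here is a minimal crawl-frontier sketch in Python.  The
fetch_and_extract_links callback is a placeholder, not a real library call, and
a real spider would layer politeness delays and robots.txt checks on top of this.

    # The same loop does breadth-first (FIFO, popleft) or depth-first (LIFO, pop);
    # the "seen" set is the memory concern: it holds every URL ever queued so
    # nothing is stored or fetched twice.
    from collections import deque

    def crawl(start_url, fetch_and_extract_links, max_pages=100, breadth_first=True):
        frontier = deque([start_url])
        seen = {start_url}
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft() if breadth_first else frontier.pop()
            fetched += 1
            for link in fetch_and_extract_links(url):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return seen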





[Robots] Re: Correct URL, slash at the end?

2001-11-22 Thread Otis Gospodnetic


   The above is just for consideration if the robots.txt is ever updated,
  so the robots could be informed of this little detail.
 
   There was a push in '96 or '97 to update the robots.txt standard and I
 wrote a proposal back then (http://www.conman.org/people/spc/robots2.html),
 and while I still get the occasional email about it, to my knowledge no
 robot has implemented it (some portions perhaps, but not everything).  I
 only mention this because it was attempted before.

Yes, that's the problem.  Robot writers don't care for robots.txt
improvements (it would just slow their robots down), and people in
charge of maintaining robots.txt rarely/never hear about ideas
discussed on this list.  The push has to come from the latter group, I
believe.

I'm writing a simple little thing for indexing my Opera bookmarks which
includes a crawler component, so if I have time I'll try implementing
as much of the stuff mentioned in your proposal as I can, just for fun
(some fun).

Otis (where is my Turkey?) Gospodnetic







[Robots] Re: Data structures for crawlers?

2001-06-27 Thread Otis Gospodnetic


Hello,

Yes, everything you said is fine. I just wanted to
write 'custom data structures' and code to handle
large amounts of data by flexibly keeping it either in
RAM or on disk, instead of using a regular RDBMS for
storing that data, like Webbase does.

Otis

--- Corey Schwartz [EMAIL PROTECTED] wrote:
 
 Implementing a FIFO queue will certainly work for the crawler but is not
 friendly toward the websites being crawled.  Using a FIFO queue, as you
 mentioned, means that you are doing a breadth-first search through the
 site.  It is very likely that you will send hundreds of page requests to
 the same server in a very short amount of time.  Depending on how you
 design your data structures, you should be able to record the time of the
 last request for a page from any particular server and pace the requests
 so that you don't request more than a page every few minutes.
 
 It sounds like you are implementing this as a recursive call to a crawl
 function.  It seems to me that you should parse out the URL into a
 scheme/server/path/filename/port and store all of that information in a
 database of your choice, along with other important data such as the
 number of times you've visited a site, when the last visit was made,
 whether the site is still active, etc.
 
 Corey
 
 
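Corey's per-server pacing suggestion fits in a few lines of bookkeeping.  The
sketch below is Python, and the one-minute delay is an arbitrary example value,
not a recommendation from the thread.

    # Remember when each host was last requested and only release a URL once
    # enough time has passed.  A real crawler would persist this table and
    # combine it with the frontier queue.
    import time
    from urllib.parse import urlparse

    class PoliteScheduler:
        def __init__(self, min_delay_seconds=60.0):
            self.min_delay = min_delay_seconds
            self.last_request = {}  # host -> timestamp of last request

        def ready(self, url):
            host = urlparse(url).netloc.lower()
            last = self.last_request.get(host)
            return last is None or (time.time() - last) >= self.min_delay

        def record(self, url):
            # Call immediately after actually fetching the URL.
            self.last_request[urlparse(url).netloc.lower()] = time.time()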






[Robots] Re: Search Engine Spiders and Cookies

2001-06-17 Thread Otis Gospodnetic


Hello,

Web 'spiders' act much like regular web clients.
Depending on the implementation, a spider may accept
cookies, store them, and send them back to the sites
that set them, or it may ignore them completely.
There is no single answer.
If you do not want spiders to index your sites, there
are a few 'standard' and proper ways of doing it.  For
information about the Robot Exclusion Standard, see
www.robotstxt.org.

Otis
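To make the "no single answer" point concrete: whether a crawler honors cookies
is purely a client-side choice.  A small Python sketch, using only the standard
library (the URL is a placeholder; this is not how any particular search-engine
spider behaves):

    import urllib.request
    from http.cookiejar import CookieJar

    # Opener with no cookie handling: Set-Cookie headers arrive but are never
    # stored or sent back -- the spider effectively ignores cookies.
    cookie_blind = urllib.request.build_opener()

    # Opener with a CookieJar: cookies are accepted, stored, and resent to the
    # sites that set them, much like an ordinary browser.
    jar = CookieJar()
    cookie_aware = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))

    # Example use (placeholder URL):
    # cookie_aware.open("http://example.com/").read()
    # print(len(jar), "cookies stored")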


--- Dave Watson [EMAIL PROTECTED] wrote:
 
 Hello there
 
 Does anybody please have any information on the specific difficulties
 search engine spiders have with cookies, and what methods can be used to
 circumvent the problems they face?
 
 I am a marketing consultant working with a range of clients who use log
 file analysis tools such as WebTrends and RedSheriff.  Both of these tools
 (and others) use (or at least have an option to use) cookies in order to
 track unique visitors to web sites.  My understanding is that search
 engine spiders cannot accept cookies and consequently will not trawl a
 site that serves cookies to them.  I understand that cloaking is a
 POSSIBLE solution - however this in itself raises a few questions for me:
 
 1) If cloaking is an acceptable method to get round the problem of sending
 away a search engine spider (and consequently not indexing pages within a
 site), how should this technology be implemented and what pitfalls need to
 be considered?
 2) Most search engines state clearly that they ban sites altogether from
 their indexes if they discover cloaking has been used.  Are there any
 other methods that are preferable?
 
 The manufacturers of both WebTrends and RedSheriff have not provided me
 with any information on this matter, and so I would like to raise the
 issues highlighted in any responses so that they can respond formally.
 
 I have to say that although I have a technical background I am not a
 programmer or software developer - but I have been learning a lot simply
 by subscribing to this list - keep up the great work.  However, I am
 looking for a reasonably technical answer (or a pointer to where I might
 go to find one, please) so that I can pass on specific suggestions to my
 client, WebTrends and RedSheriff.
 
 Thanks for any help you can give.
 
 Regards
 
 Dave Watson.
 
 






Re: Looking for a gatherer.

2001-01-10 Thread Otis Gospodnetic

Add Larbin to that list.

--- Krishna N. Jha [EMAIL PROTECTED] wrote:
 Look into webBase, pavuk, wget - there are some other similar free
 products out there.
 (I am not sure I fully understand/appreciate all your requirements,
 though; if you wish, you can clarify them to me.)
 We also have web-crawlers which offer more flexibility - but are not
 free.

 Hope that helps,
 Krishna Jha

 Mark Friedman wrote:
 
  I am looking for a spider/gatherer with the following characteristics:
 
     * Enables the control of the crawling process by URL
       substring/regexp and HTML context of the link.
     * Enables the control of the gathering (i.e. saving) processes by
       URL substring/regexp, MIME type, other header information and
       ideally by some predicates on the HTML source.
     * Some way to save page/document metadata, ideally in a database.
     * Freeware, shareware or otherwise inexpensive would be nice.
 
  Thanks in advance for any help.
 
  -Mark

