I think I remember those proposals, actually.
I have never heard anyone mention them anywhere else, so I don't think
anyone has implemented a crawler that looks for those new things in
robots.txt.
Otis
--- Sean 'Captain Napalm' Conner <[EMAIL PROTECTED]> wrote:
>
> Well, I was surprised to recen
Sounds interesting.
I'd love to see some screenshots of some community graphs and the main
characters in them. Possible?
Otis
--- Nick Arnett <[EMAIL PROTECTED]> wrote:
> As long as we're kicking around what's new, here's mine. I've been
> working
> on a system that finds topical Internet discussion
> I am working on robot development, in Java.
> We are developing a search engine; almost the
> complete engine is developed...
> We used Java for the development... but the performance
> of the Java API in fetching web pages is too low,
> so basically we developed our own URLConnection, as
>
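The custom connection class isn't shown in the truncated message, but the usual first fix for slow fetches with the stock java.net API is setting explicit timeouts. A minimal sketch (class name and User-Agent string are made up; timeout support assumes JDK 1.5+):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SimpleFetcher {

    // Read an entire stream line by line; separated out so the body
    // handling is testable without a network connection.
    public static String readAll(InputStream in) throws IOException {
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    // Fetch one page with explicit timeouts; without them the default
    // URLConnection can block indefinitely on a slow or stalled server.
    public static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5000);   // ms allowed to establish the connection
        conn.setReadTimeout(10000);     // ms allowed to wait for data
        conn.setRequestProperty("User-Agent", "ExampleCrawler/0.1");  // made-up name
        return readAll(conn.getInputStream());
    }
}
```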
LWP? It's very popular in the Perl community.
--- Rasmus Mohr <[EMAIL PROTECTED]> wrote:
>
> Any idea how widespread the use of this library is? We've observed
> some
> weird behaviors from some of the major search engines' spiders
> (basically
> ignoring robots.txt sections) - maybe this is the
Excellent. I have a copy of Wong's book at home and like that topic
(i.e. I'm a potential customer :)) When will it be published?
I think lots of people do want to know about recursive spiders, and I
bet the most frequent obstacles are issues like queueing and depth-
vs. breadth-first crawling.
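Both of those obstacles can be sketched with one small structure: a frontier that behaves breadth-first or depth-first depending only on which end of a deque you poll, plus a seen-set so the recursion doesn't revisit pages. A rough Java sketch (all names are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Crawl frontier sketch: FIFO polling gives breadth-first order,
// LIFO polling gives depth-first order, and the seen-set prevents
// the same URL from being queued twice.
public class Frontier {
    private final Deque<String> urls = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();
    private final boolean breadthFirst;

    public Frontier(boolean breadthFirst) {
        this.breadthFirst = breadthFirst;
    }

    // Returns true if the URL was new and got queued.
    public boolean add(String url) {
        if (!seen.add(url)) {
            return false;  // already seen: skip it
        }
        urls.addLast(url);
        return true;
    }

    // Next URL to fetch, or null when the frontier is empty.
    public String next() {
        return breadthFirst ? urls.pollFirst() : urls.pollLast();
    }

    public boolean isEmpty() {
        return urls.isEmpty();
    }
}
```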
> > The above is just for consideration if the robots.txt is ever
> updated so the
> > robots could be informed of this little detail.
>
> There was a push in '96 or '97 to update the robots.txt standard
> and I
> wrote a proposal back then
> (http://www.conman.org/people/spc/robots2.html
> - - - - -
> Crazy thought...
>
> This is where the robots.txt file could be used to hold that
> information for
> the robot agents that need to know the operational order of the "/"
> default
> names used on that service.
>
> User-agent: *
> Slash: default.htm, default.html, index.htm, in
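A crawler that wanted to honor such a field could parse it along these lines (a sketch only; "Slash:" was a proposal from this thread, never part of the deployed robots.txt standard):

```java
import java.util.ArrayList;
import java.util.List;

// Parse the proposed "Slash:" robots.txt field: a comma-separated,
// ordered list of default document names tried for "/" URLs.
public class SlashField {
    public static List<String> parse(String line) {
        String prefix = "Slash:";
        List<String> names = new ArrayList<>();
        // Field names in robots.txt are conventionally case-insensitive.
        if (!line.regionMatches(true, 0, prefix, 0, prefix.length())) {
            return names;  // not a Slash: line
        }
        for (String name : line.substring(prefix.length()).split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return names;
    }
}
```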
Hello,
Yes, everything you said is fine. I just wanted to
write 'custom data structures' and code to handle
large amounts of data by flexibly keeping it either in
RAM or on disk, instead of using a regular RDBMS for
storing that data, like Webbase does.
Otis
--- Corey Schwartz <[EMAIL PROTECTE
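The RAM-or-disk idea can be sketched as a map that keeps the first N entries in memory and writes the overflow to one file per key. Names, the capacity limit, and the one-file-per-key layout here are illustrative choices, not taken from Webbase or any real project:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// String map that holds up to maxInRam entries in a HashMap and
// spills any further entries to files in a directory on disk.
public class HybridStore {
    private final Map<String, String> ram = new HashMap<>();
    private final Path spillDir;
    private final int maxInRam;

    public HybridStore(Path spillDir, int maxInRam) {
        this.spillDir = spillDir;
        this.maxInRam = maxInRam;
    }

    // URL-safe Base64 keeps arbitrary keys valid as file names.
    private Path fileFor(String key) {
        return spillDir.resolve(Base64.getUrlEncoder()
                .encodeToString(key.getBytes(StandardCharsets.UTF_8)));
    }

    public void put(String key, String value) throws IOException {
        if (ram.size() < maxInRam || ram.containsKey(key)) {
            ram.put(key, value);
        } else {
            Files.write(fileFor(key), value.getBytes(StandardCharsets.UTF_8));
        }
    }

    public String get(String key) throws IOException {
        String v = ram.get(key);
        if (v != null) {
            return v;
        }
        Path f = fileFor(key);
        return Files.exists(f)
                ? new String(Files.readAllBytes(f), StandardCharsets.UTF_8)
                : null;
    }
}
```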
Hello,
Web 'spiders' act like regular web clients do.
Depending on the spider implementation they may accept
cookies, store them, and send them back to sites that
set them, or they can just completely ignore them.
There is no single answer.
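With the JDK's own cookie support (java.net.CookieManager, available since JDK 6), the two behaviors are one line apart; a sketch:

```java
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;

// The two spider behaviors described above, using the JDK's cookie
// machinery: accept and replay cookies, or ignore them entirely.
public class SpiderCookies {

    // Manager that accepts all cookies and sends them back on later requests.
    public static CookieManager acceptingManager() {
        return new CookieManager(null, CookiePolicy.ACCEPT_ALL);
    }

    // Manager that silently drops every cookie a site tries to set.
    public static CookieManager ignoringManager() {
        return new CookieManager(null, CookiePolicy.ACCEPT_NONE);
    }

    // URLConnection-based fetches in this JVM will then use this handler.
    public static void install(CookieManager manager) {
        CookieHandler.setDefault(manager);
    }
}
```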
If you do not want spiders to index your sites there
ar
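The standard mechanism is the Robots Exclusion Protocol: a robots.txt file at the site root. A minimal one that asks all compliant crawlers to stay out looks like:

```
User-agent: *
Disallow: /
```

Note that compliance is voluntary; a spider that ignores robots.txt will crawl the site anyway.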
Yes, I read the Mercator paper multiple times :)
I was hoping for more concrete suggestions, but I
guess people don't want to share any knowledge :(
Thanks,
Otis
P.S.
Nick (the list maintainer): whenever I hit Reply All I
end up with multiple [EMAIL PROTECTED] addresses on
the To: line. It could b
Hello,
Some members of this list have probably had to write
something like this before.
I am trying to create/pick a data structure that will
allow me to store a fixed number of unique hostnames
(call it HostData object) each of which has a
fixed-sized list of unique URLs associated with it.
For
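One way to sketch that structure: a capped map of hostname to a capped, insertion-ordered set of unique URLs, rejecting additions past either limit. Class and method names here are made up, not from the original message:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Fixed number of unique hostnames, each holding a fixed-sized,
// insertion-ordered set of unique URLs.
public class HostTable {
    private final int maxHosts;
    private final int maxUrlsPerHost;
    private final Map<String, Set<String>> hosts = new LinkedHashMap<>();

    public HostTable(int maxHosts, int maxUrlsPerHost) {
        this.maxHosts = maxHosts;
        this.maxUrlsPerHost = maxUrlsPerHost;
    }

    // Returns true if the URL was stored; false if a limit was hit
    // or the URL was already present for that host.
    public boolean add(String hostname, String url) {
        Set<String> urls = hosts.get(hostname);
        if (urls == null) {
            if (hosts.size() >= maxHosts) {
                return false;  // host table is full
            }
            urls = new LinkedHashSet<>();
            hosts.put(hostname, urls);
        }
        if (urls.size() >= maxUrlsPerHost && !urls.contains(url)) {
            return false;  // per-host URL list is full
        }
        return urls.add(url);  // false if the URL is a duplicate
    }

    public int hostCount() {
        return hosts.size();
    }
}
```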
You may want to use something like this to make your life easier:
http://www.innovation.ch/java/HTTPClient/
Otis
--- srinivas mohan <[EMAIL PROTECTED]> wrote:
>
> Hello Mr Tim Bray,
>
> Thank you for the suggestion,
> but according to my project specification
> the crawler should be made in
That URL is incorrect.
This is the correct one:
http://sourceforge.net/projects/jcrawler/
Otis
P.S.
Hitting "Reply All" put [EMAIL PROTECTED] on the 'To:'
line twice. Is this a list setup problem, maybe?
--- Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:
> I guess you co
I guess you could then try JCrawler
(http://jcrawler.sourceforge.net/ I believe).
JCrawler uses the WebSphinx package.
Otis
--- "Alexandr \"Xenocid\" Koloskov"
<[EMAIL PROTECTED]> wrote:
> You could try WebSphinx. It's a good spider with
> a rich set of features, and
> it's free.
> www.cs.cmu.e
Add Larbin to that list.
--- "Krishna N. Jha" <[EMAIL PROTECTED]> wrote:
> Look into webBase, pavuk, wget - there are some
> other similar free
> products out there.
> (I am not sure I fully understand/appreciate all
> your requirements,
> though; if you wish, you can clarify them to me.)
> We al