Lars J. Nilsson wrote:
> > I wouldn't think of Java as the first choice for a high-volume web spider.
> > What are the advantages?

>     - Java's built-in network capabilities

Except that the HttpURLConnection implementations on most of the Linux
JVMs are flaky :( failing to time out properly is one common bug.
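
For what it's worth, later JDKs added explicit timeout setters on
URLConnection, which at least stops a dead server from hanging a fetcher
thread forever. A minimal sketch (the example.com URL is just a
placeholder):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class TimedFetch {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://example.com/");   // placeholder URL
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Bound both the connect and the read phases so a slow or dead
            // server cannot block the thread indefinitely.
            conn.setConnectTimeout(10 * 1000);   // milliseconds
            conn.setReadTimeout(30 * 1000);
            try (InputStream in = conn.getInputStream()) {
                byte[] buf = new byte[8192];
                int total = 0, n;
                while ((n = in.read(buf)) != -1) {
                    total += n;
                }
                System.out.println("Fetched " + total + " bytes");
            } finally {
                conn.disconnect();
            }
        }
    }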

>     - HTML parsing is a part of the core language

And the HTML parser in the core libraries is rather stricter than
real-world markup deserves, and needs extending before it does anything useful.
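
To be concrete, the parser shipped with the JDK is the Swing one
(javax.swing.text.html.parser.ParserDelegator). Pulling anchor hrefs out
of a page looks roughly like the sketch below (placeholder URL again),
though badly broken markup will still need extra handling:

    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.net.URL;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    public class LinkExtractor {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://example.com/");   // placeholder URL
            // Callback that prints every href it sees; all other tags are ignored.
            HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                    if (tag == HTML.Tag.A) {
                        Object href = attrs.getAttribute(HTML.Attribute.HREF);
                        if (href != null) {
                            System.out.println(href);
                        }
                    }
                }
            };
            Reader reader = new InputStreamReader(url.openStream());
            try {
                // Third argument tells the parser to ignore any charset
                // declared in the document instead of aborting the parse.
                new ParserDelegator().parse(reader, callback, true);
            } finally {
                reader.close();
            }
        }
    }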

>     - Portability (write once debug... sorry, run everywhere)

Sort of true. Perl/C can easily be made source-portable, and Java
can just as easily be made non-portable.

>     - Existing n-tier server architectures (JSP, Servlets, EJB, JDBC, JNDI and so on)
>     - Easy scalability (possibly through JavaSpaces and Jini)

I'd agree with these, plus the threading points mentioned in another
post. I still think Java is a suitable language for building a spider,
particularly as most of the work goes into waiting for servers, and
it is perfectly possible (if awkward) to use a separate language
for writing the parsers.
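
To illustrate the "waiting for servers" point: a pool of fetcher threads
sharing one URL queue keeps the pipes full even though each thread spends
most of its time blocked on I/O. A rough sketch, with fetch() as a stub
standing in for the real download code (e.g. the timeout-aware fetch above):

    import java.util.LinkedList;

    public class FetchPool {
        private final LinkedList<String> queue = new LinkedList<String>();  // URLs to fetch

        public synchronized void add(String url) {
            queue.addLast(url);
            notify();   // wake up one idle worker
        }

        private synchronized String next() throws InterruptedException {
            while (queue.isEmpty()) {
                wait();   // block until a URL is queued
            }
            return queue.removeFirst();
        }

        public void start(int threads) {
            for (int i = 0; i < threads; i++) {
                new Thread(new Runnable() {
                    public void run() {
                        try {
                            while (true) {
                                fetch(next());
                            }
                        } catch (InterruptedException e) {
                            // let the worker thread exit
                        }
                    }
                }).start();
            }
        }

        // Stub: a real spider would do the HTTP download and parsing here.
        private void fetch(String url) {
            System.out.println(Thread.currentThread().getName() + " fetching " + url);
        }

        public static void main(String[] args) {
            FetchPool pool = new FetchPool();
            pool.start(20);                    // many threads, since each mostly waits on I/O
            pool.add("http://example.com/");   // placeholder seed URL
        }
    }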

See here for more information on the problems with using Java:
http://www.research.compaq.com/SRC/mercator/papers/Java99/final.html

Richard
