Tobias N. Sasse wrote:
And most importantly, the bug: with an increasing number of pages I receive zillions of
"java.net.BindException: Address already in use: connect"

Just to let you know: this bug has been fixed. The cause of the problem was that I had been testing under Windows XP, and this crappy OS only uses a range of ~4000 ports for TCP/IP connections and needs up to 4 minutes to clear them again. So once those ~4000 ports are saturated, connections are refused until the OS releases them.
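
For anyone who hits the same wall: besides moving off XP, the usual workaround is to reuse connections instead of opening a fresh socket per page, so ports never pile up in TIME_WAIT in the first place. Here is a minimal sketch with plain HttpURLConnection (the class and method names are mine, not the crawler's):

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class KeepAliveFetcher {
        public static byte[] fetch(String pageUrl) throws Exception {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(pageUrl).openConnection();
            // HTTP/1.1 keep-alive is on by default; this just makes it explicit.
            conn.setRequestProperty("Connection", "keep-alive");
            InputStream in = conn.getInputStream();
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } finally {
                // Read the body fully and close the stream instead of calling
                // disconnect(); that returns the socket to the JVM's internal
                // keep-alive pool so the next fetch can reuse it.
                in.close();
            }
        }
    }

On XP itself you can also widen the port range / shorten the 4-minute wait via the MaxUserPort and TcpTimedWaitDelay registry values under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters, but connection reuse is the cleaner fix.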

Under a Linux environment I did not encounter these problems. The design questions remain open, and I am looking for your feedback!

Some interesting numbers:

With a dual-core 2x1.8 GHz CPU, 2 GB RAM, and an S-ATA drive
Client: Java 6
Server: local Apache HTTPD 2 (standard config)

I get a maximum of 146 pages/second. Note: the crawler does heavy string processing and strips out all HTML tags and the like. Since I am actually working on a search engine, I only want the plain text of a website; the HTML and script/image/etc. stuff is not relevant for my search algorithms, so I delete it. You could cut back the string processing, which would push the pages/second rate even higher, but then there would be more I/O, because the files we write would be larger.
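
For the curious, the tag stripping is roughly of this shape (a simplified sketch, not the actual crawler code; the regexes are illustrative and will choke on some malformed pages):

    import java.util.regex.Pattern;

    public class HtmlStripper {
        // Drop script and style blocks entirely, including their contents.
        private static final Pattern SCRIPTS =
            Pattern.compile("(?is)<(script|style)[^>]*>.*?</\\1>");
        // Then drop every remaining tag, keeping only the text between them.
        private static final Pattern TAGS = Pattern.compile("(?s)<[^>]+>");

        public static String strip(String html) {
            String text = SCRIPTS.matcher(html).replaceAll(" ");
            text = TAGS.matcher(text).replaceAll(" ");
            // Collapse leftover whitespace so the stored files stay small.
            return text.replaceAll("\\s+", " ").trim();
        }
    }

In the real crawler you would probably want a proper HTML parser for robustness, but a regex pass like this is cheap, which matters when you are measuring pages/second.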

With kind regards,
Tobi
