Personally, I used Droids to crawl a website of approximately 250,000 pages. The queue was stored in memory and I arbitrarily allocated 1 GB of memory to Java. Everything worked fine.
That's not a large number of web pages, but I think Droids' current implementation is well suited to such jobs: crawling a relatively small set of web pages or crawling an intranet. This is particularly true if you need to customize how the pages are handled.

I hope this experience helps.

Bertil Chapuis

On Nov 14, 2009, at 3:59 AM, Otis Gospodnetic wrote:

> OK, thanks.
>
> So how do people really use Droids at scale? e.g. crawling a large number of
> web pages? I happen to use it for something smallish, so I never had issues
> with the queue being in the JVM heap and getting OOMs because of that. But I
> imagine that anyone using it for a larger crawl would hit OOM sooner or
> later, no?
>
> Does this imply that either nobody is using Droids for large-scale crawls, or
> that everyone who does implemented their own, custom disk-backed queue?
>
>
> Thanks,
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
>
>
> ----- Original Message ----
>> From: Ryan McKinley <[email protected]>
>> To: [email protected]
>> Sent: Fri, November 13, 2009 5:17:51 PM
>> Subject: Re: Queue: in memory or on disk?
>>
>> ya, the standard one is in memory.
>>
>> It is easy to write one to store things to disk or whatever -- I use one
>> that stores tasks to an h2 database, but it is not general enough to
>> contribute back...
>>
>> I think Migfa was looking at replacing the droids Queue interface with a
>> standard java.util.Queue interface
>>
>> ryan
>>
>>
>> On Nov 13, 2009, at 5:10 PM, Chapuis Bertil wrote:
>>
>>> I think the current implementation only provides in memory queues of tasks.
>>> However, since the TaskQueue interface is relatively simple it shouldn't be
>>> too hard to persist the data on the disk or to implement a TaskQueue which
>>> works with a JMS broker or something else.
>>>
>>>
>>> On Nov 12, 2009, at 10:37 PM, Otis Gospodnetic wrote:
>>>
>>>> Hello,
>>>>
>>>> I haven't looked at the sources. But who stores items put in the Queue?
>>>> Are they in memory, or does something write them to disk, or something else?
>>>>
>>>> Thanks,
>>>> Otis
>>>> --
>>>> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>>>> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
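
To make the "store it on disk" idea from the thread a bit more concrete, here is a minimal sketch of a file-backed FIFO queue in Java. It is only an illustration under stated assumptions: it does not implement the Droids TaskQueue interface, FileBackedQueue and its methods are hypothetical names, and entries are plain URL strings rather than Droids tasks. The point is just that pending items live in a file instead of the JVM heap.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: a FIFO queue of URL strings kept in a file rather
// than on the heap. Not part of Droids and not its TaskQueue interface.
final class FileBackedQueue implements AutoCloseable {
    private final RandomAccessFile file;
    private long readPos = 0;   // offset of the next entry to hand out
    private long writePos = 0;  // offset where the next entry is appended
    private int size = 0;       // entries offered but not yet polled

    FileBackedQueue(Path path) throws IOException {
        file = new RandomAccessFile(path.toFile(), "rw");
        // Start empty; a real implementation would recover its state instead.
        file.setLength(0);
    }

    // Appends an entry at the end of the file.
    // writeUTF limits one entry to 65535 encoded bytes, which is ample for URLs.
    synchronized void offer(String url) throws IOException {
        file.seek(writePos);
        file.writeUTF(url);
        writePos = file.getFilePointer();
        size++;
    }

    // Returns the oldest entry not yet handed out, or null if none is left.
    // Consumed entries are never reclaimed here, so the file only grows.
    synchronized String poll() throws IOException {
        if (readPos == writePos) {
            return null;
        }
        file.seek(readPos);
        String url = file.readUTF();
        readPos = file.getFilePointer();
        size--;
        return url;
    }

    synchronized int size() {
        return size;
    }

    @Override
    public synchronized void close() throws IOException {
        file.close();
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("crawl-queue", ".bin");
        try (FileBackedQueue queue = new FileBackedQueue(path)) {
            queue.offer("http://example.com/a");
            queue.offer("http://example.com/b");
            System.out.println(queue.poll());
            System.out.println(queue.size());
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```

Memory use stays constant however many URLs are pending, at the cost of one seek per operation. Something usable for a real crawl would also need compaction of consumed entries, recovery after a restart, and probably de-duplication of URLs, which is presumably where a database such as the h2 setup mentioned above becomes attractive.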
