Personally, I used Droids to crawl a website of approximately 250,000 pages. The 
queue was stored in memory and I arbitrarily allocated 1 GB of memory to Java. 
Everything worked fine. 

That's not a large number of web pages, but I think Droids' current 
implementation is well suited to such jobs: crawling a relatively small set of 
web pages or crawling an intranet. This is particularly true if you need to 
customize how the pages are handled. 
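
For larger crawls, the disk-backed queue idea discussed further down in this
thread could look roughly like the sketch below. It is only an illustration
under my own assumptions: SpillingUrlQueue, its methods and the memoryLimit
parameter are hypothetical names, it stores plain URL strings rather than
Droids tasks, and it does not implement the Droids TaskQueue interface. It
keeps a bounded number of entries in memory and spills the rest to a file,
preserving FIFO order.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.ArrayDeque;

/** Hypothetical FIFO queue of URLs that spills to disk; not part of Droids. */
class SpillingUrlQueue implements AutoCloseable {
    private final ArrayDeque<String> head = new ArrayDeque<>(); // oldest entries
    private final int memoryLimit;        // max entries held in memory
    private final RandomAccessFile spill; // newer entries, in arrival order
    private long readPos = 0;             // offset of the next entry to read
    private long writePos = 0;            // offset where the next entry goes
    private long spilled = 0;             // entries currently on disk

    SpillingUrlQueue(Path spillFile, int memoryLimit) throws IOException {
        if (memoryLimit < 1) {
            throw new IllegalArgumentException("memoryLimit must be >= 1");
        }
        this.memoryLimit = memoryLimit;
        this.spill = new RandomAccessFile(spillFile.toFile(), "rw");
        this.spill.setLength(0); // start from an empty spill file
    }

    /** Adds a URL; goes to disk once memory is full or disk is non-empty. */
    void offer(String url) {
        try {
            if (spilled == 0 && head.size() < memoryLimit) {
                head.addLast(url);
            } else {
                spill.seek(writePos);
                spill.writeUTF(url); // length-prefixed, max 65535 bytes
                writePos = spill.getFilePointer();
                spilled++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the oldest URL, or null if the queue is empty. */
    String poll() {
        try {
            if (head.isEmpty()) {
                refill();
            }
            return head.pollFirst();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Moves up to memoryLimit entries from disk back into memory. */
    private void refill() throws IOException {
        while (spilled > 0 && head.size() < memoryLimit) {
            spill.seek(readPos);
            head.addLast(spill.readUTF());
            readPos = spill.getFilePointer();
            spilled--;
        }
        if (spilled == 0) { // disk fully drained: reclaim the space
            spill.setLength(0);
            readPos = 0;
            writePos = 0;
        }
    }

    long size() {
        return head.size() + spilled;
    }

    @Override
    public void close() throws IOException {
        spill.close();
    }
}
```

With a memoryLimit of, say, a few thousand entries, the heap used by the
queue stays bounded regardless of how many URLs are waiting. The sketch is
single-threaded and does not survive a restart; a real crawl would need
synchronization and probably a proper store such as the database approach
mentioned below.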

I hope this experience helps.

Bertil Chapuis


On Nov 14, 2009, at 3:59 AM, Otis Gospodnetic wrote:

> OK, thanks.
> 
> So how do people really use Droids at scale? e.g. crawling a large number of 
> web pages?  I happen to use it for something smallish, so I never had issues 
> with the queue being in the JVM heap and getting OOMs because of that.  But I 
> imagine that anyone using it for a larger crawl would hit OOM sooner or 
> later, no?
> 
> Does this imply that either nobody is using Droids for large-scale crawls, or 
> that everyone who does implemented their own, custom disk-backed queue?
> 
> 
> Thanks,
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> 
> 
> 
> ----- Original Message ----
>> From: Ryan McKinley <[email protected]>
>> To: [email protected]
>> Sent: Fri, November 13, 2009 5:17:51 PM
>> Subject: Re: Queue: in memory or on disk?
>> 
>> ya, the standard one is in memory.
>> 
>> It is easy to write one to store things to disk or whatever -- I use one that
>> stores tasks to an h2 database, but it is not general enough to contribute
>> back...
>> 
>> I think Migfa was looking at replacing the droids Queue interface with a 
>> standard java.util.Queue interface
>> 
>> ryan
>> 
>> 
>> On Nov 13, 2009, at 5:10 PM, Chapuis Bertil wrote:
>> 
>>> I think the current implementation only provides in-memory queues of tasks.
>>> However, since the TaskQueue interface is relatively simple, it shouldn't be
>>> too hard to persist the data on disk or to implement a TaskQueue which
>>> works with a JMS broker or something else.
>>> 
>>> 
>>> On Nov 12, 2009, at 10:37 PM, Otis Gospodnetic wrote:
>>> 
>>>> Hello,
>>>> 
>>>> I haven't looked at the sources.  But who stores items put in the Queue?
>>>> Are they in memory, or does something write them to disk, or something else?
>>>> 
>>>> Thanks,
>>>> Otis
>>>> --
>>>> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>>>> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>>>> 
>>> 
> 
