RE: Mutithreaded parsing

Santiago Pérez Mon, 28 Dec 2009 12:24:27 -0800

Because parsing the outlinks of each fetched URL with a single thread is too
slow when I also detect the language of the content behind those outlinks. I
would like to share the load between multiple threads, as Fetcher does...
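
A minimal sketch of that idea, assuming a fixed-size thread pool from
java.util.concurrent. The class ParallelOutlinkFilter and its filter method
are hypothetical and not part of Nutch; the accept predicate stands in for
the download + Jericho + LingPipe step, and the URLs are made-up examples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical helper (not a Nutch class): checks outlinks on several threads.
class ParallelOutlinkFilter {

    // Returns the URLs accepted by `accept`, in their original order.
    // `accept` is an assumption standing in for fetch + parse + language detection.
    static List<String> filter(List<String> urls, Predicate<String> accept, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> accept.test(url));
            }
            // invokeAll waits for all tasks and returns futures in task order.
            List<Future<Boolean>> results = pool.invokeAll(tasks);
            List<String> kept = new ArrayList<>();
            for (int i = 0; i < urls.size(); i++) {
                if (results.get(i).get()) {
                    kept.add(urls.get(i));
                }
            }
            return kept;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = List.of(
                "http://example.es/a", "http://example.com/b", "http://example.es/c");
        // Placeholder check only; a real predicate would inspect the page content.
        Predicate<String> looksSpanish = u -> u.contains(".es/");
        System.out.println(filter(urls, looksSpanish, 4));
    }
}
```

The pool size is a tuning choice: the language detection itself is
CPU-bound, but downloading each outlink mostly waits on the network, so more
threads than cores may still help there.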



Funtick wrote:
> 
> Not sure about multithreading:
> - Parsing is CPU-bound
> - On a 4-core machine we need 3-4 threads at most
> - Map/Reduce can be configured with 3-4 Reducers and use 3-4 cores
> 
> Why multithreading?
> 
> (with Map/Reduce on Hadoop, multithreading is a necessity only for
> fetching pages from the Internet, i.e. in Fetcher...)
> 
> 
> Fuad Efendi
> +1 416-993-2060
> http://www.linkedin.com/in/liferay
> 
> Tokenizer Inc.
> http://www.tokenizer.ca/
> Data Mining, Vertical Search
> 
>> -----Original Message-----
>> From: Santiago Pérez [mailto:elara...@gmail.com]
>> Sent: December-28-09 6:09 AM
>> To: nutch-dev@lucene.apache.org
>> Subject: Mutithreaded parsing
>> 
>> 
>> Hi,
>> 
>> I am developing a modification to Nutch that accepts only outlinks to
>> Spanish-language URLs. I have implemented downloading and parsing the
>> content of each outlink (in ParseOutFormat) with Jericho, and detecting
>> the language with LingPipe.
>> 
>> This process seems too heavy, especially because it is done by only one
>> thread, so I would appreciate any ideas on:
>> 
>> - An easier way to detect the language of an outlink?
>> - A way to perform multithreaded outlink extraction, as Fetcher does?
>> 
>> Thanks in advance
>> --
>> View this message in context:
>> http://old.nabble.com/Mutithreaded-parsing-tp26941947p26941947.html
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
> 

-- 
View this message in context: 
http://old.nabble.com/Mutithreaded-parsing-tp26941947p26947396.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
