Because parsing the outlinks of each fetched URL with a single thread is too slow when I am also detecting the language of each outlink's content, I would like to share the load between multiple threads, as Fetcher does...
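
To make the idea concrete, here is a minimal sketch of what I mean by sharing the load, using a plain java.util.concurrent thread pool. It is not Nutch code: the class OutlinkLanguageFilter, the method filterByLanguage and the detector function are hypothetical names, and the detector stands in for the "download, parse with Jericho, detect with Lingpipe" step, which is assumed to be thread-safe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper, not part of Nutch: checks outlink languages in parallel. */
final class OutlinkLanguageFilter {

    /**
     * Returns the outlinks whose detected language equals wantedLang, in input order.
     * The detector is an assumed, thread-safe stand-in for fetching the outlink
     * and running language detection on its content.
     */
    static List<String> filterByLanguage(List<String> outlinks,
                                         Function<String, String> detector,
                                         String wantedLang,
                                         int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, threads));
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : outlinks) {
                tasks.add(() -> detector.apply(url));
            }
            // invokeAll blocks until all tasks finish; futures come back in task order.
            List<Future<String>> results = pool.invokeAll(tasks);
            List<String> kept = new ArrayList<>();
            for (int i = 0; i < results.size(); i++) {
                if (wantedLang.equals(results.get(i).get())) {
                    kept.add(outlinks.get(i));
                }
            }
            return kept;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while detecting languages", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("language detection failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Toy detector for demonstration only: guesses the language from the host suffix.
        Function<String, String> toyDetector = url -> url.contains(".es/") ? "es" : "en";
        List<String> outlinks = Arrays.asList(
                "http://example.es/a", "http://example.com/b", "http://example.es/c");
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println(filterByLanguage(outlinks, toyDetector, "es", threads));
    }
}
```

Sizing the pool to the number of cores follows the "3-4 threads at most" argument for CPU-bound work; since my detector also downloads each outlink, a somewhat larger pool might pay off, but that is a guess I have not measured.
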

Funtick wrote:
> 
> Not sure about multithreading:
> - Parsing is CPU-bound
> - On a 4-core machine we need 3-4 threads at most
> - Map/Reduce can be configured with 3-4 Reducers and use 3-4 cores
> 
> Why multithreading?
> 
> (with Map/Reduce on Hadoop, multithreading is a necessity for fetching pages
> from the Internet, Fetcher only...)
> 
> 
> Fuad Efendi
> +1 416-993-2060
> http://www.linkedin.com/in/liferay
> 
> Tokenizer Inc.
> http://www.tokenizer.ca/
> Data Mining, Vertical Search
> 
>> -----Original Message-----
>> From: Santiago Pérez [mailto:elara...@gmail.com]
>> Sent: December-28-09 6:09 AM
>> To: nutch-dev@lucene.apache.org
>> Subject: Mutithreaded parsing
>> 
>> 
>> Hej,
>> 
>> I am developing a modification to Nutch that accepts only outlinks to
>> Spanish URLs. I have implemented downloading and parsing the content of
>> each outlink (in ParseOutFormat) with Jericho, and detecting the language
>> with Lingpipe.
>> 
>> This process seems too heavy, especially because it is done by only one
>> thread, so I would appreciate any ideas on:
>> 
>> Is there an easier way to detect the language of an outlink?
>> Is there a way to perform multithreaded outlink extraction, as Fetcher
>> does?
>> 
>> Thanks in advance
>> --
>> View this message in context:
>> http://old.nabble.com/Mutithreaded-parsing-tp26941947p26941947.html
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
> 

-- 
View this message in context: http://old.nabble.com/Mutithreaded-parsing-tp26941947p26947396.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.