Not sure about multithreading:
- Parsing is CPU-bound
- On a 4-core machine we need 3-4 threads at most
- Map/Reduce can be configured with 3-4 Reducers and use 3-4 cores
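
A minimal sketch of the sizing point above, in plain Java rather than Nutch code: for CPU-bound work, a fixed pool with about one thread per core is enough, and extra threads add nothing. The class name `ParsePool` and the methods `workerCount` and `mapInParallel` are hypothetical, made up for illustration; they are not part of Nutch or Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not a Nutch class.
class ParsePool {

    // Assumption: for CPU-bound work, one worker per available core is the upper bound worth using.
    static int workerCount() {
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    // Applies `task` to every input on a fixed-size pool and returns the results in input order.
    static <I, O> List<O> mapInParallel(List<I> inputs, Function<I, O> task) {
        ExecutorService pool = Executors.newFixedThreadPool(workerCount());
        try {
            List<Future<O>> futures = new ArrayList<>();
            for (I input : inputs) {
                Callable<O> job = () -> task.apply(input);
                futures.add(pool.submit(job));
            }
            List<O> results = new ArrayList<>();
            for (Future<O> future : futures) {
                results.add(future.get()); // blocks until that job has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for jobs", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a job failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, `ParsePool.mapInParallel(urls, url -> detectLanguage(url))` would run a (hypothetical) `detectLanguage` function over a list of outlinks without ever using more threads than cores.
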

Why multithreading? (With Map/Reduce on Hadoop, multithreading is a necessity for fetching pages from the Internet, in the Fetcher only...)

Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay

Tokenizer Inc.
http://www.tokenizer.ca/
Data Mining, Vertical Search

> -----Original Message-----
> From: Santiago Pérez [mailto:elara...@gmail.com]
> Sent: December-28-09 6:09 AM
> To: nutch-dev@lucene.apache.org
> Subject: Mutithreaded parsing
>
> Hej,
>
> I am developing a modification in Nutch that accepts only outlinks to
> Spanish URLs. I have implemented downloading and parsing the content of
> each outlink (in ParseOutFormat) with Jericho, and detecting the language
> with Lingpipe.
>
> This process seems too heavy, especially because it is done by only one
> thread, so I would appreciate any ideas on:
>
> Is there an easier way to detect the language of an outlink?
> Is there a way to perform multithreaded outlink extraction, as the fetcher does?
>
> Thanks in advance
> --
> View this message in context: http://old.nabble.com/Mutithreaded-parsing-tp26941947p26941947.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.