Andrzej Bialecki wrote:

Hi,

I've been profiling a Nutch installation, and to my surprise the largest share of throwaway allocations and the most time spent were not in Nutch-specific code or IPC, but in Lucene's ConjunctionScorer.doNext() method. This method operates on a LinkedList, which seems to be a huge bottleneck. Perhaps it would be possible to replace the LinkedList with a table?
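
To illustrate the idea of replacing the LinkedList with a table, here is a minimal sketch of a conjunction over sorted doc IDs that walks a plain array in round-robin order instead of rotating list nodes. DocCursor and ArrayConjunction are hypothetical stand-ins invented for this example; they are not Lucene classes, and this is not the fixed version from JIRA.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a scorer: a cursor over sorted, unique doc IDs. */
final class DocCursor {
    private final int[] docs;
    private int pos = 0;

    DocCursor(int... sortedDocs) { this.docs = sortedDocs; }

    boolean exhausted() { return pos >= docs.length; }

    int doc() { return docs[pos]; }

    /** Advances to the first doc >= target; returns false once exhausted. */
    boolean skipTo(int target) {
        while (pos < docs.length && docs[pos] < target) pos++;
        return pos < docs.length;
    }

    /** Moves to the following doc; returns false once exhausted. */
    boolean next() {
        pos++;
        return pos < docs.length;
    }
}

/** Conjunction over an array of cursors; no per-step list node allocation. */
final class ArrayConjunction {
    static List<Integer> intersect(DocCursor... cursors) {
        List<Integer> hits = new ArrayList<>();
        if (cursors.length == 0) return hits;

        // Start from the largest current doc: nothing smaller can match.
        int target = Integer.MIN_VALUE;
        for (DocCursor c : cursors) {
            if (c.exhausted()) return hits;
            target = Math.max(target, c.doc());
        }

        int agreed = 0; // consecutive cursors currently positioned on target
        int i = 0;      // round-robin index into the array
        while (true) {
            DocCursor c = cursors[i];
            if (!c.skipTo(target)) return hits;
            if (c.doc() == target) {
                agreed++;
                if (agreed >= cursors.length) {
                    hits.add(target);             // every cursor is on target
                    if (!c.next()) return hits;
                    target = c.doc();             // move past the match
                    agreed = 1;
                }
            } else {
                target = c.doc();                 // overshot: raise the target
                agreed = 1;
            }
            i = (i + 1) % cursors.length;
        }
    }
}
```

For example, ArrayConjunction.intersect(new DocCursor(1, 3, 5, 7, 9), new DocCursor(3, 4, 7, 10), new DocCursor(2, 3, 7, 8)) yields the docs 3 and 7. Whether a change of this kind would remove the bottleneck seen in the profile would have to be measured.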

Nutch Summarizer also needlessly re-tokenizes the text over and over again - perhaps it would be better to save already tokenized text in parse_text, instead of the raw plain text? After all, the only use for that text is to index it and then build the summaries.

Please see the profiles here:

   http://www.getopt.org/nutch/profile/index.html


Further input into this: after replacing the ConjunctionScorer with the
fixed version from JIRA, now the bottleneck seems to be ... in
Summarizer, of all things. :-)

I'm loading the DistributedSearch$Server to 100% CPU, and then the split
is as follows:

* 82% NutchBean.getSummary() -> Summarizer.getSummary() -> getTokens()
      -> 65% NutchDocumentTokenizer.next()
* 14% NutchBean.search()
* 2% IPC

which is slightly ridiculous... I think this makes a good case for
storing pre-tokenized text in segments.

Regarding the allocation hot spots, we have the following top entries:

* 19.1% - 22,109 kB - 535,903 alloc.
  org.apache.lucene.index.TermBuffer.toTerm
* 38.8% - 44,998 kB - 937,937 alloc.
  org.apache.nutch.analysis.CommonGrams$Filter.next
  -> 29.6% - 34,380 kB - 717,713 alloc.
     org.apache.nutch.analysis.NutchDocumentTokenizer.next
* 13.8% - 15,989 kB - 12 alloc.
  org.apache.lucene.index.SegmentReader.norms

It seems that Nutch is uselessly re-tokenizing a lot of stuff - at this
stage we shouldn't need any re-tokenization except for the user query...
so I would argue that these parts should be redesigned to store and
retrieve pre-tokenized values.
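
As a rough sketch of what storing and retrieving pre-tokenized values could look like: tokenize once at parse time, write each token's text and character offsets to a byte array, and read them back at summary time without running the tokenizer again. StoredToken and TokenCodec are hypothetical names made up for this example, and the byte layout is an assumption; nothing here reflects an existing Nutch segment format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical token record: term text plus character offsets in the source. */
final class StoredToken {
    final String term;
    final int start;
    final int end;

    StoredToken(String term, int start, int end) {
        this.term = term;
        this.start = start;
        this.end = end;
    }
}

/** Hypothetical codec: token count, then (term, start, end) per token. */
final class TokenCodec {
    /** Serializes tokens produced once, at parse time. */
    static byte[] encode(List<StoredToken> tokens) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(tokens.size());
            for (StoredToken t : tokens) {
                out.writeUTF(t.term);   // writeUTF caps each string at 65535 bytes
                out.writeInt(t.start);
                out.writeInt(t.end);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads tokens back at summary time; no tokenizer is involved. */
    static List<StoredToken> decode(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int count = in.readInt();
            List<StoredToken> tokens = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                String term = in.readUTF();
                int start = in.readInt();
                int end = in.readInt();
                tokens.add(new StoredToken(term, start, end));
            }
            return tokens;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off is larger segments in exchange for less CPU per summary; the profile above suggests that might be worthwhile, but the sketch says nothing about the actual size cost.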

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


