Andrzej Bialecki wrote:
Hi,
I've been profiling a Nutch installation, and to my surprise the
largest amount of throwaway allocations and the most time spent was
not in Nutch-specific code, or IPC, but in Lucene's
ConjunctionScorer.doNext() method. This method operates on a
LinkedList, which seems to be a huge bottleneck. Perhaps it would be
possible to replace the LinkedList with an array?
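A minimal sketch of that idea, not the actual Lucene code or the JIRA patch: it intersects sorted doc id lists using a fixed array of iterators, so advancing the conjunction never removes or re-adds list nodes. DocIterator, advance() and intersect() are hypothetical names invented for this illustration, and it assumes the cost in doNext() comes from LinkedList node churn.

```java
import java.util.ArrayList;
import java.util.List;

final class ArrayConjunction {
    /** Hypothetical iterator over sorted doc ids (not Lucene's Scorer API). */
    static final class DocIterator {
        static final int NO_MORE_DOCS = Integer.MAX_VALUE;
        private final int[] docs; // sorted ascending, no duplicates
        private int pos = 0;

        DocIterator(int... docs) { this.docs = docs; }

        /** Moves to the first doc id >= target and returns it, or NO_MORE_DOCS. */
        int advance(int target) {
            while (pos < docs.length && docs[pos] < target) pos++;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
    }

    /** Returns the doc ids present in every iterator; the array is never reordered. */
    static List<Integer> intersect(DocIterator[] its) {
        List<Integer> hits = new ArrayList<>();
        if (its.length == 0) return hits;
        int candidate = its[0].advance(0);
        while (candidate != DocIterator.NO_MORE_DOCS) {
            boolean allMatch = true;
            for (int i = 1; i < its.length; i++) {
                int d = its[i].advance(candidate);
                if (d != candidate) {
                    // Iterator i skipped past the candidate: restart from its position.
                    candidate = its[0].advance(d);
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                hits.add(candidate);
                candidate = its[0].advance(candidate + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        DocIterator[] its = {
            new DocIterator(1, 3, 5, 7, 9),
            new DocIterator(3, 4, 5, 9, 10),
            new DocIterator(0, 3, 9, 11)
        };
        System.out.println(intersect(its)); // prints [3, 9]
    }
}
```

Indexing into the array replaces the remove-first/add-last pattern of a linked list, so the inner loop allocates nothing per step; only the result list grows.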
Nutch Summarizer also needlessly re-tokenizes the text over and over
again - perhaps it would be better to save already tokenized text in
parse_text, instead of the raw plain text? After all, the only use for
that text is to index it and then build the summaries.
Please see the profiles here:
http://www.getopt.org/nutch/profile/index.html
Further input into this: after replacing the ConjunctionScorer with the
fixed version from JIRA, now the bottleneck seems to be ... in
Summarizer, of all things. :-)
I'm loading the DistributedSearch$Server to 100% CPU, and then the split
is as follows:
* 82% NutchBean.getSummary() -> Summarizer.getSummary() -> getTokens()
  -> 65% NutchDocumentTokenizer.next()
* 14% NutchBean.search()
* 2% IPC
which is slightly ridiculous... I think this makes a good case for
storing pre-tokenized text in segments.
Regarding the allocation hot spots, we have the following top entries:
* 19.1% - 22,109 kB - 535,903 alloc.
  org.apache.lucene.index.TermBuffer.toTerm
* 38.8% - 44,998 kB - 937,937 alloc.
  org.apache.nutch.analysis.CommonGrams$Filter.next
  -> 29.6% - 34,380 kB - 717,713 alloc.
     org.apache.nutch.analysis.NutchDocumentTokenizer.next
* 13.8% - 15,989 kB - 12 alloc.
  org.apache.lucene.index.SegmentReader.norms
It seems that Nutch is uselessly re-tokenizing a lot of stuff - at this
stage we shouldn't need any re-tokenization except for the user query...
so I would argue that these parts should be redesigned to store and
retrieve pre-tokenized values.
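To make the proposal concrete, here is a rough sketch of storing and retrieving pre-tokenized text. It does not use the Nutch segment format or NutchDocumentTokenizer: the Token class, the byte layout and the trivial letter/digit tokenizer are all assumptions made up for this illustration. The point is only that tokens and their offsets are computed once at parse time and read back at summary time without re-running the analyzer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PreTokenizedText {
    /** Hypothetical stored token: term text plus character offsets into the raw text. */
    static final class Token {
        final String term;
        final int start;
        final int end;
        Token(String term, int start, int end) {
            this.term = term; this.start = start; this.end = end;
        }
    }

    /** Stand-in tokenizer (runs of letters/digits, lowercased); run once at parse time. */
    static List<Token> tokenize(String text) {
        List<Token> out = new ArrayList<>();
        int i = 0, n = text.length();
        while (i < n) {
            while (i < n && !Character.isLetterOrDigit(text.charAt(i))) i++;
            int start = i;
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) i++;
            if (i > start) {
                out.add(new Token(text.substring(start, i).toLowerCase(Locale.ROOT), start, i));
            }
        }
        return out;
    }

    /** Serializes tokens to bytes that could be stored alongside the parsed text. */
    static byte[] write(List<Token> tokens) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(tokens.size());
            for (Token t : tokens) {
                out.writeUTF(t.term);
                out.writeInt(t.start);
                out.writeInt(t.end);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads tokens back at summary time; no tokenizer is involved. */
    static List<Token> read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            List<Token> tokens = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                String term = in.readUTF();
                int start = in.readInt();
                int end = in.readInt();
                tokens.add(new Token(term, start, end));
            }
            return tokens;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] stored = write(tokenize("Hello, Nutch world"));
        for (Token t : read(stored)) {
            System.out.println(t.term + " [" + t.start + "," + t.end + ")");
        }
    }
}
```

Keeping the offsets lets a summarizer still cut excerpts out of the raw text; the trade-off is larger stored data in exchange for less CPU per query.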
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com