[ http://issues.apache.org/jira/browse/NUTCH-134?page=all ]
Jerome Charron updated NUTCH-134:
---------------------------------

    Attachment: summarizer.060506.patch

Here is a patch that adds a summarizer extension point and two summarizer plugins: summarizer-basic (the current Nutch implementation) and summarizer-lucene (the Lucene highlighter implementation).
Please note that the Lucene plugin is a very crude implementation: the highlighter directly constructs a text representation of the summary, so we need to parse that text to build a Summary object! (Improvements are welcome.)

This is a first step toward resolving this issue.
If there are no objections, I will commit this patch in the next few days and then:
1. Fix in summarizer-basic the original issue reported by Andrzej
2. Add a toString(Encoder, Formatter) method in Summarizer so that a Summary object can be encoded and formatted with many implementations (it is the same logic as the one in Lucene Highlight)
   - Andrzej, do you prefer this solution or a solution where Summary is Writable?

PS: Chris, sorry, but the major part of this patch was already done when you added your comment.

> Summarizer doesn't select the best snippets
> -------------------------------------------
>
>          Key: NUTCH-134
>          URL: http://issues.apache.org/jira/browse/NUTCH-134
>      Project: Nutch
>         Type: Bug
>   Components: searcher
>     Versions: 0.7.2, 0.7.1, 0.7, 0.8-dev
>     Reporter: Andrzej Bialecki
>  Attachments: summarizer.060506.patch
>
> Summarizer.java tries to select the best fragments from the input text, where
> the frequency of query terms is the highest. However, the logic in line 223
> is flawed in that the excerptSet.add() operation will add new excerpts only
> if they are not already present - the test is performed using the Comparator
> that compares only the numUniqueTokens.
> This means that if there are two or
> more excerpts, which score equally high, only the first of them will be
> retained, and the rest of equally-scoring excerpts will be discarded, in
> favor of other excerpts (possibly lower scoring).
> To fix this the Set should be replaced with a List + a sort operation. To
> keep the relative position of excerpts in the original order the Excerpt
> class should be extended with an "int order" field, and the collected
> excerpts should be sorted in that order prior to adding them to the summary.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
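
For readers following the issue, the behaviour described in the report can be reproduced with a minimal sketch. This is not the Nutch code: the Excerpt class below is a hypothetical stand-in that carries only a score (numUniqueTokens) and the proposed "int order" field, and the method names selectWithSet and selectWithList are invented for illustration. It assumes the original collects excerpts in a sorted Set whose Comparator looks only at numUniqueTokens, as the report states.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

final class ExcerptSelectionSketch {

    /** Hypothetical stand-in for the Excerpt class; not the Nutch implementation. */
    static final class Excerpt {
        final String text;
        final int numUniqueTokens; // score: number of distinct query terms in the fragment
        final int order;           // position in the source text (the proposed "int order" field)

        Excerpt(String text, int numUniqueTokens, int order) {
            this.text = text;
            this.numUniqueTokens = numUniqueTokens;
            this.order = order;
        }
    }

    // Comparator that looks only at the score, as described in the report.
    static final Comparator<Excerpt> BY_SCORE =
        Comparator.comparingInt((Excerpt e) -> e.numUniqueTokens);

    /** Flawed variant: a sorted Set treats equally scoring excerpts as duplicates. */
    static List<Excerpt> selectWithSet(List<Excerpt> excerpts, int max) {
        TreeSet<Excerpt> set = new TreeSet<>(BY_SCORE);
        set.addAll(excerpts); // an excerpt tying with one already present is silently skipped
        List<Excerpt> best = new ArrayList<>();
        for (Excerpt e : set.descendingSet()) { // highest score first
            if (best.size() == max) break;
            best.add(e);
        }
        return best;
    }

    /** Proposed variant: List + sort by score, then restore the original text order. */
    static List<Excerpt> selectWithList(List<Excerpt> excerpts, int max) {
        List<Excerpt> sorted = new ArrayList<>(excerpts);
        sorted.sort(BY_SCORE.reversed()); // stable sort: ties keep their relative order
        List<Excerpt> best =
            new ArrayList<>(sorted.subList(0, Math.min(max, sorted.size())));
        best.sort(Comparator.comparingInt((Excerpt e) -> e.order));
        return best;
    }

    /** Sample input: "b" and "c" tie on the highest score. */
    static List<Excerpt> sample() {
        List<Excerpt> in = new ArrayList<>();
        in.add(new Excerpt("a", 2, 0));
        in.add(new Excerpt("b", 3, 1));
        in.add(new Excerpt("c", 3, 2));
        in.add(new Excerpt("d", 1, 3));
        return in;
    }

    /** Joins the excerpt texts with commas for easy comparison. */
    static String texts(List<Excerpt> excerpts) {
        StringBuilder sb = new StringBuilder();
        for (Excerpt e : excerpts) {
            if (sb.length() > 0) sb.append(',');
            sb.append(e.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(texts(selectWithSet(sample(), 3)));  // prints "b,a,d"
        System.out.println(texts(selectWithList(sample(), 3))); // prints "a,b,c"
    }
}
```

With this sample input the Set-based variant drops "c" (score 3) because it ties with "b", and falls back to "d" (score 1). The List-based variant keeps both top-scoring excerpts and emits the selection in document order.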