[
https://issues.apache.org/jira/browse/LUCENE-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858306#action_12858306
]
Robert Muir commented on LUCENE-2400:
-------------------------------------
bq. Unfortunately, these changes cause a roughly 22% slowdown -
contrib/benchmark numbers for the shingle alg (I got similar numbers for Java
1.5):
Steven, I wonder if this is caused by something simple. I noticed this in your
patch:
{noformat}
- shingleBuilder.append(termAtt.termBuffer(), 0, termAtt.termLength());
+ gramBuilder.append(charTermAtt.toString());
{noformat}
I would recommend gramBuilder.append(charTermAtt.buffer(), 0, charTermAtt.length()),
like before. Maybe it's just the extra GC cost of creating useless strings?
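To make the difference concrete, here is a minimal self-contained sketch of the two append styles. It does not use Lucene: the Term class is a hypothetical stand-in that only mimics the buffer()/length()/toString() shape of the attribute, and the method names viaToString and viaBuffer are made up for illustration. Both paths build the same text, but the toString() path allocates a temporary String per token, which is the suspected (not measured) source of the extra GC work.

```java
public class AppendSketch {
    /** Hypothetical stand-in for a term attribute: a reusable char[] plus a valid length. */
    static final class Term {
        private final char[] buf;
        private final int len;

        Term(String s) {
            // Over-allocate so only the first len chars are valid, as with a reused buffer.
            buf = new char[s.length() + 8];
            s.getChars(0, s.length(), buf, 0);
            len = s.length();
        }

        char[] buffer() { return buf; }
        int length() { return len; }

        @Override
        public String toString() {
            // Creates a new String (and copies the chars) on every call.
            return new String(buf, 0, len);
        }
    }

    /** Style used in the patch: one temporary String per token. */
    static String viaToString(Term[] terms) {
        StringBuilder sb = new StringBuilder();
        for (Term t : terms) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(t.toString());
        }
        return sb.toString();
    }

    /** Recommended style: copy the valid chars straight from the buffer. */
    static String viaBuffer(Term[] terms) {
        StringBuilder sb = new StringBuilder();
        for (Term t : terms) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(t.buffer(), 0, t.length());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Term[] terms = { new Term("please"), new Term("divide"), new Term("this") };
        System.out.println(viaBuffer(terms));
        System.out.println(viaToString(terms).equals(viaBuffer(terms)));
    }
}
```

Whether this accounts for the whole slowdown would still need to be confirmed by rerunning the benchmark with the buffer-based append.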
> ShingleFilter: don't output all-filler shingles/unigrams; also, convert from
> TermAttribute to CharTermAttribute
> ---------------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-2400
> URL: https://issues.apache.org/jira/browse/LUCENE-2400
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 3.0.1
> Reporter: Steven Rowe
> Priority: Minor
> Attachments: LUCENE-2400.patch
>
>
> When the input token stream to ShingleFilter has position increments greater
> than one, filler tokens are inserted for each position for which there is no
> token in the input token stream. As a result, unigrams (if configured) and
> shingles can be filler-only. Filler-only output tokens make no sense - these
> should be removed.
> Also, because TermAttribute has been deprecated in favor of
> CharTermAttribute, the patch will also convert TermAttribute usages to
> CharTermAttribute in ShingleFilter.