[ https://issues.apache.org/jira/browse/LUCENE-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steven Rowe updated LUCENE-2400:
--------------------------------
Attachment: LUCENE-2400.patch
This patch implements the changes described above, along with tests confirming
that all-filler shingles/unigrams are no longer output. A new term attribute
called FillerAttribute is defined to mark whether enqueued terms are filler
terms.
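
To illustrate the behavior, here is a minimal plain-Java sketch. It is not the code in the patch and does not use Lucene's API. The `Tok` class, its `filler` flag, the `"_"` filler text, and the method names are all assumptions made for illustration. It expands position increments greater than one into filler tokens, then builds unigrams and shingles while skipping any that consist only of fillers.

```java
import java.util.ArrayList;
import java.util.List;

final class FillerShingleSketch {
    // Hypothetical stand-in for a token carrying a filler flag (not Lucene's FillerAttribute).
    static final class Tok {
        final String text;
        final boolean filler;
        Tok(String text, boolean filler) { this.text = text; this.filler = filler; }
    }

    // Assumed filler text, chosen for this sketch.
    static final String FILLER = "_";

    // Insert one filler token for each position skipped by a position increment > 1.
    static List<Tok> withFillers(String[] terms, int[] posIncs) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            for (int gap = 1; gap < posIncs[i]; gap++) {
                out.add(new Tok(FILLER, true));
            }
            out.add(new Tok(terms[i], false));
        }
        return out;
    }

    // Emit unigrams (if requested) and shingles up to maxSize, dropping all-filler output.
    static List<String> shingles(List<Tok> toks, int maxSize, boolean unigrams) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < toks.size(); start++) {
            StringBuilder sb = new StringBuilder();
            boolean allFiller = true;
            for (int len = 1; len <= maxSize && start + len <= toks.size(); len++) {
                Tok t = toks.get(start + len - 1);
                if (len > 1) sb.append(' ');
                sb.append(t.text);
                allFiller &= t.filler;
                if (len == 1 && !unigrams) continue; // unigrams not configured
                if (!allFiller) out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "divide" has a position increment of 3, so two positions have no token.
        List<Tok> toks = withFillers(new String[] {"please", "divide"}, new int[] {1, 3});
        System.out.println(shingles(toks, 2, true));
    }
}
```

With a maximum shingle size of 2 and unigrams enabled, this input yields `please`, `please _`, `_ divide`, and `divide`. The filler-only unigram `_` and the filler-only shingle `_ _` are dropped.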
Unfortunately, these changes cause a slowdown of roughly 22%. The
contrib/benchmark numbers for the shingle alg are below (I got similar
numbers with Java 1.5):
JAVA:
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)
OS:
cygwin
WinVista Service Pack 2
Service Pack 26060022202561
||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement||
|2|no|3.04s|3.33s|2.05s|-22.6%|
|2|yes|3.23s|3.49s|2.05s|-18.0%|
|4|no|4.00s|4.56s|2.05s|-22.2%|
|4|yes|4.13s|4.72s|2.05s|-22.0%|
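
The message does not say how the Improvement column is computed. One formula that reproduces the listed values to within about 0.1 percentage point is the change in time relative to the patched time, after subtracting the StandardAnalyzer baseline from both runs. The sketch below is an inference from the table, not a description of the benchmark script. The class and method names are hypothetical. It also prints the slowdown in total time for comparison.

```java
final class ShingleBenchmarkSketch {
    // Assumed formula, inferred from the table: (unpatched - patched) / (patched - baseline).
    static double improvementPercent(double unpatched, double patched, double baseline) {
        return 100.0 * (unpatched - patched) / (patched - baseline);
    }

    // Slowdown measured on total time, without subtracting the baseline.
    static double totalSlowdownPercent(double unpatched, double patched) {
        return 100.0 * (patched - unpatched) / unpatched;
    }

    public static void main(String[] args) {
        // Rows from the table: {unpatched, patched, StandardAnalyzer}, in seconds.
        double[][] rows = {
            {3.04, 3.33, 2.05},
            {3.23, 3.49, 2.05},
            {4.00, 4.56, 2.05},
            {4.13, 4.72, 2.05},
        };
        for (double[] r : rows) {
            System.out.printf("improvement=%.1f%%  total slowdown=%.1f%%%n",
                improvementPercent(r[0], r[1], r[2]),
                totalSlowdownPercent(r[0], r[1]));
        }
    }
}
```

Under this reading, the roughly 22% figure describes the cost attributable to ShingleFilter alone. Measured on total analysis time, the same rows show a slowdown of about 8% to 14%.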
> ShingleFilter: don't output all-filler shingles/unigrams; also, convert from
> TermAttribute to CharTermAttribute
> ---------------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-2400
> URL: https://issues.apache.org/jira/browse/LUCENE-2400
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 3.0.1
> Reporter: Steven Rowe
> Priority: Minor
> Attachments: LUCENE-2400.patch
>
>
> When the input token stream to ShingleFilter has position increments greater
> than one, filler tokens are inserted for each position for which there is no
> token in the input token stream. As a result, unigrams (if configured) and
> shingles can be filler-only. Filler-only output tokens make no sense - these
> should be removed.
> Also, because TermAttribute has been deprecated in favor of
> CharTermAttribute, the patch will also convert TermAttribute usages to
> CharTermAttribute in ShingleFilter.