[jira] Commented: (LUCENE-2400) ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute

Steven Rowe (JIRA) Sun, 18 Apr 2010 12:04:15 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858326#action_12858326 ]


Steven Rowe commented on LUCENE-2400:
-------------------------------------

I tried adding specialized versions of 
CharTermAttribute.append(StringBuilder,...):


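{code:java}
// Convenience overload: append the builder's entire contents.
public CharTermAttribute append(StringBuilder builder) {
  return append(builder, 0, builder.length());
}

// Append builder[start, end) to the term with a single bulk copy.
public CharTermAttribute append(StringBuilder builder, int start, int end) {
  int newTermLength = termLength + end - start;
  resizeBuffer(newTermLength);  // ensure capacity for the appended chars
  builder.getChars(start, end, termBuffer, termLength);  // copy after the current term
  termLength = newTermLength;
  return this;
}
{code}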
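
For anyone who wants to try the bulk-copy approach outside of Lucene, here is a minimal standalone sketch. SketchTermBuffer is a hypothetical stand-in, not the real CharTermAttribute implementation; the field names mirror the snippet above, and the growth policy in resizeBuffer is an assumption made only to keep the example self-contained.

```java
import java.util.Arrays;

// Hypothetical stand-in for a term attribute; NOT Lucene's actual class.
final class SketchTermBuffer {
  private char[] termBuffer = new char[16];
  private int termLength = 0;

  // Grow the backing array (preserving contents) so it holds at least newSize chars.
  // The doubling policy is an assumption for this sketch.
  private void resizeBuffer(int newSize) {
    if (termBuffer.length < newSize) {
      termBuffer = Arrays.copyOf(termBuffer, Math.max(newSize, termBuffer.length * 2));
    }
  }

  // Append the builder's entire contents.
  SketchTermBuffer append(StringBuilder builder) {
    return append(builder, 0, builder.length());
  }

  // Append builder[start, end) using StringBuilder.getChars for a bulk copy.
  SketchTermBuffer append(StringBuilder builder, int start, int end) {
    int newTermLength = termLength + end - start;
    resizeBuffer(newTermLength);
    builder.getChars(start, end, termBuffer, termLength);
    termLength = newTermLength;
    return this;
  }

  @Override
  public String toString() {
    return new String(termBuffer, 0, termLength);
  }

  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder("please divide");
    // Append the whole builder, then a sub-range of it.
    System.out.println(new SketchTermBuffer().append(sb).append(sb, 6, 13));
  }
}
```

The only JDK call this relies on is StringBuilder.getChars(srcBegin, srcEnd, dst, dstBegin), which copies the requested range directly into the destination array.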

This helped a little, but the patched version is still slower than the 
fully-spelled-out CharTermAttribute setting code that was previously in place:

JAVA:
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)

OS:
cygwin
WinVistaService Pack 2
Service Pack 26060022202561

||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement||
|2|no|3.08s|3.26s|2.11s|-15.6%|
|2|yes|3.26s|3.41s|2.11s|-11.4%|
|4|no|4.05s|4.49s|2.11s|-18.4%|
|4|yes|4.17s|4.64s|2.11s|-18.5%|
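
The comment does not say how the Improvement column is computed. The figures appear consistent, to within rounding of the timings, with comparing the unpatched and patched times after subtracting the StandardAnalyzer time from both. The sketch below shows that arithmetic; the formula is an inference from the numbers in the table, not something stated in the comment, and the class and method names are hypothetical.

```java
// Hypothetical helper; the formula is an assumption inferred from the table.
final class ImprovementSketch {
  // Percent change in ShingleFilter-only time, with the tokenization
  // baseline (StandardAnalyzer) subtracted from both measurements.
  static double improvementPercent(double unpatched, double patched, double baseline) {
    return ((unpatched - baseline) / (patched - baseline) - 1.0) * 100.0;
  }

  public static void main(String[] args) {
    // Rows from the table: unpatched, patched, StandardAnalyzer (seconds).
    double[][] rows = {
      {3.08, 3.26, 2.11},
      {3.26, 3.41, 2.11},
      {4.05, 4.49, 2.11},
      {4.17, 4.64, 2.11},
    };
    for (double[] r : rows) {
      System.out.printf("%.1f%%%n", improvementPercent(r[0], r[1], r[2]));
    }
  }
}
```

Because the published timings are rounded to two decimals, the recomputed percentages can differ from the table by a tenth of a point or so.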


> ShingleFilter: don't output all-filler shingles/unigrams; also, convert from 
> TermAttribute to CharTermAttribute
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2400
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2400
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 3.0.1
>            Reporter: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2400.patch, LUCENE-2400.patch, LUCENE-2400.patch
>
>
> When the input token stream to ShingleFilter has position increments greater 
> than one, filler tokens are inserted for each position for which there is no 
> token in the input token stream.  As a result, unigrams (if configured) and 
> shingles can be filler-only.  Filler-only output tokens make no sense - these 
> should be removed.
> Also, because TermAttribute has been deprecated in favor of 
> CharTermAttribute, the patch will also convert TermAttribute usages to 
> CharTermAttribute in ShingleFilter.

