[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930119#action_12930119 ]

DM Smith commented on LUCENE-2747:
----------------------------------

bq. DM, can you elaborate here?

I was a bit trigger-happy with the comment; I should have looked at the code 
rather than the JIRA comments alone. The old StandardAnalyzer took a kitchen-
sink approach to tokenization, trying to do too much with *modern* constructs, 
e.g. URLs, email addresses, acronyms, and so on. It and SimpleAnalyzer would 
produce about the same stream on "old" English and some other texts, but 
StandardAnalyzer was much slower. (I don't remember how much slower, but it 
was obvious.)
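
To illustrate, here is a minimal sketch (assuming the 3.1 analysis API; the 
field name and sample text are made up) that dumps what each analyzer 
produces:

{code:java}
// Dump the tokens each analyzer produces; Version.LUCENE_30 selects the
// old (kitchen-sink) StandardTokenizer grammar on a 3.1 code base.
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDump {
  static void dump(Analyzer a, String text) throws Exception {
    TokenStream ts = a.tokenStream("f", new StringReader(text));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.print("[" + term.toString() + "] ");
    }
    ts.end();
    ts.close();
    System.out.println();
  }

  public static void main(String[] args) throws Exception {
    String text = "Contact jones@example.com about Moby-Dick";
    // the old grammar keeps the email address together as one token
    dump(new StandardAnalyzer(Version.LUCENE_30), text);
    // LowerCaseTokenizer splits on every non-letter: jones / example / com
    dump(new SimpleAnalyzer(Version.LUCENE_30), text);
  }
}
{code}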

Both of these were weak when it came to non-English/non-Western texts. Thus I 
could take the language-specific tokenizers, stop word lists, and stemmers and 
create variations of SimpleAnalyzer that properly handled a particular 
language. (I created my own analyzers because I wanted to make stop words and 
stemming optional.)
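
Roughly along these lines (the class name is hypothetical; StopFilter and 
PorterStemFilter are the stock Lucene classes, and a language-specific 
stemmer would be swapped in per language):

{code:java}
// A SimpleAnalyzer-style chain where the stop list and stemmer are optional.
import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.Version;

public final class OptionalStemAnalyzer extends Analyzer {
  private final Set<?> stopWords;  // null means: no stop filtering
  private final boolean stem;      // false means: no stemming

  public OptionalStemAnalyzer(Set<?> stopWords, boolean stem) {
    this.stopWords = stopWords;
    this.stem = stem;
  }

  @Override
  public TokenStream tokenStream(String field, Reader reader) {
    TokenStream ts = new LowerCaseTokenizer(Version.LUCENE_31, reader);
    if (stopWords != null) {
      ts = new StopFilter(Version.LUCENE_31, ts, stopWords);
    }
    if (stem) {
      ts = new PorterStemFilter(ts); // swap in a language-specific stemmer
    }
    return ts;
  }
}
{code}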

In looking at the code in trunk (I should have done that before making my 
comment), I see that the UAX29Tokenizer grammar is duplicated in 
StandardAnalyzer's jflex, and that ClassicAnalyzer keeps the old jflex 
grammar. Also, the new StandardAnalyzer does a lot less.

If I understand the suggestion of this and the other two issues, 
StandardAnalyzer will no longer handle modern constructs. As I see it, this is 
what SimpleAnalyzer should be: based on UAX#29 and doing little else. Hence my 
confusion: is there a point to having SimpleAnalyzer? Shouldn't UAX29Tokenizer 
be moved to core? (What is core anyway?)

And if I understand where this is going: would there be a way to plug 
ICUTokenizer in as a replacement for UAX29Tokenizer inside StandardTokenizer, 
such that all analyzers using StandardTokenizer would get the alternate 
implementation?
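
Something like the following is what I have in mind (entirely hypothetical; 
none of these names exist in trunk):

{code:java}
// Entirely hypothetical: a swappable factory on StandardTokenizer, so that
// installing an ICUTokenizer-backed factory would change every analyzer
// built on StandardTokenizer. Neither name below exists in trunk.
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;

public interface WordBreakTokenizerFactory {
  Tokenizer create(Reader reader);
}

// Usage idea (hypothetical API):
//   StandardTokenizer.setDefaultFactory(new WordBreakTokenizerFactory() {
//     public Tokenizer create(Reader reader) {
//       return new ICUTokenizer(reader); // from contrib/icu
//     }
//   });
{code}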

> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-2747
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2747
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2747.patch, LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to 
> provide language-neutral tokenization.  Lucene contains several 
> language-specific tokenizers that should be replaced by UAX#29-based 
> StandardTokenizer (deprecated in 3.1 and removed in 4.0).  The 
> language-specific *analyzers*, by contrast, should remain, because they 
> contain language-specific post-tokenization filters.  The language-specific 
> analyzers should switch to StandardTokenizer in 3.1.
>
> Some usages of language-specific tokenizers will need additional work beyond 
> just replacing the tokenizer in the language-specific analyzer.  
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and 
> depends on the fact that this tokenizer breaks tokens on the ZWNJ character 
> (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ 
> is not a word boundary.  Robert Muir has suggested using a char filter 
> converting ZWNJ to spaces prior to StandardTokenizer in the converted 
> PersianAnalyzer.
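
For what it's worth, the char filter Robert suggests could be sketched roughly 
like this (assuming the existing MappingCharFilter/NormalizeCharMap classes; 
the wrapper class name is made up):

{code:java}
// Fold ZWNJ (U+200C) to a space before the reader ever reaches
// StandardTokenizer, so UAX#29 sees a word boundary there.
import java.io.Reader;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.CharStream;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;

public class ZwnjFolding {
  public static CharStream foldZwnj(Reader in) {
    NormalizeCharMap map = new NormalizeCharMap();
    map.add("\u200C", " "); // ZWNJ -> space
    return new MappingCharFilter(map, CharReader.get(in));
  }
}
{code}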

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

