[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

Steven Rowe (JIRA) Mon, 10 May 2010 09:36:41 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865815#action_12865815 ]


Steven Rowe commented on LUCENE-2167:
-------------------------------------

{quote}
bq. Naming will require some thought, though - I don't like EnglishTokenizer or 
EuropeanTokenizer - both seem to exclude valid constituencies.

What valid constituencies do you refer to?
{quote}

Well, we can't call it English/EuropeanTokenizer (maybe 
EnglishAndEuropeanAnalyzer?  seems too long), and calling it either only 
English or only European seems to leave the other out.  Americans, e.g., don't 
consider themselves European, maybe not even linguistically (however incorrect 
that might be).

bq. In general, the acronym, company, and possessive stuff here is very 
English/Euro-specific.

Right, I agree.  I'm just looking for a name that covers the languages of 
interest unambiguously.  WesternTokenizer?  (but "I live east of the Rockies - 
can I use WesternTokenizer?"...)  Maybe EuropeanLanguagesTokenizer?  The 
difficulty as I see it is the messy intersection between political, geographic, 
and linguistic boundaries.

bq. Bugs in JIRA get opened if it doesn't do this stuff right on English, but 
it doesn't even work at all for a lot of languages.  Personally I think it's 
great to rip this stuff out of what should be a "default" language-independent 
tokenizer based on standards (StandardTokenizer), and put it into the 
language-specific package where it belongs. Otherwise we have to worry about 
these sorts of things overriding and screwing up UAX#29 rules for words in real 
languages.

I assume you don't mean to say that English and European languages are not real 
languages :) .

{quote}
bq. What do you think about adding tailorings for Thai, Lao, Myanmar, Chinese, 
and Japanese? (Are there others like these that aren't well served by UAX#29 
without customizations?)

It gets a little tricky: we should be careful about how we interpret what is 
"reasonable" for a language-independent default tokenizer. I think it's "enough" 
to output the best indexing unit that is possible and relatively unambiguous to 
identify. I think this is a shortcut we can make, because we are trying to 
tokenize things for information retrieval, not for other purposes. The approach 
for Lao, Myanmar, Khmer, CJK, etc. in ICUTokenizer is to just output syllables 
as the indexing unit, since words are ambiguous. Thai is based on words, not 
syllables, in ICUTokenizer, which is inconsistent with this, but we get this 
for free, so it's just a laziness thing.
{quote}

I think that StandardTokenizer should contain tailorings for CJK, Thai, Lao, 
Myanmar, and Khmer, then - it should be able to do reasonable things for all 
languages/scripts, to the greatest extent possible.

The English/European tokenizer can then extend StandardTokenizer (conceptually, 
not in the Java sense).
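
To make the idea of a language-independent baseline concrete, here is a 
minimal sketch.  It uses the JDK's java.text.BreakIterator as a stand-in, 
which is an assumption for illustration only: its word-boundary rules are 
similar in spirit to UAX#29 but are not the JFlex grammar in the attached 
patches.  The class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene or of the patch.
class WordBreakSketch {

    // Splits text at the JDK's word boundaries and keeps only the segments
    // that contain at least one letter or digit (drops spaces, punctuation).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Locale.ROOT: no language-specific tailoring is requested.
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
             start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Lucene in action"));
    }
}
```

An English/European tokenizer would then layer its acronym, company, and 
possessive handling on top of a baseline like this, rather than baking those 
rules into the default.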

{quote}
bq. I'm thinking of leaving UAX29Tokenizer as-is, and adding tailorings as 
separate classes - what do you think?

Well, either way I again strongly feel this logic should be tied into 
"Standard" tokenizer, so that it has better Unicode behavior. I think it makes 
sense for us to have a reasonable, language-independent, standards-based 
tokenizer that works well for most languages. I think it also makes sense to 
have English/Euro-centric stuff that's language-specific, sitting in the 
analysis.en package just like we do with other languages.
{quote}

I agree that stuff like giving "O'Reilly's" the <APOSTROPHE> type, to enable 
so-called StandardFilter to strip out the trailing /'s/, is stupid for all 
non-English languages.
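
For anyone unfamiliar with the behavior in question, the possessive handling 
amounts to roughly the following.  This is a simplified, self-contained 
sketch, not the actual StandardFilter source: the class and method names are 
hypothetical, and the token type is passed as a plain string instead of 
through Lucene's attribute API.

```java
// Hypothetical illustration class; not Lucene code.
class PossessiveSketch {

    // Token type label that the tokenizer assigns to words with apostrophes.
    static final String APOSTROPHE_TYPE = "<APOSTROPHE>";

    // Removes a trailing 's or 'S, but only from tokens typed <APOSTROPHE>.
    // Every other token is returned unchanged.
    static String stripPossessive(String term, String type) {
        int len = term.length();
        if (APOSTROPHE_TYPE.equals(type)
                && len >= 2
                && term.charAt(len - 2) == '\''
                && (term.charAt(len - 1) == 's' || term.charAt(len - 1) == 'S')) {
            return term.substring(0, len - 2);
        }
        return term;
    }

    public static void main(String[] args) {
        System.out.println(stripPossessive("O'Reilly's", APOSTROPHE_TYPE));
    }
}
```

The rule only makes sense where a trailing /'s/ marks an English possessive; 
a French token such as "l'avion" gets the same type but gains nothing from it.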

It might be confusing, though, for a (e.g.) Greek user to have to go look at 
the analysis.en package to get reasonable performance for her language.

Maybe an EnglishTokenizer, and separately a EuropeanAnalyzer?  Is that what 
you've been driving at all along??? (Silly me....  Sigh.)

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the European 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
