[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

Steven Rowe (JIRA) Sat, 12 Jun 2010 09:17:37 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878282#action_12878282
 ]


Steven Rowe commented on LUCENE-2167:
-------------------------------------

bq. by the way Steven, one alternative idea i had before for this was to have a 
jflex or rbbi-powered charfilter for sentences.

Nice idea - composition becomes simpler.

{quote}
you could provide it with string constants in the ctor to replace sentence 
boundaries, to add position increments just add these to your stopfilter.

the advantage to this would be that you could use it with other tokenizers by 
using this special token (i guess just be careful which one you use!).
{quote}

Why not just insert {{U+2029 PARAGRAPH SEPARATOR (PS)}}?  Then it will also 
trigger word boundaries, and tokenizers that care about appropriately 
responding to it can specialize for just this one, instead of having to also be 
aware of whatever it was that the user specified in the ctor to the charfilter.
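
A minimal sketch of that idea, using only the JDK's {{java.text.BreakIterator}} as a stand-in for a jflex- or rbbi-powered sentence detector. The class name {{SentenceBoundaryMarker}} and the method {{markSentences}} are hypothetical, made up for illustration; this is not a Lucene CharFilter, just the string transformation such a filter might perform:

```java
import java.text.BreakIterator;
import java.util.Locale;

public class SentenceBoundaryMarker {
    // U+2029 PARAGRAPH SEPARATOR: the single, fixed marker suggested above.
    static final char PS = '\u2029';

    // Hypothetical helper: returns a copy of text with PS inserted after
    // every detected sentence except the last one.
    static String markSentences(String text) {
        // Assumption: the JDK sentence iterator stands in for jflex/rbbi rules.
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        StringBuilder out = new StringBuilder(text.length() + 8);
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
             start = end, end = sentences.next()) {
            out.append(text, start, end);
            // Only mark boundaries between sentences, not the end of input.
            if (end < text.length()) {
                out.append(PS);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String marked = markSentences("Hello world. How are you?");
        // Print with the otherwise invisible separator made visible.
        System.out.println(marked.replace(String.valueOf(PS), "<PS>"));
    }
}
```

Because the marker is always the same character, a downstream tokenizer only has to recognize {{U+2029}} and does not need to know about any constant passed to the charfilter. A real charfilter would presumably also have to correct offsets for the inserted characters, which this sketch leaves out.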

bq. sorry to stray off topic a bit with this, but i think its sorta a missing 
piece thats relevant and becomes more important with ComplexContext

I like where this is going - toward a solid general solution.

{quote}
bq. Lots of text I look at includes newlines that don't indicate paragraph 
boundaries.

What is this text? Some manually-wrapped text?
{quote}

Email.  Source code.  TREC collections (I think - don't have any right here 
with me).  And yes, manually generated and wrapped text.  Isn't most text 
manually generated? :)


> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

