[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

Steven Rowe (JIRA) Sun, 07 Nov 2010 13:26:31 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929400#action_12929400
 ]


Steven Rowe commented on LUCENE-2745:
-------------------------------------

bq. Hunh????

Okay, I think I get it now.  

I did a search for U+200C in the whole Lucene project, and I found 
TestPersianAnalyzer.

Apparently, Robert, when you said "the whole analyzer" and "this approach" you 
meant PersianAnalyzer, rather than ArabicAnalyzer.  Sorry for the confusion.

What do you think the approach should be for Persian?  Maybe a 
StandardTokenizer clone that excludes ZWNJ from the \p{Word_Break:Extend} class 
that gets added to every rule?  I'll see if there is some way to compose a 
PersianTokenizer.jflex (using the %include directive maybe?) using 
StandardTokenizerImpl.jflex, so that we don't end up with code duplication.
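
To make the ZWNJ question concrete, here is a minimal plain-Java sketch of the two behaviours being contrasted. It is not JFlex and not Lucene code: the class `ZwnjSketch`, the method `tokenize`, and the `zwnjIsExtend` flag are hypothetical names used only for illustration, and "letter or digit" stands in very roughly for the real UAX#29 word-break rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; a real tokenizer would be generated from a JFlex grammar.
final class ZwnjSketch {
    static final char ZWNJ = '\u200C'; // ZERO WIDTH NON-JOINER

    // Splits text into runs of letters/digits (BMP characters only, for brevity).
    // zwnjIsExtend = true : ZWNJ is absorbed into the token it follows,
    //                       roughly how an Extend-class character behaves.
    // zwnjIsExtend = false: ZWNJ ends the current token like any other separator.
    static List<String> tokenize(String text, boolean zwnjIsExtend) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean inToken = Character.isLetterOrDigit(c)
                    || (zwnjIsExtend && c == ZWNJ && current.length() > 0);
            if (inToken) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A Persian verb form written with ZWNJ between prefix and stem.
        String text = "\u0645\u06CC\u200C\u062E\u0648\u0631\u062F";
        System.out.println(tokenize(text, true).size());  // prints 1
        System.out.println(tokenize(text, false).size()); // prints 2
    }
}
```

With ZWNJ treated as Extend the word stays a single token. With ZWNJ excluded from that class, it splits at the joiner, which is the behaviour a Persian-specific tokenizer would presumably want to keep.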

> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-2745
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2745
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
>         Environment: All
>            Reporter: M Alexander
>
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on. 
> For example,
> adam@hotmail.com
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer could tokenise this to 
> [adam@hotmail.com]. The same applies to hostnames and so on.
> Can this be resolved? I hope so
> Thanks
> MAA
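
The behaviour requested above can be sketched in plain Java as follows. This is an assumption-laden illustration, not how ArabicAnalyzer or StandardTokenizer is implemented: the class `EmailTokenSketch`, its methods, the deliberately loose email pattern, and the address `user@example.com` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not Lucene code.
final class EmailTokenSketch {
    // Plain runs of letters/digits: this splits an address at '@' and '.'.
    private static final Pattern LETTERS = Pattern.compile("[\\p{L}\\p{N}]+");

    // Email-like run first, otherwise a plain letter/digit run.
    // The email branch is deliberately loose and is tried first at each position.
    private static final Pattern EMAIL_AWARE = Pattern.compile(
            "[\\p{L}\\p{N}._%+-]+@[\\p{L}\\p{N}-]+(?:\\.[\\p{L}\\p{N}-]+)+"
            + "|[\\p{L}\\p{N}]+");

    private static List<String> findAll(Pattern pattern, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Letter-based splitting, similar in effect to what the issue reports.
    static List<String> letterTokenize(String text) {
        return findAll(LETTERS, text);
    }

    // Keeps email-like strings together as one token.
    static List<String> tokenize(String text) {
        return findAll(EMAIL_AWARE, text);
    }

    public static void main(String[] args) {
        System.out.println(letterTokenize("user@example.com")); // prints [user, example, com]
        System.out.println(tokenize("user@example.com"));       // prints [user@example.com]
    }
}
```

A grammar-based tokenizer would express the same idea as an extra rule that takes priority over the plain word rule, rather than as a regular-expression pre-pass.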



