[
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929400#action_12929400
]
Steven Rowe commented on LUCENE-2745:
-------------------------------------
bq. Hunh????
Okay, I think I get it now.
I searched the whole Lucene project for U+200C (ZWNJ, ZERO WIDTH NON-JOINER),
and the hit I found was TestPersianAnalyzer.
Apparently, Robert, when you said "the whole analyzer" and "this approach" you
meant PersianAnalyzer, rather than ArabicAnalyzer. Sorry for the confusion.
What do you think the approach should be for Persian? Maybe a
StandardTokenizer clone that excludes ZWNJ from the \p{Word_Break:Extend} class
that gets added to every rule? I'll see if there is some way to compose a
PersianTokenizer.jflex from StandardTokenizerImpl.jflex (maybe with the
%include directive), so that we don't end up with code duplication.
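To make the ZWNJ question concrete, here is a rough plain-Java sketch, not JFlex and not Lucene code. It assumes the desired Persian behaviour is to break tokens at ZWNJ, and it compares that with treating ZWNJ as an Extend-like character that stays glued to the preceding token. The names ZwnjSketch, tokenize and zwnjIsExtend are made up for illustration, and the "letters and digits form a token" rule is a big simplification of the real word-break rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy, not part of Lucene: shows the effect of keeping or
// removing ZWNJ (U+200C) from an Extend-like "glue" class.
final class ZwnjSketch {
    static final int ZWNJ = 0x200C;

    // Letters and digits build a token; anything else ends it.
    // zwnjIsExtend == true  : ZWNJ is appended to the token in progress.
    // zwnjIsExtend == false : ZWNJ is an ordinary break character.
    static List<String> tokenize(String text, boolean zwnjIsExtend) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean wordChar = Character.isLetterOrDigit(cp);
            boolean glue = zwnjIsExtend && cp == ZWNJ && current.length() > 0;
            if (wordChar || glue) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Persian word written with a ZWNJ between its two parts.
        String word = "\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645";
        System.out.println("ZWNJ as Extend: " + tokenize(word, true).size() + " token(s)");
        System.out.println("ZWNJ as break:  " + tokenize(word, false).size() + " token(s)");
    }
}
```

In a real grammar the change would live in the JFlex macro definitions rather than in a boolean flag; the flag is only there so both behaviours can be compared side by side.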
> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> ------------------------------------------------------------------------------
>
> Key: LUCENE-2745
> URL: https://issues.apache.org/jira/browse/LUCENE-2745
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
> Environment: All
> Reporter: M Alexander
>
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on.
> For example,
> [email protected]
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer could tokenise this to
> [[email protected]]. The same applies to hostnames and so on.
> Can this be resolved? I hope so.
> Thanks
> MAA