[
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929556#action_12929556
]
M Alexander commented on LUCENE-2745:
-------------------------------------
{quote}
I think that ArabicLetterTokenizer, which is the tokenizer used by
ArabicAnalyzer, is obsolete (as of version 3.1), since StandardTokenizer, which
implements the Unicode word segmentation rules from UAX#29, should be able to
properly tokenize Arabic. StandardTokenizer recognizes email addresses,
hostnames, and URLs, so your concern would be addressed. (See LUCENE-2167,
though, which was just reopened to turn off full URL output.)
You can test this by composing your own analyzer, if you're willing to try
using the as-yet-unreleased branch_3X, from which 3.1 will be cut (hopefully
fairly soon): just copy the ArabicAnalyzer class and swap in StandardTokenizer
for ArabicLetterTokenizer.
{quote}
I tried to test this and failed (miserably). I think I did not manage to apply
the LUCENE-2167 patch correctly in Eclipse. I might just wait for the branch_3X
release to make my life easier. I will then create my own Analyzer for Arabic
text analysis and another one for Farsi text analysis. Both Analyzers will
handle diacritics as well as email addresses, hostnames and so on. I will close
this issue for now (and re-open it in the future if needed).
Quick question: are there any plans to handle Arabic email addresses and
hostnames in the future?
Thanks to both of you for the time taken and I shall wait for the branch
release to solve my issue.
> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> ------------------------------------------------------------------------------
>
> Key: LUCENE-2745
> URL: https://issues.apache.org/jira/browse/LUCENE-2745
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
> Environment: All
> Reporter: M Alexander
>
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on.
> For example,
> adam@hotmail.com
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer could tokenise this to
> [adam@hotmail.com]. The same applies to hostnames and so on.
> Can this be resolved? I hope so.
> Thanks
> MAA
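
The behaviour described in the issue can be illustrated without Lucene. The
sketch below is not Lucene code: TokenizerSketch, letterTokens and
emailAwareTokens are hypothetical names, and the regular expression is a
deliberately simplified assumption, not the grammar StandardTokenizer uses. It
only shows why splitting on runs of letters breaks an address such as
adam@hotmail.com into [adam] [hotmail] [com], while a tokenizer that tries an
email pattern first keeps it as one token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizerSketch {

    // Simplified pattern (an assumption for illustration only): try an
    // email-like token first, otherwise fall back to a run of letters,
    // combining marks and digits.
    private static final Pattern EMAIL_OR_WORD = Pattern.compile(
        "[\\p{L}\\p{N}._%+-]+@[\\p{L}\\p{N}-]+(?:\\.[\\p{L}\\p{N}-]+)+"
        + "|[\\p{L}\\p{M}\\p{N}]+");

    // Letter-based splitting: every maximal run of letters (plus non-spacing
    // marks, so diacritics stay attached) becomes a token; '@' and '.' act
    // as separators.
    public static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean tokenChar = Character.isLetter(cp)
                || Character.getType(cp) == Character.NON_SPACING_MARK;
            if (tokenChar) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Email-aware splitting: each regex match is one token, so a whole
    // address survives intact.
    public static List<String> emailAwareTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = EMAIL_OR_WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "mail adam@hotmail.com";
        System.out.println(letterTokens(input));      // [mail, adam, hotmail, com]
        System.out.println(emailAwareTokens(input));  // [mail, adam@hotmail.com]
    }
}
```

In a real analyzer the tokenizer would be followed by the usual filter chain
(lowercasing, stop words, normalization, stemming); the sketch covers only the
tokenization step that this issue is about.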