Re: TestUAX29URLEmailTokenizer inconsistent adding dots and apostrophes to URLs and Emails

Robert Muir Sat, 12 Aug 2017 10:59:32 -0700

About what you see with ICU: that behavior is correct. You have to make
sure you handle the "Common" script. See ScriptIterator:


https://github.com/apache/lucene-solr/blob/master/lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ScriptIterator.java

In how it classifies runs of text, it mostly behaves like
http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UScriptRun.html
except for the differences noted in the ScriptIterator documentation.
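
To illustrate why an all-digit text comes out as Common: digits and spaces
carry the Common script in Unicode, so a run only takes a concrete script
once a character that has one shows up. The following is a minimal sketch of
that idea, not the actual ScriptIterator code. It uses the JDK's
Character.UnicodeScript instead of ICU's UScript, and the class and method
names (ScriptRunSketch, runs) are made up for this example.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splits text into script runs, letting COMMON and
// INHERITED characters join whatever run surrounds them.
class ScriptRunSketch {

    // COMMON (digits, spaces, punctuation) and INHERITED (combining marks)
    // never start a new run on their own.
    static boolean isNeutral(UnicodeScript s) {
        return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
    }

    // Returns each run as "SCRIPT:text".
    static List<String> runs(String text) {
        List<String> out = new ArrayList<>();
        int start = 0;
        UnicodeScript runScript = UnicodeScript.COMMON;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            if (!isNeutral(s)) {
                if (isNeutral(runScript)) {
                    // The first concrete script names the whole run so far.
                    runScript = s;
                } else if (s != runScript) {
                    // A different concrete script closes the current run.
                    out.add(runScript + ":" + text.substring(start, i));
                    start = i;
                    runScript = s;
                }
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            out.add(runScript + ":" + text.substring(start));
        }
        return out;
    }
}
```

Under these assumptions, runs("abc 123") yields a single LATIN run that
includes the digits, while runs("123 456") yields a single COMMON run,
because no character in it ever supplies a concrete script.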

On Sat, Aug 12, 2017 at 1:46 PM, Ahmet Arslan <[email protected]> wrote:
> Hi,
>
> I extracted Emails and URLs from certain TREC collections using
> TestUAX29URLEmailTokenizer combined with TypeTokenFilter.
>
> The high-frequency terms reveal that
>  * some e-mail addresses start with apostrophes
>  * some e-mail addresses or URLs end with a period.
>
> I ran a few tests and this behaviour occurs only if the entity is the first
> or last term in the text.
> If the entity is in the middle of the text, UAX29URLEmailTokenizer
> (UAXURLET) strips apostrophes and dots.
>
> For example, "Contact me at [email protected]. or
> [email protected]."
> will produce [email protected].  [email protected]
> Notice that the first email keeps its dot, while the second does not.
>
> Why does UAXURLET behave differently for the first/last token? Could this
> be a bug?
>
> It looks like dots and apostrophes are legal parts of these entities, but
> as a result abbreviations such as W.Va., D-W.Va., and v.ye. are recognized
> as URLs.
>
> I created 8 test cases to get your opinions for this one, before creating a
> Jira issue.
>
>   public void testURLEndingWithDot2() throws IOException {
>     BaseTokenStreamTestCase.assertAnalyzesTo(a,
>         "My Web addresses are www.apache.org. and lucene.apache.org",
>         new String[] {"My", "Web", "addresses", "are",
>                       "www.apache.org", "and", "lucene.apache.org"},
>         new String[] {"<ALPHANUM>", "<ALPHANUM>", "<ALPHANUM>", "<ALPHANUM>",
>                       "<URL>", "<ALPHANUM>", "<URL>"});
>   }
>
>   public void testEMailStartingWithApostrophe2() throws IOException {
>     BaseTokenStreamTestCase.assertAnalyzesTo(a,
>         "'[email protected] '[email protected].",
>         new String[] {"[email protected]", "[email protected]"},
>         // one type per expected token
>         new String[] {"<EMAIL>", "<EMAIL>"});
>   }
>
>
> P.S. I observed a somewhat similar phenomenon with the ICU tokenizer.
> The ICU tokenizer sets the script attribute to Latin for words that
> consist of numbers.
> But if the whole text is composed of words that consist of numbers, the
> script attribute is set to Common.
>
> Thanks,
> Ahmet
>

