Re: TestUAX29URLEmailTokenizer inconsistent adding dots and apostrophes to URLs and Emails

Steve Rowe Mon, 14 Aug 2017 11:19:49 -0700

Hi Arslan,

UAX29URLEmailTokenizerImpl.jflex includes ASCIITLD.jflex-macro, which has this 
at the end:


> ) "."?   // Accept trailing root (empty) domain

So trailing dots are recognized as part of domains that are included in URLs 
and email addresses.  But maybe they shouldn’t be?  (Except maybe in a URL that 
contains trailing elements: port/path/query/fragment; in that case the trailing 
dot should definitely be recognized.)
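
To illustrate the effect of that optional trailing dot, here is a minimal sketch in plain Java. It is not the JFlex grammar: the two regular expressions, the three-entry TLD list, and the class and method names are all hypothetical simplifications. It only shows how an optional "." after the TLD pulls a sentence-ending period into the match; it does not reproduce the position-dependent behaviour described in the quoted message.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrailingDotDemo {
    // Hypothetical, heavily simplified stand-in for the domain rule:
    // dot-separated labels, a tiny TLD list, then an OPTIONAL trailing dot
    // (the "root" domain), mirroring the `"."?` at the end of the macro.
    private static final Pattern DOMAIN_WITH_ROOT =
        Pattern.compile("(?:[A-Za-z0-9-]+\\.)+(?:org|com|net)\\.?");

    // The same rule without the optional trailing dot.
    private static final Pattern DOMAIN_NO_ROOT =
        Pattern.compile("(?:[A-Za-z0-9-]+\\.)+(?:org|com|net)");

    // Returns the first substring of text matched by p, or null if none.
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    static String withRootDot(String text) {
        return firstMatch(DOMAIN_WITH_ROOT, text);
    }

    static String withoutRootDot(String text) {
        return firstMatch(DOMAIN_NO_ROOT, text);
    }

    public static void main(String[] args) {
        String text = "My Web addresses are www.apache.org. and lucene.apache.org";
        System.out.println(withRootDot(text));    // first domain, period included
        System.out.println(withoutRootDot(text)); // first domain, period excluded
    }
}
```

With the optional dot, the period that ends the clause is absorbed into the domain; without it, the period is left for the surrounding rules to handle as punctuation.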

I’m not sure why apostrophe and trailing-dot recognition depends on where they 
occur.  This is not intentional IIRC.

--
Steve
www.lucidworks.com

> On Aug 12, 2017, at 1:46 PM, Ahmet Arslan <[email protected]> wrote:
> 
> Hi,
> 
> I extracted Emails and URLs from certain TREC collections using 
> TestUAX29URLEmailTokenizer combined with TypeTokenFilter.
> 
> High-frequency terms reveal that
>  * some e-mail addresses start with apostrophes
>  * some e-mails or URLs end with a period.
> 
> I ran a few tests and this behaviour occurs only if the entity is the first 
> or last term in the text.
> If the entity is in the middle of the text, UAXURLET strips apostrophes and dots.
> 
> For example, "Contact me at [email protected]. or 
> [email protected]." 
> will produce [email protected].  [email protected]
> Notice that the first email has a dot, while the second does not.
> 
> Why does UAXURLET behave differently for the first/last token? Could this be a bug?
> 
> It looks like dots and apostrophes are legal parts of the entities, but as a
> result abbreviations such as W.Va., D-W.Va., and v.ye. are recognized as URLs.
> 
> I created 8 test cases to get your opinions for this one, before creating a 
> Jira issue.
> 
>   public void testURLEndingWithDot2() throws IOException {
>     BaseTokenStreamTestCase.assertAnalyzesTo(a,
>         "My Web addresses are www.apache.org. and lucene.apache.org",
>         new String[] {"My","Web","addresses","are","www.apache.org","and","lucene.apache.org"},
>         new String[] {"<ALPHANUM>","<ALPHANUM>","<ALPHANUM>","<ALPHANUM>","<URL>","<ALPHANUM>","<URL>"});
>   }
> 
> public void testEMailStartingWithApostrophe2() throws IOException {
>     BaseTokenStreamTestCase.assertAnalyzesTo(a,
>         "'[email protected] '[email protected].",
>         new String[] {"[email protected]","[email protected]"},
>         new String[] {"<EMAIL>","<EMAIL>","<ALPHANUM>","<EMAIL>"});
>   }
> 
> 
> P.S. I observed a somewhat similar phenomenon with the ICU tokenizer.
> The ICU tokenizer sets the script attribute to Latin for words that consist of
> numbers.
> But if the whole text is composed of words that consist of numbers, the script
> attribute is set to Common.
> 
> Thanks,
> Ahmet
> 

