Hi Arslan,

UAX29URLEmailTokenizerImpl.jflex includes ASCIITLD.jflex-macro, which has this at the end:

>   ) "."?   // Accept trailing root (empty) domain

So trailing dots are recognized as part of domains that are included in URLs and email addresses. But maybe they shouldn’t be? (Except maybe in a URL that contains trailing elements: port/path/query/fragment; in that case the trailing dot should definitely be recognized.)

I’m not sure why apostrophe and trailing-dot recognition depends on where they occur. This is not intentional IIRC.

--
Steve
www.lucidworks.com

> On Aug 12, 2017, at 1:46 PM, Ahmet Arslan <[email protected]> wrote:
>
> Hi,
>
> I extracted e-mails and URLs from certain TREC collections using
> TestUAX29URLEmailTokenizer combined with TypeTokenFilter.
>
> High-frequency terms reveal that
> * some e-mail addresses start with apostrophes
> * some e-mails or URLs end with a period.
>
> I ran a few tests, and this behaviour occurs only if the entity is the first
> or last term in the text.
> If the entity is in the middle of the text, UAXURLET strips apostrophes and dots.
>
> For example, "Contact me at [email protected]. or [email protected]."
> will produce [email protected]. [email protected]
> Notice that the first e-mail has a dot, while the second does not.
>
> Why does UAXURLET behave differently for the first/last token? Could this be a bug?
>
> It looks like dots and apostrophes are legal parts of the entities, but with
> this, abbreviations such as W.Va. D-W.Va. v.ye. are recognized as URLs.
>
> I created 8 test cases to get your opinions on this before creating a
> Jira issue.
>
> public void testURLEndingWithDot2() throws IOException {
>   BaseTokenStreamTestCase.assertAnalyzesTo(a,
>       "My Web addresses are www.apache.org. and lucene.apache.org",
>       new String[] {"My", "Web", "addresses", "are", "www.apache.org", "and", "lucene.apache.org"},
>       new String[] {"<ALPHANUM>", "<ALPHANUM>", "<ALPHANUM>", "<ALPHANUM>", "<URL>", "<ALPHANUM>", "<URL>"});
> }
>
> public void testEMailStartingWithApostrophe2() throws IOException {
>   BaseTokenStreamTestCase.assertAnalyzesTo(a,
>       "'[email protected] '[email protected].",
>       new String[] {"[email protected]", "[email protected]"},
>       new String[] {"<EMAIL>", "<EMAIL>", "<ALPHANUM>", "<EMAIL>"});
> }
>
>
> P.S. I observed a somewhat similar phenomenon with the ICU tokenizer.
> The ICU tokenizer sets the script attribute to Latin for words that consist of
> numbers.
> But if the whole text is composed of words that consist of numbers, the script
> attribute is set to Common.
>
> Thanks,
> Ahmet
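
To illustrate the trailing-dot point in this thread, here is a minimal sketch using plain java.util.regex rather than the JFlex grammar. The class name TrailingDotDemo and the HOST pattern are hypothetical simplifications, not Lucene's actual rules. The sketch does not reproduce the first/last-token behaviour reported above. It only shows how an optional trailing root dot lets a sentence-ending period be absorbed into the token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrailingDotDemo {
    // Hypothetical, simplified host pattern: two or more labels joined by dots.
    // This is an assumption for illustration, not the ASCIITLD.jflex-macro rule.
    private static final String HOST = "[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+";

    // Variant that also accepts a trailing root (empty) domain, i.e. a final ".".
    static final Pattern WITH_ROOT_DOT = Pattern.compile(HOST + "\\.?");

    // Variant that never includes a trailing dot in the match.
    static final Pattern WITHOUT_ROOT_DOT = Pattern.compile(HOST);

    // Collect every substring of text matched by the given pattern, in order.
    static List<String> findHosts(Pattern pattern, String text) {
        List<String> hosts = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hosts.add(m.group());
        }
        return hosts;
    }

    public static void main(String[] args) {
        String text = "My Web addresses are www.apache.org. and lucene.apache.org";
        // prints [www.apache.org., lucene.apache.org]
        System.out.println(findHosts(WITH_ROOT_DOT, text));
        // prints [www.apache.org, lucene.apache.org]
        System.out.println(findHosts(WITHOUT_ROOT_DOT, text));
    }
}
```

With the optional dot, the period that ends the clause becomes part of the first host. Without it, both hosts come out clean, but a genuine root-dot form such as a fully qualified name would lose its final dot.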
