Hi,
I extracted e-mails and URLs from certain TREC collections using
UAX29URLEmailTokenizer combined with TypeTokenFilter.
The high-frequency terms reveal that:
* some e-mail addresses start with an apostrophe
* some e-mails or URLs end with a period.
I ran a few tests, and this behaviour occurs only if the entity is the first or
last term in the text. If the entity is in the middle of the text, UAXURLET
strips apostrophes and dots.
For example, "Contact me at [email protected]. or
[email protected]." will produce:
[email protected].
[email protected]
The first email has a dot, while the second does not.
Why does UAXURLET behave differently for the first/last token? Could this be a
bug?
It looks like dots and apostrophes are legal parts of these entities, but as a
result abbreviations such as W.Va., D-W.Va. and v.ye. are recognized as URLs.
I created 8 test cases to get your opinions on this before creating a
Jira issue.
public void testURLEndingWithDot2() throws IOException {
  // a: the analyzer under test (not shown here)
  BaseTokenStreamTestCase.assertAnalyzesTo(a,
      "My Web addresses are www.apache.org. and lucene.apache.org",
      new String[] {"My", "Web", "addresses", "are",
                    "www.apache.org", "and", "lucene.apache.org"},
      new String[] {"<ALPHANUM>", "<ALPHANUM>", "<ALPHANUM>", "<ALPHANUM>",
                    "<URL>", "<ALPHANUM>", "<URL>"});
}
public void testEMailStartingWithApostrophe2() throws IOException {
  // a: the analyzer under test (not shown here)
  BaseTokenStreamTestCase.assertAnalyzesTo(a,
      "'[email protected] '[email protected].",
      new String[] {"[email protected]", "[email protected]"},
      new String[] {"<EMAIL>", "<EMAIL>", "<ALPHANUM>", "<EMAIL>"});
}
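Until the boundary behaviour is settled, the extracted terms could be
normalized after tokenization so that first/last tokens look like the ones
from the middle of the text (no leading apostrophe, no trailing dot). Below is
a minimal sketch of that idea in plain Java, independent of Lucene. The class
name EntityTrimmer, its methods, and the address user@example.com are
hypothetical and used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: post-processes extracted
// e-mail/URL terms so that boundary tokens match mid-text tokens.
final class EntityTrimmer {

    // Removes leading apostrophes and trailing periods from one term.
    static String trimEntity(String term) {
        int start = 0;
        int end = term.length();
        while (start < end && term.charAt(start) == '\'') {
            start++;
        }
        while (end > start && term.charAt(end - 1) == '.') {
            end--;
        }
        return term.substring(start, end);
    }

    // Applies trimEntity to every term and drops terms that become empty.
    static List<String> trimAll(List<String> terms) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            String trimmed = trimEntity(term);
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(trimEntity("www.apache.org."));    // prints www.apache.org
        System.out.println(trimEntity("'user@example.com"));  // prints user@example.com
    }
}
```

This only hides the symptom in the term statistics. It does not explain why
the tokenizer treats the first and last token differently.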
P.S. I observed a somewhat similar phenomenon with the ICU tokenizer. The ICU
tokenizer sets the script attribute to Latin for words that consist of numbers.
But if the whole text is composed of words that consist of numbers, the script
attribute is set to Common.
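One possible explanation, offered as a guess and not checked against the
ICUTokenizer source: in Unicode, digits and spaces belong to the Common
script, and Common characters are usually resolved to the script of the
surrounding text. If the text contains nothing but numbers, there is no other
script to borrow, so Common remains. The sketch below illustrates that rule
with the JDK's Character.UnicodeScript. The class ScriptRunSketch and the
method resolveRunScript are hypothetical, and the logic is simplified: it
returns the first specific script found and ignores script changes within
the text.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical illustration, not ICUTokenizer code.
final class ScriptRunSketch {

    // Returns the first script in the text that is neither COMMON nor
    // INHERITED, or COMMON if the text contains no such character.
    static UnicodeScript resolveRunScript(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            if (script != UnicodeScript.COMMON && script != UnicodeScript.INHERITED) {
                return script;
            }
            i += Character.charCount(cp);
        }
        return UnicodeScript.COMMON;
    }

    public static void main(String[] args) {
        System.out.println(UnicodeScript.of('7'));          // prints COMMON
        System.out.println(resolveRunScript("route 66"));   // prints LATIN
        System.out.println(resolveRunScript("2010 2011"));  // prints COMMON
    }
}
```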
Thanks,
Ahmet