What you see with ICU is correct; you have to make sure you handle "Common":
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ScriptIterator.java

It mostly behaves like
http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UScriptRun.html
in how it classifies runs of text, except for the differences noted in the documentation.

On Sat, Aug 12, 2017 at 1:46 PM, Ahmet Arslan <[email protected]> wrote:

> Hi,
>
> I extracted e-mails and URLs from certain TREC collections using
> TestUAX29URLEmailTokenizer combined with TypeTokenFilter.
>
> High-frequency terms reveal that
> * some e-mail addresses start with apostrophes
> * some e-mails or URLs end with a period.
>
> I ran a few tests, and this behaviour occurs only if the entity is the first
> or last term in the text.
> If the entity is in the middle of the text, UAXURLET strips apostrophes and
> dots.
>
> For example, "Contact me at [email protected]. or [email protected]."
> will produce [email protected]. [email protected]
> Notice that the first e-mail has a dot, while the second does not.
>
> Why does UAXURLET behave differently for the first/last token? Could this be a
> bug?
>
> It looks like dots and apostrophes are legal parts of the entities, but with
> this, abbreviations such as W.Va. D-W.Va. v.ye. are recognized as URLs.
>
> I created 8 test cases to get your opinions on this before creating a
> Jira issue.
>
> public void testURLEndingWithDot2() throws IOException {
>   BaseTokenStreamTestCase.assertAnalyzesTo(a,
>       "My Web addresses are www.apache.org. and lucene.apache.org",
>       new String[] {"My", "Web", "addresses", "are", "www.apache.org", "and", "lucene.apache.org"},
>       new String[] {"<ALPHANUM>", "<ALPHANUM>", "<ALPHANUM>", "<ALPHANUM>", "<URL>", "<ALPHANUM>", "<URL>"});
> }
>
> public void testEMailStartingWithApostrophe2() throws IOException {
>   // Two expected tokens, so two expected types.
>   BaseTokenStreamTestCase.assertAnalyzesTo(a,
>       "'[email protected] '[email protected].",
>       new String[] {"[email protected]", "[email protected]"},
>       new String[] {"<EMAIL>", "<EMAIL>"});
> }
>
>
> P.S. I observed a somewhat similar phenomenon with the ICU tokenizer.
> The ICU tokenizer sets the script attribute to Latin for words that consist of
> numbers.
> But if the whole text is composed of words that consist of numbers, the script
> attribute is set to Common.
>
> Thanks,
> Ahmet
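
To make the "Common" point concrete, here is a minimal sketch of that kind of run classification. It is not the Lucene or ICU code: it assumes java.lang.Character.UnicodeScript as a stand-in for ICU's UScript, the class and method names (ScriptRunSketch, scriptRuns, isWeak) are made up for illustration, and it leaves out details such as combining-mark handling. The idea is that Common (and Inherited) characters join whatever run they sit in, and a run only gets a concrete script once a character with one shows up.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

public class ScriptRunSketch {

    /** True for the two "weak" scripts that merge into any neighbouring run. */
    static boolean isWeak(UnicodeScript s) {
        return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
    }

    /** Splits text into script runs and returns each run as "SCRIPT:text". */
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        int index = 0;
        while (index < text.length()) {
            int start = index;
            // Every run starts out as COMMON until a concrete script is seen.
            UnicodeScript runScript = UnicodeScript.COMMON;
            while (index < text.length()) {
                int cp = text.codePointAt(index);
                UnicodeScript sc = UnicodeScript.of(cp);
                // Weak characters never end a run; a different concrete script does.
                boolean same = isWeak(runScript) || isWeak(sc) || runScript == sc;
                if (!same) {
                    break;
                }
                index += Character.charCount(cp);
                // COMMON/INHERITED takes on the script of the surrounding text.
                if (isWeak(runScript) && !isWeak(sc)) {
                    runScript = sc;
                }
            }
            runs.add(runScript + ":" + text.substring(start, index));
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(scriptRuns("room 101")); // [LATIN:room 101]
        System.out.println(scriptRuns("101 202"));  // [COMMON:101 202]
    }
}
```

Digits and spaces are Common on their own, so "101" ends up in a Latin run when Latin letters are nearby, and stays Common when the whole text is numbers. Under these assumptions, that matches the behaviour described in the P.S.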
