About what you see with ICU: that behavior is correct. You have to make
sure you handle "Common":

https://github.com/apache/lucene-solr/blob/master/lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ScriptIterator.java

It mostly behaves like
http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UScriptRun.html
in how it classifies runs of text, except for the differences
noted in the documentation.
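
To illustrate why digits come out as Latin in mixed text but as Common
when the text is all digits, here is a minimal sketch of run resolution.
It is not Lucene's ScriptIterator or ICU's UScriptRun: it uses the JDK's
java.lang.Character.UnicodeScript rather than ICU's UScript, and the class
and method names (ScriptRunSketch, scriptRuns) are made up for this
example. The assumption is only that Common/Inherited characters join the
surrounding run instead of starting their own.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Lucene or ICU implementation.
final class ScriptRunSketch {

    /** Splits text into script runs, each formatted as "SCRIPT:text". */
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        if (text.isEmpty()) {
            return runs;
        }
        // A run stays COMMON until a character with a real script is seen.
        UnicodeScript runScript = UnicodeScript.COMMON;
        int runStart = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            boolean neutral = s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
            if (!neutral) {
                if (runScript == UnicodeScript.COMMON) {
                    // Leading digits, spaces, punctuation adopt this script.
                    runScript = s;
                } else if (s != runScript) {
                    // A different real script closes the current run.
                    runs.add(runScript + ":" + text.substring(runStart, i));
                    runStart = i;
                    runScript = s;
                }
            }
            i += Character.charCount(cp);
        }
        runs.add(runScript + ":" + text.substring(runStart));
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(scriptRuns("abc 123")); // digits join the Latin run
        System.out.println(scriptRuns("123 456")); // nothing to inherit from
    }
}
```

With this sketch, a consumer that filters on script has to treat a COMMON
run as "no script could be determined", not as an error.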

On Sat, Aug 12, 2017 at 1:46 PM, Ahmet Arslan <iori...@yahoo.com.invalid> wrote:
> Hi,
>
> I extracted Emails and URLs from certain TREC collections using
> TestUAX29URLEmailTokenizer combined with TypeTokenFilter.
>
> High Freq. terms reveal that
>  * some e-mail addresses start with apostrophes
>  * some e-mails or URLs end with a period.
>
> I ran a few tests and this behaviour occurs only if the entity is the first
> or last term in the text.
> If the entity is in the middle of the text, UAXURLET strips apostrophes
> and dots.
>
> For example, "Contact me at java-u...@lucene.apache.org. or
> dev@lucene.apache.org."
> will produce java-u...@lucene.apache.org.  dev@lucene.apache.org
> Notice that the first email has a dot, while the second does not.
>
> Why does UAXURLET behave differently for the first/last token? Could this
> be a bug?
>
> It looks like dots and apostrophes are legal parts of these entities, but
> as a result abbreviations such as W.Va., D-W.Va., and v.ye. are recognized
> as URLs.
>
> I created 8 test cases to get your opinions for this one, before creating a
> Jira issue.
>
>   public void testURLEndingWithDot2() throws IOException {
>     BaseTokenStreamTestCase.assertAnalyzesTo(a,
>         "My Web addresses are www.apache.org. and lucene.apache.org",
>         new String[] {"My","Web","addresses","are","www.apache.org","and","lucene.apache.org"},
>         new String[] {"<ALPHANUM>","<ALPHANUM>","<ALPHANUM>","<ALPHANUM>","<URL>","<ALPHANUM>","<URL>"});
>   }
>
>   public void testEMailStartingWithApostrophe2() throws IOException {
>     BaseTokenStreamTestCase.assertAnalyzesTo(a,
>         "'g...@usgs.gov 'cber_i...@a1.cber.fda.gov.",
>         new String[] {"g...@usgs.gov","cber_i...@a1.cber.fda.gov"},
>         new String[] {"<EMAIL>","<EMAIL>","<ALPHANUM>","<EMAIL>"});
>   }
>
>
> P.S. I observed a somewhat similar phenomenon with the ICU tokenizer.
> The ICU tokenizer sets the script attribute to Latin for words that
> consist of numbers.
> But if the whole text is composed of words that consist of numbers, the
> script attribute is set to Common.
>
> Thanks,
> Ahmet
>
