Sergio Leoni created LUCENE-8044:
------------------------------------

             Summary: UAX_URL_EMAIL tokenizer not compliant with RFC 1808
                 Key: LUCENE-8044
                 URL: https://issues.apache.org/jira/browse/LUCENE-8044
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/other
    Affects Versions: 6.6
         Environment: Elasticsearch 5.5.2, Build: 
b2f0c09/2017-08-14T12:33:14.154Z, JVM: 1.8.0_144

JVM java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

OS Linux 3.10.0-514.10.2.el7.x86_64 #1 SMP Mon Feb 20 02:37:52 EST 2017 x86_64 
x86_64 x86_64 GNU/Linux
            Reporter: Sergio Leoni
            Priority: Minor


I noticed that the uax_url_email tokenizer splits URLs into multiple tokens 
when they contain digits, ".", or "-".

I opened an issue on the Elasticsearch GitHub repo 
(https://github.com/elastic/elasticsearch/issues/27309) to report this 
behaviour.

Their answer was:
{quote}
The uax_url_email tokenizer tokenizes URLs and email addresses, but in order to 
recognize a token as a URL it must include the scheme (usually HTTP:// or 
https://):
Additionally, this tokenizer belongs to Lucene. Could you open this issue at 
https://lucene.apache.org/core/ instead?
{quote}

URLs are defined by RFC 1738, and RFC 1808 extends that definition with 
relative URLs, which may omit the scheme. 
I would expect uax_url_email to also tokenize scheme-less and relative URLs 
correctly.
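
To illustrate what "scheme-less" means in RFC 1808 terms, here is a minimal 
sketch that uses Python's standard urllib.parse module, not Lucene. The sample 
URLs and the classify helper are hypothetical and are not taken from the 
Elasticsearch issue. It only shows how a generic URL parser splits such 
inputs, not what the tokenizer does with them.

```python
from urllib.parse import urlsplit

# Hypothetical sample inputs; none of them come from the original report.
samples = [
    "http://www.example.com/page-1.html",  # absolute URL with a scheme
    "//www.example.com/page-1.html",       # scheme-less, starts with "//"
    "www.example.com/page-1.html",         # no scheme and no leading "//"
]


def classify(text):
    """Return (scheme, netloc, path) as split by urllib.parse.urlsplit."""
    parts = urlsplit(text)
    return parts.scheme, parts.netloc, parts.path


for sample in samples:
    print(sample, "->", classify(sample))
```

In this sketch only the first two inputs yield a host (netloc). The third is 
parsed as a bare relative path, which suggests that recognizing such text as a 
URL needs some heuristic beyond the generic grammar.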






