[ 
https://issues.apache.org/jira/browse/LUCENE-8278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492160#comment-16492160
 ] 

Steve Rowe commented on LUCENE-8278:
------------------------------------

I've attached a fully-regenerated patch (which is why it's so big...) against 
the master branch for a fix I cooked up.  In this change, the TLD macro 
generator partitions TLDs by whether they are prefixes of other TLDs, and by 
suffix length, and then the grammar tries the longest TLDs first, falling back 
one suffix char at a time.  Currently there are only 3 buckets: 

# None of the TLDs is a 1-character-shorter prefix of another TLD
# Each TLD is a prefix of another TLD that is 1 character longer
# Each TLD is a prefix of another TLD that is 2 characters longer

The TLD macro generator does not hard code the number of buckets, so it should 
be able to handle future TLD prefixes with suffixes of more than 2 characters. 
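
The partitioning described above can be sketched roughly as follows. This is not the code from the attached patch: the class name {{TldBuckets}}, the method {{partition}}, and the short sample TLD list are hypothetical, and the sketch assumes that a TLD which is a prefix of several longer TLDs is bucketed by its longest suffix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only; not the generator class from the patch.
final class TldBuckets {

  /**
   * Returns buckets indexed by suffix length: bucket 0 holds TLDs that are
   * not a prefix of any other TLD, bucket n holds TLDs that are a prefix of
   * another TLD that is n characters longer. The number of buckets is not
   * fixed; it grows to the longest suffix found in the input.
   */
  static List<SortedSet<String>> partition(SortedSet<String> tlds) {
    List<SortedSet<String>> buckets = new ArrayList<>();
    for (String tld : tlds) {
      int suffixLength = 0;
      for (String other : tlds) {
        if (other.length() > tld.length() && other.startsWith(tld)) {
          // Assumption: keep the longest suffix when several TLDs extend this one.
          suffixLength = Math.max(suffixLength, other.length() - tld.length());
        }
      }
      while (buckets.size() <= suffixLength) {
        buckets.add(new TreeSet<>());
      }
      buckets.get(suffixLength).add(tld);
    }
    return buckets;
  }

  public static void main(String[] args) {
    // Small sample, not the full TLD list.
    SortedSet<String> tlds = new TreeSet<>(Arrays.asList(
        "ar", "arpa", "ch", "co", "com", "ne", "net", "nl", "uk"));
    List<SortedSet<String>> buckets = partition(tlds);
    for (int i = 0; i < buckets.size(); i++) {
      System.out.println(i + "-char suffix: " + buckets.get(i));
    }
    // prints:
    // 0-char suffix: [arpa, ch, com, net, nl, uk]
    // 1-char suffix: [co, ne]
    // 2-char suffix: [ar]
  }
}
```

With buckets like these, a grammar can attempt the TLDs in bucket 0 first and only then fall back to the shorter prefixes, one suffix character at a time.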

I've added a test for {{example.TLD}} URLs at end-of-input for all TLDs, and it 
passes, as do all other tests in the analyzers-common module.

FYI, the fix here was complicated by the fact that JFlex doesn't support an 
end-of-input assertion (like Java's {{\z}}) as part of a lexical rule: the 
{{<<EOF>>}} rule can't be combined with a regex, and zero-length lookahead 
assertions must match at least one character.
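
For comparison, here is a minimal sketch of what Java's {{\z}} does in {{java.util.regex}}. The class {{EndOfInputDemo}} and its deliberately simplified pattern are hypothetical and are not part of the tokenizer grammar; they only illustrate the kind of end-of-input assertion that a JFlex lexical rule cannot express.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not related to the JFlex grammar itself.
final class EndOfInputDemo {

  // Simplified pattern: "example." followed by one of two TLDs, then \z,
  // which succeeds only at the very end of the input.
  static final Pattern AT_END = Pattern.compile("example\\.(com|co)\\z");

  /** Returns the TLD matched at end-of-input, or null if there is none. */
  static String tldAtEnd(String input) {
    Matcher m = AT_END.matcher(input);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    System.out.println(tldAtEnd("example.com"));      // prints "com"
    System.out.println(tldAtEnd("example.co"));       // prints "co"
    System.out.println(tldAtEnd("example.com/path")); // prints "null"
  }
}
```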

[~drjz], can you test this in your context?

> UAX29URLEmailTokenizer is not detecting some tokens as URL type
> ---------------------------------------------------------------
>
>                 Key: LUCENE-8278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8278
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Junte Zhang
>            Priority: Minor
>         Attachments: LUCENE-8278.patch
>
>
> We are using the UAX29URLEmailTokenizer so we can use the token types in our 
> plugins.
> However, I noticed that the tokenizer is not detecting certain URLs as <URL> 
> but <ALPHANUM> instead.
> Examples that are not working:
>  * example.com is <ALPHANUM>
>  * example.net is <ALPHANUM>
> But:
>  * https://example.com is <URL>
>  * as is https://example.net
> Examples that work:
>  * example.ch is <URL>
>  * example.co.uk is <URL>
>  * example.nl is <URL>
> I have checked this JIRA, and could not find an issue. I have tested this on 
> Lucene (Solr) 6.4.1 and 7.3.
> Could someone confirm my findings and advise what I could do to (help) 
> resolve this issue?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)