[ https://issues.apache.org/jira/browse/LUCENE-5391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867359#comment-13867359 ]
Steve Rowe edited comment on LUCENE-5391 at 1/10/14 12:52 AM:
--------------------------------------------------------------

I understand why "index.php" is not broken up: the <URL> rule matches "index.ph", but the <ALPHANUM> rule has a longer match, so it wins. Conversely, <ALPHANUM> does not match "index2.php" (likely because the [number][period] sequence is not allowed), so the shorter <URL> match is tokenized.

Another improperly broken-up filename-looking thing: "index-h.php" - the <URL> rule matches "index-h.ph", but the <ALPHANUM> rule doesn't match (likely because of the hyphen).

I think the fix here is to disallow <URL>s when there is no trailing port, path, query or fragment, and the following character is [-A-Za-z0-9] (allowable domain label characters). I'll make a patch.
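To make the longest-match reasoning above concrete, here is a rough Python sketch. It is not the tokenizer's actual JFlex grammar: the regexes are heavily simplified stand-ins, the three-entry TLD list is a toy, and the names `tokenize`, `URL_ORIGINAL`, `URL_PROPOSED` and `ALPHANUM` are made up for illustration. It only assumes that the rule with the longest match at a position wins (ties going to the rule listed first), and that a period joins letter-to-letter or digit-to-digit but not digit-to-letter.

```python
import re

# Toy stand-ins for the real grammar (assumptions, not Lucene's rules).
LABEL = r"[A-Za-z0-9](?:[-A-Za-z0-9]*[A-Za-z0-9])?"
HOST = rf"(?:{LABEL}\.)+(?:com|org|ph)"       # tiny TLD list; "ph" is a real ccTLD
TAIL = r"(?::\d+(?:[/?#]\S*)?|[/?#]\S*)"       # port, path, query or fragment

# Current behaviour as described: a bare host may match anywhere.
URL_ORIGINAL = re.compile(rf"{HOST}{TAIL}?")
# Proposed fix: a bare host must not be followed by a domain-label character.
URL_PROPOSED = re.compile(rf"{HOST}(?:{TAIL}|(?![-A-Za-z0-9]))")

# Simplified word rule: "." joins letter-letter or digit-digit only.
ALPHANUM = re.compile(
    r"[A-Za-z0-9]+"
    r"(?:(?:(?<=[A-Za-z])\.(?=[A-Za-z])|(?<=[0-9])\.(?=[0-9]))[A-Za-z0-9]+)*"
)


def tokenize(text, url_rule):
    """Longest match wins; on a tie the rule listed first (<URL>) wins."""
    rules = [("<URL>", url_rule), ("<ALPHANUM>", ALPHANUM)]
    tokens, pos = [], 0
    while pos < len(text):
        candidates = [
            (name, m) for name, rule in rules if (m := rule.match(text, pos))
        ]
        if not candidates:
            pos += 1  # no rule matches here: skip one character
            continue
        name, m = max(candidates, key=lambda c: c[1].end())
        tokens.append((name, m.group()))
        pos = m.end()
    return tokens


if __name__ == "__main__":
    for text in ("index.php", "index2.php", "index-h.php"):
        print(text, tokenize(text, URL_ORIGINAL), tokenize(text, URL_PROPOSED))
```

Under these toy rules, `URL_ORIGINAL` reproduces the reported split ("index2.ph" followed by "p", and likewise "index-h.ph" followed by "p"), while `URL_PROPOSED` rejects the bare-host match in both cases and leaves the input to the word rule. What the real grammar emits after the patch may differ.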
> uax29urlemailtokenizer - unexpected tokenisation of "index2.php" (and other inputs)
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-5391
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5391
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Chris Geeringh
>
> The uax29urlemailtokenizer tokenises index2.php as:
> <URL> index2.ph
> <ALPHANUM> p
> While it does not do the same for index.php
> Screenshot from analyser: http://postimg.org/image/aj6c98n3b/