[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter
[ https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032556#comment-15032556 ]

Hoss Man commented on LUCENE-6737:
----------------------------------

I think there may be a bug here for some digits ... created new issue LUCENE-6914 in case it's non-trivial to fix and doesn't get resolved before 5.4 is released.

> Add DecimalDigitFilter
> ----------------------
>
>                 Key: LUCENE-6737
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6737
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Robert Muir
>             Fix For: Trunk, 5.4
>
>         Attachments: LUCENE-6737.patch
>
>
> TokenFilter that folds all Unicode digits
> (http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Decimal_Number:])
> to 0-9.
> Historically a lot of the impacted analyzers couldn't even tokenize numbers
> at all, but now they use StandardTokenizer for numbers/alphanum tokens. But
> it's usually the case that you will find e.g. a mix of both ASCII digits and
> "native" digits, and today that makes searching difficult.
> Note this only impacts *decimal* digits, hence the name DecimalDigitFilter.
> So no processing of Chinese numerals or anything crazy like that.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
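
The folding described in the issue can be illustrated with a small, self-contained sketch. This is not the code from LUCENE-6737.patch; it only uses the JDK's `Character` class on a plain `String`, and the class and method names (`DigitFoldingSketch`, `foldDecimalDigits`) are made up for illustration. It assumes that "decimal digit" means Unicode general category Nd, as in the linked UnicodeSet.

```java
final class DigitFoldingSketch {

    /**
     * Returns a copy of the input in which every code point of general
     * category Nd (Decimal_Number) is replaced by its ASCII digit 0-9.
     * All other code points, including non-decimal numerals, are kept.
     */
    static String foldDecimalDigits(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Iterate by code point so supplementary digits are handled too.
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER) {
                // Character.digit gives the decimal value 0-9 of the digit.
                out.append((char) ('0' + Character.digit(cp, 10)));
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Arabic-Indic digits one, two, three mixed with ASCII text and digits.
        System.out.println(foldDecimalDigits("\u0661\u0662\u0663 abc 456"));
    }
}
```

With this sketch, a token written with Arabic-Indic digits and one written with ASCII digits end up as the same string, which is the search behaviour the issue is after. A real TokenFilter would work on the term buffer in place rather than building a new `String`.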
[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter
[ https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032673#comment-15032673 ]

Uwe Schindler commented on LUCENE-6737:
---------------------------------------

Just an idea: we could extend the autogenerated UnicodeData.java with digits extracted from ICU, as is done for UnicodeWhitespaceTokenizer? Just in case the Java data is strange (although I think it is a bug in the filter, as Hoss said).
[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter
[ https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032677#comment-15032677 ]

Uwe Schindler commented on LUCENE-6737:
---------------------------------------

Ignore my last comment: The filter needs more Unicode info than Character#isDigit().
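
One way to read the comment above: `Character#isDigit()` only answers yes or no, while folding also needs each digit's numeric value. The sketch below shows that difference using JDK calls only. The class and helper names (`DigitInfoSketch`, `decimalValue`) are hypothetical, and this is an interpretation of the comment, not a statement of what the filter or UnicodeData.java actually does.

```java
final class DigitInfoSketch {

    /**
     * Returns the decimal value 0-9 of a code point, or -1 if the code
     * point is not a decimal digit according to the JDK's Unicode data.
     */
    static int decimalValue(int codePoint) {
        return Character.isDigit(codePoint) ? Character.digit(codePoint, 10) : -1;
    }

    public static void main(String[] args) {
        int[] samples = {
            '7',      // ASCII digit seven
            0x0667,   // Arabic-Indic digit seven
            0x1D7D5,  // mathematical bold digit seven (supplementary plane)
            0x4E03    // CJK ideograph for seven, not a decimal digit
        };
        for (int cp : samples) {
            System.out.printf("U+%04X isDigit=%b value=%d%n",
                cp, Character.isDigit(cp), decimalValue(cp));
        }
    }
}
```

A membership set alone, like the whitespace set, would cover the `isDigit` column but not the `value` column.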
[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter
[ https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697014#comment-14697014 ]

ASF subversion and git services commented on LUCENE-6737:
---------------------------------------------------------

Commit 1695908 from [~rcmuir] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1695908 ]

LUCENE-6737: Add DecimalDigitFilter which folds unicode digits to basic latin
[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter
[ https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696702#comment-14696702 ]

Uwe Schindler commented on LUCENE-6737:
---------------------------------------

+1
[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter
[ https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696683#comment-14696683 ]

Adrien Grand commented on LUCENE-6737:
--------------------------------------

+1
[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter
[ https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697719#comment-14697719 ]

Ramkumar Aiyengar commented on LUCENE-6737:
-------------------------------------------

ICU folding does this, right? This patch is still useful even so, in case you don't want the full folding or don't want to use ICU. Just curious, really.
[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter
[ https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698043#comment-14698043 ]

Robert Muir commented on LUCENE-6737:
-------------------------------------

It does, among other dangerous foldings you may not want. Additionally, it can't improve the behaviour of the analyzers for all these languages, because ICU is optional. So this is just a simple filter, like the lowercase filter, to improve the situation.
[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter
[ https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696974#comment-14696974 ]

ASF subversion and git services commented on LUCENE-6737:
---------------------------------------------------------

Commit 1695898 from [~rcmuir] in branch 'dev/trunk'
[ https://svn.apache.org/r1695898 ]

LUCENE-6737: Add DecimalDigitFilter which folds unicode digits to basic latin