[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter

2015-11-30 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032556#comment-15032556
 ] 

Hoss Man commented on LUCENE-6737:
--

I think there may be a bug here for some digits. I created a new issue, 
LUCENE-6914, in case it's non-trivial to fix and doesn't get resolved before 
5.4 is released.

> Add DecimalDigitFilter
> --
>
> Key: LUCENE-6737
> URL: https://issues.apache.org/jira/browse/LUCENE-6737
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Robert Muir
> Fix For: Trunk, 5.4
>
> Attachments: LUCENE-6737.patch
>
>
> TokenFilter that folds all Unicode digits 
> (http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Decimal_Number:])
> to 0-9.
> Historically, a lot of the impacted analyzers couldn't even tokenize numbers 
> at all, but now they use StandardTokenizer for numbers/alphanum tokens. But 
> it's usually the case that you will find e.g. a mix of both ASCII digits and 
> "native" digits, and today that makes searching difficult.
> Note this only impacts *decimal* digits, hence the name DecimalDigitFilter. 
> So no processing of Chinese numerals or anything crazy like that.
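
The folding described above can be illustrated with a minimal Java sketch. 
This is not the code from LUCENE-6737.patch: it works on a plain String rather 
than a token stream, the class and method names are hypothetical, and it 
assumes the JDK's own Unicode tables (Character.getType and Character.digit) 
are an acceptable stand-in for the Decimal_Number set linked in the 
description.

```java
final class DecimalDigitFoldingSketch {

    // Hypothetical helper, not the Lucene filter: folds every non-ASCII code
    // point in General_Category=Decimal_Number (Nd) to the ASCII digit '0'..'9'.
    static String foldDigits(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);      // reads a surrogate pair as one code point
            i += Character.charCount(cp);
            if (cp > 0x7F && Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER) {
                int value = Character.digit(cp, 10);   // 0..9, or -1 if the JDK has no value
                if (value >= 0) {
                    out.append((char) ('0' + value));
                    continue;
                }
            }
            out.appendCodePoint(cp);            // everything else passes through unchanged
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Arabic-Indic digits three, four, five, followed by ASCII digits.
        System.out.println(foldDigits("\u0663\u0664\u0665 12"));
        // U+4E09 (the CJK numeral "three") is a letter, not a decimal digit,
        // so it is left as-is.
        System.out.println(foldDigits("\u4E09"));
    }
}
```

Checking the general category first is what keeps numerals such as U+4E09 out 
of scope, matching the "only *decimal* digits" note in the description.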



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter

2015-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032673#comment-15032673
 ] 

Uwe Schindler commented on LUCENE-6737:
---

Just an idea: could we extend the autogenerated UnicodeData.java to include 
ICU-extracted digits, as is done for UnicodeWhitespaceTokenizer? That would 
help in case the Java data is strange (although I think it is a bug in the 
filter, as Hoss said).




[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter

2015-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032677#comment-15032677
 ] 

Uwe Schindler commented on LUCENE-6737:
---

Ignore my last comment: The filter needs more Unicode info than 
Character#isDigit().
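
One way to read this comment: a yes/no test such as Character#isDigit() (or an 
autogenerated set of code points) says only whether a code point is a digit, 
while folding to 0-9 also needs the digit's numeric value. A minimal Java 
sketch of that distinction follows. It uses only JDK calls; the class and 
method names are hypothetical, and this is not the filter's code.

```java
final class DigitInfoSketch {

    // Membership only: true for decimal digits, but says nothing about
    // which digit the code point represents.
    static boolean isDecimalDigit(int codePoint) {
        return Character.isDigit(codePoint);
    }

    // The extra information folding needs: the value 0..9, or -1 if the
    // code point is not a valid digit in radix 10.
    static int decimalValue(int codePoint) {
        return Character.digit(codePoint, 10);
    }

    public static void main(String[] args) {
        int tamilNine = 0x0BEF; // TAMIL DIGIT NINE
        System.out.println(isDecimalDigit(tamilNine) + " " + decimalValue(tamilNine));
        System.out.println(isDecimalDigit('x') + " " + decimalValue('x'));
    }
}
```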




[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter

2015-08-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697014#comment-14697014
 ] 

ASF subversion and git services commented on LUCENE-6737:
-

Commit 1695908 from [~rcmuir] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1695908 ]

LUCENE-6737: Add DecimalDigitFilter which folds unicode digits to basic latin




[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter

2015-08-14 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696702#comment-14696702
 ] 

Uwe Schindler commented on LUCENE-6737:
---

+1




[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter

2015-08-14 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696683#comment-14696683
 ] 

Adrien Grand commented on LUCENE-6737:
--

+1




[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter

2015-08-14 Thread Ramkumar Aiyengar (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697719#comment-14697719
 ] 

Ramkumar Aiyengar commented on LUCENE-6737:
---

ICU folding does this, right? Even if so, this patch is still useful in case 
you don't want the full folding or don't want to use ICU. Just curious, 
really.




[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter

2015-08-14 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698043#comment-14698043
 ] 

Robert Muir commented on LUCENE-6737:
-

It does, among other dangerous foldings you may not want. Additionally, it 
can't improve the behaviour of all these languages' Analyzers, as ICU is 
optional. So this is just a simple filter, like Lowercase, to improve the 
situation.




[jira] [Commented] (LUCENE-6737) Add DecimalDigitFilter

2015-08-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696974#comment-14696974
 ] 

ASF subversion and git services commented on LUCENE-6737:
-

Commit 1695898 from [~rcmuir] in branch 'dev/trunk'
[ https://svn.apache.org/r1695898 ]

LUCENE-6737: Add DecimalDigitFilter which folds unicode digits to basic latin
