[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-07-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412627#comment-13412627
 ] 

Christian Moen commented on SOLR-3524:
--

Patch updated due to recent configuration changes.

 Make discard-punctuation feature in Kuromoji configurable from 
 JapaneseTokenizerFactory
 ---

 Key: SOLR-3524
 URL: https://issues.apache.org/jira/browse/SOLR-3524
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.6
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
Priority: Minor
 Attachments: SOLR-3524.patch, SOLR-3524.patch, 
 kuromoji_discard_punctuation.patch.txt


 JapaneseTokenizer (Kuromoji) doesn't provide a configuration option to preserve 
 punctuation in Japanese text, although it has a parameter to change this 
 behavior.  JapaneseTokenizerFactory always sets the third parameter, which 
 controls this behavior, to true to remove punctuation.
 I would like to have an option to configure this behavior in the fieldtype 
 definition in schema.xml.
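
 As a rough illustration of the request, a factory could read an optional
 boolean attribute from its configuration arguments and fall back to true
 (discard punctuation) when the attribute is absent. The sketch below is a
 self-contained stand-in, not the actual Solr class: the class name, the
 method, and the attribute name "discardPunctuation" are assumptions made
 for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a tokenizer factory; not the actual Solr code.
class DiscardPunctuationArgs {
    // Attribute name assumed for this sketch.
    static final String DISCARD_PUNCTUATION = "discardPunctuation";

    // Returns the configured value, or true when the attribute is missing,
    // which matches the always-discard behavior described in the issue.
    static boolean discardPunctuation(Map<String, String> args) {
        String value = args.get(DISCARD_PUNCTUATION);
        return value == null ? true : Boolean.parseBoolean(value);
    }

    public static void main(String[] argv) {
        Map<String, String> args = new HashMap<>();
        System.out.println(discardPunctuation(args)); // attribute absent
        args.put(DISCARD_PUNCTUATION, "false");
        System.out.println(discardPunctuation(args)); // attribute set to false
    }
}
```

 Defaulting to true keeps existing schemas working unchanged; only a field
 type that sets the attribute to false would keep punctuation.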

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-07-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412628#comment-13412628
 ] 

Christian Moen commented on SOLR-3524:
--

Committed revision 1360592 on {{trunk}}




[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-07-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412659#comment-13412659
 ] 

Christian Moen commented on SOLR-3524:
--

Committed revision 1360613 on {{branch_4x}}

 Make discard-punctuation feature in Kuromoji configurable from 
 JapaneseTokenizerFactory
 ---

 Key: SOLR-3524
 URL: https://issues.apache.org/jira/browse/SOLR-3524
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.6
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
Priority: Minor
 Fix For: 4.0, 5.0

 Attachments: SOLR-3524.patch, SOLR-3524.patch, 
 kuromoji_discard_punctuation.patch.txt





[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-07-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412685#comment-13412685
 ] 

Christian Moen commented on SOLR-3524:
--

{{CHANGES.txt}} for some reason didn't make it into {{branch_4x}}.  Fixed this 
in revision 1360622.




[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-07-06 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407873#comment-13407873
 ] 

Christian Moen commented on SOLR-3524:
--

I'll commit this to {{trunk}} and {{branch_4x}} soon.




[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-06-08 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291635#comment-13291635
 ] 

Christian Moen commented on SOLR-3524:
--

Hiraga-san, there are different views on how punctuation characters are best 
handled by tokenizers.  Punctuation characters generally don't convey much 
meaning useful for text search, so they are generally removed in Lucene. (A 
different point of view is that tokenizers shouldn't remove punctuation and 
that filters should do this.)

The ability to keep punctuation was left as an expert feature in 
JapaneseTokenizer and I think we can expose this as an expert feature in Solr as 
well.  Could you share some details on your use case just so that I get a 
better idea of the background and importance of this?


  





[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-06-08 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291643#comment-13291643
 ] 

Christian Moen commented on SOLR-3524:
--

Ohtani-san, thanks for the patch!

I've tried it on {{trunk}} and applying it fails because an 
{{InitializationException}} is thrown instead of a {{SolrException}}.  I'll 
correct this shortly.

We also need some tests here...




[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-06-08 Thread Jun Ohtani (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291787#comment-13291787
 ] 

Jun Ohtani commented on SOLR-3524:
--

Hi Christian,

Sorry, I created the patch based on version 3.6.0.




[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-06-08 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291792#comment-13291792
 ] 

Christian Moen commented on SOLR-3524:
--

No trouble.  I'll provide a new patch shortly for {{trunk}} and {{branch_4x}} 
with a test as well.




[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-06-08 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291807#comment-13291807
 ] 

Christian Moen commented on SOLR-3524:
--

New patch with tests and documentation changes attached.




[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-06-08 Thread Kazuaki Hiraga (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291836#comment-13291836
 ] 

Kazuaki Hiraga commented on SOLR-3524:
--

Thank you guys!
Christian, some documents have keywords that consist of letters and 
punctuation, such as c++ and c#, and we want to match those keywords in their 
unchanged form. Of course, we will discard punctuation in many cases, but in 
some cases, especially for short text, we want to preserve punctuation. 
Therefore, I want to have an option to control this behaviour.

Ohtani-san, thank you for your early reply and patch! 
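
The c++ / c# case can be shown with a toy example. The sketch below is not
Kuromoji and makes no claim about how JapaneseTokenizer actually segments
such input; it is a hypothetical whitespace splitter (class and method names
are made up for illustration) that optionally strips ASCII punctuation, which
is enough to show how distinct keywords collapse into the same token.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: a whitespace splitter, not Kuromoji.
class PunctuationDemo {
    // Splits on whitespace; when discardPunctuation is true, ASCII
    // punctuation is stripped from each token and empty tokens are dropped.
    static List<String> tokenize(String text, boolean discardPunctuation) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = discardPunctuation ? raw.replaceAll("\\p{Punct}", "") : raw;
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] argv) {
        System.out.println(tokenize("c++ c#", true));  // [c, c]
        System.out.println(tokenize("c++ c#", false)); // [c++, c#]
    }
}
```

With punctuation discarded, both keywords become the same token "c", so a
query for one also matches the other; keeping punctuation preserves the
distinction.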
