[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13412627#comment-13412627 ]
Christian Moen commented on SOLR-3524:
Patch updated due to recent configuration changes.

Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

Key: SOLR-3524
URL: https://issues.apache.org/jira/browse/SOLR-3524
Project: Solr
Issue Type: Improvement
Components: Schema and Analysis
Affects Versions: 3.6
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
Priority: Minor
Attachments: SOLR-3524.patch, SOLR-3524.patch, kuromoji_discard_punctuation.patch.txt

JapaneseTokenizer (Kuromoji) doesn't provide a configuration option to preserve punctuation in Japanese text, although it has a parameter to change this behavior. JapaneseTokenizerFactory always sets the third parameter, which controls this behavior, to true to remove punctuation. I would like an option to configure this behavior in the fieldtype definition in schema.xml.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
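As a rough illustration of what the issue asks for, the sketch below shows how a factory might read an optional boolean attribute from its configuration arguments and fall back to the current hard-coded value (true, meaning punctuation is discarded) when the attribute is absent. The class name `DiscardPunctuationOption` and the attribute name `discardPunctuation` are assumptions made for this example; the thread does not quote the committed code, and this is not the actual JapaneseTokenizerFactory implementation.

```java
import java.util.Map;

/** Illustrative sketch only; not the Solr/Lucene factory code. */
final class DiscardPunctuationOption {
    // Hypothetical attribute name, assumed for this example.
    static final String KEY = "discardPunctuation";

    private DiscardPunctuationOption() {}

    /**
     * Returns the configured value, or true (discard punctuation) when the
     * attribute is missing, which mirrors the behavior described in the issue.
     */
    static boolean parse(Map<String, String> args) {
        String raw = args.get(KEY);
        if (raw == null) {
            return true; // default: keep the existing behavior
        }
        String value = raw.trim();
        if (value.equalsIgnoreCase("true")) {
            return true;
        }
        if (value.equalsIgnoreCase("false")) {
            return false;
        }
        // Reject anything else instead of silently guessing.
        throw new IllegalArgumentException(
            "Invalid value for " + KEY + ": " + raw);
    }
}
```

Defaulting to true keeps existing schemas unchanged, so only field types that explicitly set the attribute to false would see punctuation tokens.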
[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13412628#comment-13412628 ]
Christian Moen commented on SOLR-3524:
Committed revision 1360592 on {{trunk}}
[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13412659#comment-13412659 ]
Christian Moen commented on SOLR-3524:
Committed revision 1360613 on {{branch_4x}}

Fix For: 4.0, 5.0
[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13412685#comment-13412685 ]
Christian Moen commented on SOLR-3524:
{{CHANGES.txt}} for some reason didn't make it into {{branch_4x}}. Fixed this in revision 1360622.
[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13407873#comment-13407873 ]
Christian Moen commented on SOLR-3524:
I'll commit this to {{trunk}} and {{branch_4x}} soon.
[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291635#comment-13291635 ]
Christian Moen commented on SOLR-3524:
Hiraga-san, there are different views on how punctuation characters are best handled by tokenizers. Punctuation characters generally don't convey much meaning useful for text search, so they are generally removed in Lucene. (A different point of view is that tokenizers shouldn't remove punctuation and that filters should do this.) The ability to keep punctuation was left as an expert feature in JapaneseTokenizer and I think we can expose this as an expert feature in Solr as well. Could you share some details on your use case just so that I get a better idea of the background and importance of this?
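The alternative view mentioned above, where the tokenizer keeps punctuation and a later filter removes it, can be sketched in a few lines. This is a minimal, standalone illustration that works on a plain list of strings; the class `PunctuationFilterSketch` and its methods are hypothetical and do not correspond to any Lucene TokenFilter. It treats a token as punctuation when it contains no letter or digit, which is an assumption of this example rather than the rule Kuromoji applies.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not a Lucene TokenFilter. */
final class PunctuationFilterSketch {
    private PunctuationFilterSketch() {}

    /** True when the token is non-empty and contains no letter or digit. */
    static boolean isPunctuationOnly(String token) {
        if (token.isEmpty()) {
            return false;
        }
        return token.codePoints().noneMatch(Character::isLetterOrDigit);
    }

    /** Returns a new list without the punctuation-only tokens. */
    static List<String> dropPunctuation(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!isPunctuationOnly(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

With this split, the decision to drop punctuation lives in one optional stage, so an analysis chain that needs the punctuation simply leaves that stage out.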
[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291643#comment-13291643 ]
Christian Moen commented on SOLR-3524:
Ohtani-san, thanks for the patch! I've tried it on {{trunk}} and applying it fails because an {{InitializationException}} is thrown instead of a {{SolrException}}. I'll correct this shortly. We also need some tests here...
[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291787#comment-13291787 ]
Jun Ohtani commented on SOLR-3524:
Hi Christian, Sorry, I created the patch based on version 3.6.0.
[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291792#comment-13291792 ]
Christian Moen commented on SOLR-3524:
No trouble. I'll provide a new patch shortly for {{trunk}} and {{branch_4x}} with a test as well.
[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291807#comment-13291807 ]
Christian Moen commented on SOLR-3524:
New patch with tests and documentation changes attached.
[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291836#comment-13291836 ]
Kazuaki Hiraga commented on SOLR-3524:
Thank you guys! Christian, some documents have keywords that consist of letters and punctuation, such as c++ and c#, and we want to match those keywords in their unchanged form. Of course, we will discard punctuation in many cases, but in some cases, especially with short text, we want to preserve punctuation. Therefore, I want an option that lets me control this behaviour. Ohtani-san, thank you for your early reply and patch!
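The c++ and c# case can be made concrete with a toy example. The sketch below is a deliberately simple stand-in tokenizer, not Kuromoji: it groups letters and digits into words and emits each punctuation character as its own token unless `discardPunctuation` is true. The class name `ToyTokenizer` and its segmentation rule are assumptions for illustration; the real JapaneseTokenizer may segment and emit punctuation differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a tokenizer; it does not reproduce Kuromoji's segmentation. */
final class ToyTokenizer {
    private ToyTokenizer() {}

    /**
     * Groups consecutive letters/digits into one token. Each non-whitespace,
     * non-alphanumeric character becomes its own token unless
     * discardPunctuation is true, in which case it is dropped.
     */
    static List<String> tokenize(String text, boolean discardPunctuation) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
                continue;
            }
            // A non-alphanumeric character ends the current word, if any.
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (!Character.isWhitespace(c) && !discardPunctuation) {
                tokens.add(String.valueOf(c));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }
}
```

In this toy model, `tokenize("c++", true)` and `tokenize("c#", true)` both produce the single token `c`, so the two keywords become indistinguishable, while passing false keeps the `+` and `#` tokens that tell them apart.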