[jira] [Commented] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594154#comment-16594154 ]

Ivan Provalov commented on LUCENE-7321:
---

[~arafalov], the clean use case for this filter is to externalize the morphological modification rules. Most stemmers have hard-coded rules. With this one, the rules are expressed in flat mapping files and configuration. Originally, it was developed to extend a few cases for some of the languages listed here and a few other languages, and to visualize these rules, which helps the linguists involved in the project understand the modification rules in more complex scenarios. I added the Russian stemmer implementation as a general reference, just to show how one can configure an entire stemmer without hard-coded rules. We have not seen any performance issues with this so far. Hope this helps.

> Character Mapping
> -
>
> Key: LUCENE-7321
> URL: https://issues.apache.org/jira/browse/LUCENE-7321
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Affects Versions: 4.6.1, 5.4.1, 6.0, 6.0.1
> Reporter: Ivan Provalov
> Priority: Minor
> Labels: patch
> Fix For: 6.0.1
>
> Attachments: CharacterMappingComponent.pdf, LUCENE-7321.patch
>
>
> One of the challenges in search is recall of an item with a common typing
> variant. These cases can be as simple as lower/upper case in most languages,
> accented characters, or more complex morphological phenomena like prefix
> omission, or constructing a character with some combining mark. This
> component addresses the cases which are not covered by the ASCII folding
> component, or which are more complex to design with other tools. The idea is
> that a linguist could provide the mappings in a tab-delimited file, which can
> then be used directly by Solr.
> The mappings are maintained in the tab-delimited file, which could be just a
> copy-paste from an Excel spreadsheet. This gives the linguists the opportunity
> to create the mappings, and the developer can then include them in the Solr
> configuration. There are a few cases, when the mappings grow complex, where
> some additional debugging may be required. The mappings can map any
> sequence of characters to any other sequence of characters.
> Some of the cases I discuss in the detailed document are handling the voiced
> vowels for Japanese; common typing substitutions for Korean, Russian, Polish;
> transliteration for Polish, Arabic; prefix removal for Arabic; suffix folding
> for Japanese. In the appendix, I give an example of implementing a Russian
> lightweight stemmer using this component.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
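To make the tab-delimited mapping idea from the issue description concrete, here is a minimal Java sketch. It is not the code from LUCENE-7321.patch: the class name `TabMappingSketch`, the two-column `source<TAB>target` layout, and the longest-match-first policy are all assumptions made for illustration, and the sketch deliberately avoids the Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the component attached to LUCENE-7321. */
final class TabMappingSketch {
    private TabMappingSketch() {}

    /**
     * Parses lines of the assumed form "source<TAB>target".
     * Blank lines are skipped; an empty target means "delete the source".
     */
    static List<String[]> parse(String tsv) {
        List<String[]> rules = new ArrayList<>();
        for (String line : tsv.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] cols = line.split("\t", -1);
            if (cols.length != 2 || cols[0].isEmpty()) {
                throw new IllegalArgumentException("Expected source<TAB>target: " + line);
            }
            rules.add(cols);
        }
        // Longest source first, so a multi-character rule wins over a shorter
        // rule that starts at the same position. The sort is stable, so rules
        // of equal length keep their order from the file.
        rules.sort((a, b) -> Integer.compare(b[0].length(), a[0].length()));
        return rules;
    }

    /** Rewrites one token in a single left-to-right pass; output is not re-scanned. */
    static String apply(List<String[]> rules, String token) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < token.length()) {
            String[] hit = null;
            for (String[] rule : rules) {
                if (token.startsWith(rule[0], i)) {
                    hit = rule;
                    break;
                }
            }
            if (hit == null) {
                out.append(token.charAt(i));
                i++;
            } else {
                out.append(hit[1]);
                i += hit[0].length();
            }
        }
        return out.toString();
    }
}
```

With a one-line file mapping `ё` to `е` (a common Russian typing substitution), `TabMappingSketch.apply(rules, "ёлка")` yields `елка`. Because each rule is just two columns, a sheet maintained by a linguist could be exported and loaded without any code change, which is the workflow the description argues for.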
[jira] [Commented] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594099#comment-16594099 ]

Alexandre Rafalovitch commented on LUCENE-7321:
---

This feels a little bit like too many use-cases folded into one piece of code. Arabic, Japanese, Korean names special handling, Russian already covered by the stemmer. I am not sure what the clean use-case is here. Especially with say [PatternReplaceCharFilterFactory|http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.html] being there to cover possible special use-case gaps (at a lower performance perhaps). And with ICU4J possibly covering others.
[jira] [Commented] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593976#comment-16593976 ]

Ivan Provalov commented on LUCENE-7321:
---

[~erickerickson], Good questions:
1. I just ran the tests in the patch against master, and they passed.
2. It allows you to configure/modify morphological analysis with externalized mapping files. I attached a description and a reference implementation of the Russian stemmer using this filter.

Thanks,
Ivan
[jira] [Commented] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593854#comment-16593854 ]

Erick Erickson commented on LUCENE-7321:

[~iprovalo] Ohhh, you would have to skewer me wouldn't you? I have no idea about the merits of this patch, this isn't something I work with. Does it apply to master? and what does it _do_?
[jira] [Commented] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593779#comment-16593779 ]

Ivan Provalov commented on LUCENE-7321:
---

[~erike4...@yahoo.com], any progress on committing this patch?

Thanks,
Ivan
[jira] [Commented] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593170#comment-16593170 ]

Erick Erickson commented on LUCENE-7321:

There's a great chance if someone submits a patch and it gets committed. It's only because people step up and volunteer to improve things that language support improves...
[jira] [Commented] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593101#comment-16593101 ]

Nick Chervov commented on LUCENE-7321:
--

Hi everyone! Is there any chance to get better Russian support in future releases of Solr?
[jira] [Commented] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424132#comment-16424132 ]

Alexey Ponomarenko commented on LUCENE-7321:

Hi, is there any plan to integrate this into Lucene/Solr?
[jira] [Commented] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321865#comment-15321865 ]

Ivan Provalov commented on LUCENE-7321:
---

Koji, this one works on a token level, allowing things like prefix/suffix manipulations. The graph generator and collapser also make it user-friendly when dealing with a lot of mappings (please see the attached description file).
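The token-level point in Ivan's reply can be illustrated with a small Java sketch. A character-level mapping sees the raw character stream before tokenization and has no notion of where a token starts or ends, whereas a token-level rule can be anchored to either edge. The class `TokenAffixSketch`, its method names, and the English stand-in rules are hypothetical; the patch's real rule format is described only in the attached PDF and is not reproduced here.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of edge-anchored rewriting; not the patch's API. */
final class TokenAffixSketch {
    private TokenAffixSketch() {}

    /** Removes the first listed prefix that matches, only at the start of the token. */
    static String stripPrefix(List<String> prefixes, String token) {
        for (String prefix : prefixes) {
            // Require at least one character to remain after stripping.
            if (token.length() > prefix.length() && token.startsWith(prefix)) {
                return token.substring(prefix.length());
            }
        }
        return token;
    }

    /**
     * Replaces the first matching suffix, only at the end of the token.
     * Rules are tried in the map's iteration order, so pass an ordered map
     * (for example a LinkedHashMap) with longer suffixes first.
     */
    static String foldSuffix(Map<String, String> rules, String token) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            String suffix = rule.getKey();
            if (token.length() > suffix.length() && token.endsWith(suffix)) {
                return token.substring(0, token.length() - suffix.length()) + rule.getValue();
            }
        }
        return token;
    }
}
```

With the ordered rules `ies -> y` and `s -> (empty)`, `foldSuffix` turns `ponies` into `pony` and `sisters` into `sister`. An unanchored character mapping of `s` to the empty string would instead delete every `s` in `sisters`, which is the kind of difference the token-level design is meant to avoid.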
[jira] [Commented] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321858#comment-15321858 ]

Koji Sekiguchi commented on LUCENE-7321:

What is the advantage of this compared to MappingCharFilter?