[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:

Attachment: SOLR-1979.patch
Attachment: SOLR-1979-branch_3x.patch

Added final patches which will be committed now.

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: contrib - LangId, update
Reporter: Jan Høydahl
Assignee: Jan Høydahl
Priority: Minor
Labels: UpdateProcessor
Fix For: 3.5, 4.0
Attachments: SOLR-1979-branch_3x.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch

Language identification from document fields, and mapping of field names to language-specific fields based on detected language. Wrap the Tika LanguageIdentifier in an UpdateProcessor.

See user documentation at http://wiki.apache.org/solr/LanguageDetection

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:

Attachment: SOLR-1979.patch

New patch:
* Added contrib folders to eclipse dot.classpath
* Added javadoc entries to build.xml
* Fixed Javadoc errors
* Upgraded test case to use schema v1.4

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: contrib - LangId, update
Reporter: Jan Høydahl
Assignee: Jan Høydahl
Priority: Minor
Labels: UpdateProcessor
Fix For: 3.5, 4.0
Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:

Attachment: SOLR-1979.patch

Fixed java.lang.IndexOutOfBoundsException bug in resolveLanguage() when no languages detected. Added more corner case tests.

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: contrib - LangId, update
Reporter: Jan Høydahl
Assignee: Jan Høydahl
Priority: Minor
Labels: UpdateProcessor
Fix For: 3.5, 4.0
Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
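The IndexOutOfBoundsException described in this update is the kind of failure that appears when code takes the first element of an empty list of detected languages. The sketch below shows one plausible guard. The method signature, the list-of-codes input, and the fallback parameter are assumptions made for illustration; the message only names resolveLanguage() and does not show its code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ResolveLanguageSketch {

    /**
     * Hypothetical signature: returns the first detected language code,
     * or the fallback when nothing was detected.
     */
    static String resolveLanguage(List<String> detected, String fallback) {
        // Guard first: calling get(0) on an empty list throws IndexOutOfBoundsException.
        if (detected == null || detected.isEmpty()) {
            return fallback;
        }
        return detected.get(0);
    }

    public static void main(String[] args) {
        System.out.println(resolveLanguage(Arrays.asList("no", "en"), "en"));
        System.out.println(resolveLanguage(Collections.<String>emptyList(), "en"));
    }
}
```

A corner-case test for this path would pass an empty list and check that the fallback comes back instead of an exception.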
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:

Attachment: SOLR-1979.patch

Some further improvements:
* Default fallback language if none set is now to avoid nullpointer exception
* All individually detected languages are now added to langsField array
* More tests

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: contrib - LangId, update
Reporter: Jan Høydahl
Assignee: Jan Høydahl
Priority: Minor
Labels: UpdateProcessor
Fix For: 3.5, 4.0
Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:

Attachment: SOLR-1979.patch

Added link to Wiki in example update chain in solrconfig

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: contrib - LangId, update
Reporter: Jan Høydahl
Assignee: Jan Høydahl
Priority: Minor
Labels: UpdateProcessor
Fix For: 3.5, 4.0
Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:

Attachment: SOLR-1979.patch

Patch updated to fit new directory structure, updated comments to point to Wiki doc. Also optimized regex, now pre-compiling patterns instead of using String.replace directly.

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: update
Reporter: Jan Høydahl
Assignee: Jan Høydahl
Priority: Minor
Labels: UpdateProcessor
Fix For: 3.5
Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
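The regex optimization mentioned in this update can be illustrated in general terms. In Java, regex-based String methods such as String.replaceAll compile their pattern on every call, whereas a java.util.regex.Pattern held in a static final field is compiled once and reused. The sketch below is an illustration only: the class name, the method name, and the {lang} placeholder used as the example pattern are assumptions, not code taken from the patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrecompiledPatternSketch {

    // Compiled once at class load; Pattern instances are immutable and safe to share.
    private static final Pattern LANG_PLACEHOLDER = Pattern.compile("\\{lang\\}");

    /** Replaces every literal {lang} in the template with the given language code. */
    static String withLanguage(String template, String lang) {
        // quoteReplacement keeps '$' and '\' in the language code from being read as group syntax.
        return LANG_PLACEHOLDER.matcher(template).replaceAll(Matcher.quoteReplacement(lang));
    }

    public static void main(String[] args) {
        System.out.println(withLanguage("title_{lang}", "en"));
    }
}
```

The saving matters in an update processor because the same replacement runs for every field of every indexed document.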
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:

Attachment: SOLR-1979.patch

New patch with these improvements:
* Now also allows config at first level, without <lst name="default">
* Added langid to example schema (commented out), so it is really easy to demonstrate

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: update
Reporter: Jan Høydahl
Assignee: Jan Høydahl
Priority: Minor
Labels: UpdateProcessor
Fix For: 3.5
Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:

Description:
Language identification from document fields, and mapping of field names to language-specific fields based on detected language. Wrap the Tika LanguageIdentifier in an UpdateProcessor.

See user documentation at http://wiki.apache.org/solr/LanguageDetection

was:
Language identification from document fields, and mapping of field names to language-specific fields based on detected language. Wrap the Tika LanguageIdentifier in an UpdateProcessor.

Fix Version/s: 4.0

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: update
Reporter: Jan Høydahl
Assignee: Jan Høydahl
Priority: Minor
Labels: UpdateProcessor
Fix For: 3.5, 4.0
Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:

Component/s: contrib - LangId

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: contrib - LangId, update
Reporter: Jan Høydahl
Assignee: Jan Høydahl
Priority: Minor
Labels: UpdateProcessor
Fix For: 3.5, 4.0
Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:

Fix Version/s: (was: 3.4)
               3.5

Moving to 3.5

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: update
Reporter: Jan Høydahl
Assignee: Jan Høydahl
Priority: Minor
Labels: UpdateProcessor
Fix For: 3.5
Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:

Attachment: SOLR-1979.patch

Updated to latest trunk, simplified build file, added clean target

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: update
Reporter: Jan Høydahl
Assignee: Jan Høydahl
Priority: Minor
Labels: UpdateProcessor
Fix For: 3.4
Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:

Fix Version/s: 3.4
Labels: UpdateProcessor (was: )

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: update
Reporter: Jan Høydahl
Assignee: Jan Høydahl
Priority: Minor
Labels: UpdateProcessor
Fix For: 3.4
Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:

Attachment: SOLR-1979.patch

Fixed threshold so that Tika distance 0.1 gives certainty 0.5 and distance 0.02 gives certainty 0.9. The default threshold of 0.5 now works pretty well, at least for the tests...

*New parameters:*

Field name mapping is now configurable with a user-defined pattern, so to map ABC_title to title_lang, you set:
{code}
langid.map.pattern=ABC_(.*)
langid.map.replace=$1_{lang}
{code}

A parameter to map multiple detected languages to the same field regex. For example, to map Japanese, Korean and Chinese texts to a field *_cjk, do:
{code}langid.map.lcmap=ja:cjk zh:cjk ko:cjk{code}

Turn off validation of field names against schema (useful if you want to rename or delete fields later in the UpdateChain):
{code}langid.enforceSchema=false{code}

*Other changes*

Removed default on langField, i.e. if langField is not specified, the detected language will not be written anywhere. A typical minimal config for only detecting language and writing to a field is now:
{code}
<processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
  <defaults>
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
  </defaults>
</processor>
{code}

Also added multiple other languages to the tests.

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: update
Reporter: Jan Høydahl
Assignee: Jan Høydahl
Priority: Minor
Labels: UpdateProcessor
Fix For: 3.4
Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
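The threshold change in this message gives two data points: Tika distance 0.1 maps to certainty 0.5, and distance 0.02 maps to certainty 0.9. A straight line through both points is certainty = 1 - 5 * distance. The sketch below assumes that linear form and clamps the result to the range 0 to 1; the message does not state the actual formula in the patch, and the class and method names are made up for illustration.

```java
final class CertaintySketch {

    /**
     * Assumed linear mapping through the two points given in the message:
     * distance 0.1 -> certainty 0.5, distance 0.02 -> certainty 0.9.
     */
    static double certainty(double distance) {
        double raw = 1.0 - 5.0 * distance;
        // Clamp so that large distances never yield a negative certainty.
        return Math.max(0.0, Math.min(1.0, raw));
    }

    /** A detection is accepted when its certainty reaches the configured threshold. */
    static boolean accept(double distance, double threshold) {
        return certainty(distance) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(certainty(0.1));
        System.out.println(certainty(0.02));
        System.out.println(accept(0.3, 0.5));
    }
}
```

Under this assumption the default threshold of 0.5 accepts any detection with a Tika distance of 0.1 or less.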
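The langid.map.pattern, langid.map.replace and langid.map.lcmap parameters described in this message can be read as a three-step rename: look up the detected language in the lcmap, substitute the result for {lang} in the replacement string, then apply the regex replacement to the field name. The sketch below follows that reading. The class, the method names and the parsing of the lcmap string are assumptions for illustration, not the processor's code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class FieldNameMappingSketch {

    /** Parses a space-separated list of from:to pairs, e.g. "zh:cjk ko:cjk". */
    static Map<String, String> parseLcmap(String spec) {
        Map<String, String> map = new HashMap<>();
        for (String pair : spec.trim().split("\\s+")) {
            String[] kv = pair.split(":", 2);
            if (kv.length == 2 && !kv[0].isEmpty()) {
                map.put(kv[0], kv[1]);
            }
        }
        return map;
    }

    /** Renames a field for the detected language; unmatched names are returned unchanged. */
    static String mapFieldName(String field, String lang, Pattern pattern,
                               String replace, Map<String, String> lcmap) {
        // Languages listed in the lcmap share one suffix; others keep their own code.
        String suffix = lcmap.getOrDefault(lang, lang);
        // Fill in {lang} first, then let the regex engine expand $1.
        String replacement = replace.replace("{lang}", suffix);
        return pattern.matcher(field).replaceFirst(replacement);
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("ABC_(.*)");
        Map<String, String> lcmap = parseLcmap("zh:cjk ko:cjk");
        System.out.println(mapFieldName("ABC_title", "en", p, "$1_{lang}", lcmap));
        System.out.println(mapFieldName("ABC_title", "zh", p, "$1_{lang}", lcmap));
    }
}
```

With the pattern and replacement from the message, ABC_title becomes title_en for English, and title_cjk for any language the lcmap folds into cjk.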
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:

Attachment: SOLR-1979.patch

New version. Example of accepted params:
{code}
<processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
  <defaults>
    <str name="langid">true</str>
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
    <str name="langid.langsField">languages</str>
    <str name="langid.overwrite">false</str>
    <float name="langid.threshold">0.5</float>
    <str name="langid.whitelist">no,en,es,da</str>
    <str name="langid.map">true</str>
    <str name="langid.map.fl">title,text</str>
    <bool name="langid.map.overwrite">false</bool>
    <bool name="langid.map.keepOrig">false</bool>
    <bool name="langid.map.individual">false</bool>
    <str name="langid.map.individual.fl"></str>
    <str name="langid.fallbackFields">meta_content_language,lang</str>
    <str name="langid.fallback">en</str>
  </defaults>
</processor>
{code}

The only mandatory parameter is langid.fl.

To enable field name mapping, set langid.map=true. It will then map field names for all fields in langid.fl. If the set of fields to map is different from langid.fl, supply langid.map.fl. Those fields will then be renamed with a language suffix equal to the language detected from the langid.fl fields.

If you require detecting languages separately for each field, supply langid.map.individual=true. The supplied fields will then be renamed according to detected language on an individual basis. If the set of fields to detect individually is different from the already supplied langid.fl or langid.map.fl, supply langid.map.individual.fl. The fields listed in langid.map.individual.fl will then be detected individually, while the rest of the mapping fields will be mapped according to global document language.

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: update
Reporter: Jan Høydahl
Assignee: Jan Høydahl
Priority: Minor
Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch

We need the ability to detect language of some random text in order to act upon it, such as indexing the content into language aware fields. Another usecase is to be able to filter/facet on language on random unstructured content.

To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The processor is configurable like this:
{code:xml}
<processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
  <str name="inputFields">name,subject</str>
  <str name="outputField">language_s</str>
  <str name="idField">id</str>
  <str name="fallback">en</str>
</processor>
{code}

It will then read the text from inputFields name and subject, perform language identification and output the ISO code for the detected language in the outputField. If no language was detected, fallback language is used.
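The interplay of langid.fl, langid.map.fl and langid.map.individual.fl described in this message amounts to a chain of defaults: the fields to rename default to langid.fl, and the fields to detect individually default to the fields to rename. The sketch below encodes that reading. All class and method names are hypothetical, and the real processor may resolve these sets differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class LangIdFieldSetsSketch {

    /** Splits a comma-separated parameter value; null or blank yields an empty list. */
    static List<String> csv(String value) {
        List<String> out = new ArrayList<>();
        if (value == null) {
            return out;
        }
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    /** Fields to rename: langid.map.fl when supplied, otherwise langid.fl. */
    static List<String> mapFields(String fl, String mapFl) {
        List<String> explicit = csv(mapFl);
        return explicit.isEmpty() ? csv(fl) : explicit;
    }

    /**
     * Fields detected one by one: none unless langid.map.individual is true,
     * then langid.map.individual.fl when supplied, otherwise all mapped fields.
     */
    static List<String> individualFields(boolean individual, String fl,
                                         String mapFl, String individualFl) {
        if (!individual) {
            return Collections.emptyList();
        }
        List<String> explicit = csv(individualFl);
        return explicit.isEmpty() ? mapFields(fl, mapFl) : explicit;
    }

    public static void main(String[] args) {
        String fl = "title,subject,text,keywords";
        System.out.println(mapFields(fl, "title,text"));
        System.out.println(individualFields(true, fl, "title,text", "title"));
    }
}
```

In the second call only title is detected on its own; text is still renamed, but using the language detected for the document as a whole.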
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:

Description:
Language identification from document fields, and mapping of field names to language-specific fields based on detected language. Wrap the Tika LanguageIdentifier in an UpdateProcessor.

was:
We need the ability to detect language of some random text in order to act upon it, such as indexing the content into language aware fields. Another usecase is to be able to filter/facet on language on random unstructured content.

To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The processor is configurable like this:
{code:xml}
<processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
  <str name="inputFields">name,subject</str>
  <str name="outputField">language_s</str>
  <str name="idField">id</str>
  <str name="fallback">en</str>
</processor>
{code}

It will then read the text from inputFields name and subject, perform language identification and output the ISO code for the detected language in the outputField. If no language was detected, fallback language is used.

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: update
Reporter: Jan Høydahl
Assignee: Jan Høydahl
Priority: Minor
Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-1979:

Attachment: SOLR-1979.patch

Removes mentions of ISO 639.

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-1979:

Attachment: SOLR-1979.patch

I took Jan's and Tommaso's patches and reworked them a bit. It seems to me that there isn't much point in merely identifying the language if you aren't going to do something about it. So, this patch builds on what Jan and Tommaso did and then remaps the input fields to new per-language fields (note, we could make this optional). I also tried to standardize the input parameters a bit. I dropped the outputField setting and a number of other settings, and I made language detection per input field.

The basic gist of it is that if you input two fields, name and subject, it will detect the language of each field and then attempt to map it to a new field. The new field name is made by concatenating the original field name with _ + the ISO 639 code. For example, if en is the detected language, then the new field for name would be name_en. If that field doesn't exist, it will fall back to the original field (i.e. name).

Left to do:
# Fix the tests. I don't like how we currently test UpdateProcessorChains. It should not require writing your own little piece of update mechanism. You should be able to simply set up the appropriate configuration, hook it into an update handler and then hit that update handler.
# Need to check the license headers, builds, etc.

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
Attachments: SOLR-1979.patch, SOLR-1979.patch
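The per-field remapping rule described in this message (field name, underscore, language code, with a fall back to the original name when the target field does not exist) can be sketched as follows. The schema is stood in for by a plain set of field names, and the class and method names are assumptions for illustration; the patch itself would consult the Solr schema.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class PerLanguageFieldSketch {

    /**
     * Returns field + "_" + lang when the schema knows that field,
     * otherwise the original field name.
     */
    static String targetField(String field, String lang, Set<String> schemaFields) {
        String candidate = field + "_" + lang;
        return schemaFields.contains(candidate) ? candidate : field;
    }

    public static void main(String[] args) {
        // Hypothetical schema: only name has an English-specific variant.
        Set<String> schema = new HashSet<>(Arrays.asList("name", "name_en", "subject"));
        System.out.println(targetField("name", "en", schema));
        System.out.println(targetField("subject", "en", schema));
    }
}
```

Here name is remapped to name_en, while subject stays as it is because subject_en is not defined in the assumed schema.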
[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-1979:

Attachment: SOLR-1979.patch

Here's a patch that passes the tests. Note, I modified the Solr base test case to have some new methods to properly call update handlers and then validate the results.

Create LanguageIdentifierUpdateProcessor
Key: SOLR-1979
URL: https://issues.apache.org/jira/browse/SOLR-1979
Project: Solr
Issue Type: New Feature
Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch