[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jan Høydahl updated SOLR-1979:
------------------------------
    Attachment: SOLR-1979.patch

Fixed the threshold so that a Tika distance of 0.1 gives certainty 0.5 and a distance of 0.02 gives certainty 0.9. The default threshold of 0.5 now works pretty well, at least for the tests...

*New parameters:*

Field name mapping is now configurable with a user-defined pattern. To map ABC_title to title_<lang>, set:
{code}
&langid.map.pattern=ABC_(.*)
&langid.map.replace=$1_{lang}
{code}

A parameter to map multiple detected languages to the same mapped field. For example, to map Japanese, Korean and Chinese texts to a field *_cjk, set:
{code}&langid.map.lcmap=ja:cjk zh:cjk ko:cjk{code}

Turn off validation of field names against the schema (useful if you want to rename or delete fields later in the UpdateChain):
{code}&langid.enforceSchema=false{code}

*Other changes*

Removed the default on langField, i.e. if langField is not specified, the detected language will not be written anywhere. A typical minimal config for only detecting the language and writing it to a field is now:
{code}
<processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
  <defaults>
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
  </defaults>
</processor>
{code}

Also added several more languages to the tests.

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Jan Høydahl
>            Priority: Minor
>              Labels: UpdateProcessor
>             Fix For: 3.4
>
>         Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
>                      SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
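
For readers following the parameters described in the comment above, here is a minimal standalone sketch of how the pattern/replace/lcmap mapping and the distance-to-certainty conversion could fit together. It is not taken from the patch. The class and method names (LangIdMappingSketch, certainty, parseLcMap, mapFieldName) are hypothetical, and the linear formula is only an assumption: it is simply one mapping that passes through the two quoted points (0.1 gives 0.5, 0.02 gives 0.9). The sketch uses the ISO 639-1 code "ja" for Japanese.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: not the code in SOLR-1979.patch.
class LangIdMappingSketch {

    // Assumed linear mapping through the two points quoted in the comment
    // (distance 0.1 -> certainty 0.5, distance 0.02 -> certainty 0.9),
    // clamped to the range [0, 1].
    static double certainty(double distance) {
        double c = 1.0 - 5.0 * distance;
        return Math.max(0.0, Math.min(1.0, c));
    }

    // Parses a space-separated "from:to" list such as "ja:cjk zh:cjk ko:cjk".
    // Malformed entries are skipped.
    static Map<String, String> parseLcMap(String spec) {
        Map<String, String> map = new HashMap<>();
        for (String pair : spec.trim().split("\\s+")) {
            String[] kv = pair.split(":");
            if (kv.length == 2) {
                map.put(kv[0], kv[1]);
            }
        }
        return map;
    }

    // Applies langid.map.pattern / langid.map.replace style mapping.
    // {lang} in the replacement is filled with the lcmap target if one
    // exists, otherwise with the detected language code itself.
    // Fields that do not match the pattern are returned unchanged.
    static String mapFieldName(String field, String lang, Pattern pattern,
                               String replace, Map<String, String> lcMap) {
        Matcher m = pattern.matcher(field);
        if (!m.matches()) {
            return field;
        }
        String suffix = lcMap.getOrDefault(lang, lang);
        return m.replaceFirst(replace.replace("{lang}", suffix));
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("ABC_(.*)");
        Map<String, String> lcMap = parseLcMap("ja:cjk zh:cjk ko:cjk");

        System.out.println(mapFieldName("ABC_title", "ja", pattern, "$1_{lang}", lcMap));
        System.out.println(mapFieldName("ABC_title", "en", pattern, "$1_{lang}", lcMap));
        System.out.println(certainty(0.1) >= 0.5 ? "accepted" : "rejected");
    }
}
```

With these inputs, ABC_title maps to title_cjk for a Japanese document and to title_en for an English one, and a distance of 0.1 sits exactly at the default threshold of 0.5.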