[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
------------------------------

    Attachment: SOLR-1979.patch

Fixed the certainty calculation so that a Tika distance of 0.1 gives a 
certainty of 0.5 and a distance of 0.02 gives a certainty of 0.9. The default 
threshold of 0.5 now works pretty well, at least for the tests.
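
For illustration: the two data points above lie on the straight line certainty = 1 - 5 * distance. The sketch below shows one mapping consistent with those numbers; it is not necessarily the formula in the patch. The class and method names are hypothetical, and both the clamping to [0, 1] and the use of >= for the threshold check are assumptions.

```java
/** Hypothetical helper for illustration only; not taken from the patch. */
final class CertaintySketch {
    private CertaintySketch() {}

    /**
     * Linear map through the points (0.1, 0.5) and (0.02, 0.9).
     * Clamping the result to [0, 1] is an assumption.
     */
    static double certainty(double tikaDistance) {
        double raw = 1.0 - 5.0 * tikaDistance;
        return Math.max(0.0, Math.min(1.0, raw));
    }

    /** Accept a detection when its certainty reaches the threshold (assumed inclusive). */
    static boolean accept(double tikaDistance, double threshold) {
        return certainty(tikaDistance) >= threshold;
    }
}
```

With the default threshold of 0.5, `CertaintySketch.accept(0.02, 0.5)` is true, while a distance of 0.15 maps to a certainty of about 0.25 and is rejected.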

*New parameters:*
Field name mapping is now configurable with a user-defined pattern. To map 
ABC_title to title_<lang>, set:
{code}
&langid.map.pattern=ABC_(.*)
&langid.map.replace=$1_{lang}
{code}
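
A rough sketch of how such a pattern/replace pair could be applied, using plain java.util.regex. This is an illustration under assumptions, not the code in the patch: the class and method names are hypothetical, and leaving non-matching field names unchanged is an assumed behavior.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration of langid.map.pattern / langid.map.replace; not the patch code. */
final class FieldNameMappingSketch {
    private FieldNameMappingSketch() {}

    static String mapFieldName(String fieldName, String mapPattern,
                               String mapReplace, String lang) {
        Pattern pattern = Pattern.compile(mapPattern);
        // Assumption: a field that does not match the pattern keeps its name.
        if (!pattern.matcher(fieldName).matches()) {
            return fieldName;
        }
        // Expand regex group references such as $1 first ...
        String expanded = pattern.matcher(fieldName).replaceAll(mapReplace);
        // ... then substitute the literal {lang} placeholder with the detected language.
        return expanded.replace("{lang}", lang);
    }
}
```

For example, `FieldNameMappingSketch.mapFieldName("ABC_title", "ABC_(.*)", "$1_{lang}", "en")` yields `title_en`.
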
A new parameter maps several detected languages to the same field suffix. For 
example, to map Japanese, Chinese and Korean texts to a field *_cjk, do:
{code}langid.map.lcmap=ja:cjk zh:cjk ko:cjk{code}
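
As a sketch of what this value encodes: it is a whitespace-separated list of from:to pairs. The parser below is a hypothetical illustration, not the patch code; the names are invented, and falling back to the detected code when no mapping exists is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for a langid.map.lcmap style value; not the patch code. */
final class LcMapSketch {
    private LcMapSketch() {}

    /** Parses "from:to" pairs separated by whitespace into an ordered map. */
    static Map<String, String> parse(String lcmap) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String pair : lcmap.trim().split("\\s+")) {
            if (pair.isEmpty()) {
                continue; // empty input yields an empty map
            }
            String[] kv = pair.split(":", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException("Expected from:to but got: " + pair);
            }
            result.put(kv[0], kv[1]);
        }
        return result;
    }

    /** Assumption: a language code without a mapping is used as-is. */
    static String resolve(Map<String, String> lcmap, String detected) {
        return lcmap.getOrDefault(detected, detected);
    }
}
```

With a map parsed from `zh:cjk ko:cjk`, `resolve(map, "zh")` returns `cjk` and `resolve(map, "en")` returns `en`.
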
Turn off validation of field names against the schema (useful if you want to 
rename or delete fields later in the UpdateChain):
{code}&langid.enforceSchema=false{code}

*Other changes*
Removed the default for langField: if langField is not specified, the detected 
language is not written anywhere. A typical minimal config that only detects 
the language and writes it to a field is now:
{code}
<processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
   <defaults>
     <str name="langid.fl">title,subject,text,keywords</str>
     <str name="langid.langField">language_s</str>
   </defaults>
</processor>
{code}

Also added several more languages to the tests.

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Jan Høydahl
>            Priority: Minor
>              Labels: UpdateProcessor
>             Fix For: 3.4
>
>         Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


