[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
------------------------------

    Description: 
We need the ability to detect the language of arbitrary text in order to act 
upon it, such as indexing the content into language-aware fields. Another use 
case is to filter/facet by language on unstructured content.

To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
processor is configurable like this:

{code:xml} 
  <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
    <str name="inputFields">name,subject</str>
    <str name="outputField">language_s</str>
    <str name="idField">id</str>
    <str name="fallback">en</str>
  </processor>
{code} 

The processor then reads the text from the inputFields (here name and subject), 
performs language identification, and writes the ISO code of the detected 
language to the outputField. If no language is detected, the fallback language 
is used.
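
The flow described above can be sketched roughly as follows. This is only an illustration, not the code in SOLR-1979.patch: the class LanguageFieldSketch, the method identify, and the toyDetector stand-in are hypothetical names, a plain Map stands in for the Solr document, and the detector is a trivial stop-word check rather than the Tika LanguageIdentifier. The idField parameter is left out because the description does not say how it is used.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Sketch only: hypothetical names, not the classes from the attached patch.
final class LanguageFieldSketch {

    /**
     * Joins the values of the input fields, asks the detector for an ISO
     * language code, and stores that code (or the fallback) in outputField.
     * Returns a copy of the document; the input map is not modified.
     */
    static Map<String, String> identify(Map<String, String> doc,
                                        List<String> inputFields,
                                        String outputField,
                                        String fallback,
                                        Function<String, Optional<String>> detector) {
        StringBuilder text = new StringBuilder();
        for (String field : inputFields) {
            String value = doc.get(field);
            if (value != null && !value.isEmpty()) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(value);
            }
        }

        // No text at all, or no language detected: use the fallback.
        String language = text.length() == 0
                ? fallback
                : detector.apply(text.toString()).orElse(fallback);

        Map<String, String> result = new LinkedHashMap<>(doc);
        result.put(outputField, language);
        return result;
    }

    // Assumption: a toy detector standing in for a real language identifier.
    static Optional<String> toyDetector(String text) {
        String padded = " " + text.toLowerCase(Locale.ROOT) + " ";
        if (padded.contains(" the ")) {
            return Optional.of("en");
        }
        if (padded.contains(" und ")) {
            return Optional.of("de");
        }
        return Optional.empty();
    }
}
```

With the configuration shown above, a call would look like identify(doc, List.of("name", "subject"), "language_s", "en", LanguageFieldSketch::toyDetector); a real processor would plug an actual language identifier in place of the toy detector.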

  was:
We need the ability to detect the language of arbitrary text in order to act 
upon it, such as indexing the content into language-aware fields. Another use 
case is to filter/facet by language on unstructured content.

To do this, we should wrap the [Nutch 
LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html]
in an UpdateProcessor. The processor should be configured like this:

{code:xml} 
  <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
    <str name="inputFields">title,teaser,body</str>
    <str name="isoOutputField">language</str>
    <str name="fullOutputField">language_display</str>
  </processor>  
{code} 


> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Priority: Minor
>         Attachments: SOLR-1979.patch
>
