[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899568#action_12899568
 ] 

Jan Høydahl commented on SOLR-1979:
-----------------------------------

I have implemented a first-shot patch using the Tika LanguageIdentifier. It is 
unfortunately quite limited in features, and for short text segments 
isReasonablyCertain() always returns false :( Also, the number of supported 
languages is still quite low. But it works as a start, and we can then focus on 
improving the Tika code in future releases.
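
To make the short-text limitation concrete, here is a rough sketch of the kind of 
fallback logic I have in mind. It is not the patch itself: the Detector interface 
is a hypothetical stand-in for Tika's LanguageIdentifier, and the class name, 
method names and the "unknown" fallback value are assumptions made only for 
illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: names here are illustrative, not the patch's actual classes. */
final class LangIdSketch {

    /** Hypothetical stand-in for a detector such as Tika's LanguageIdentifier. */
    interface Detector {
        String language(String text);            // language code, e.g. "en"
        boolean reasonablyCertain(String text);  // false when the text is too short to trust
    }

    /**
     * Concatenates the configured input fields, runs detection, and writes the
     * result to outputField on a copy of the document. Uses fallbackLang when
     * there is no text or the detector is not reasonably certain.
     */
    static Map<String, String> process(Map<String, String> doc, List<String> inputFields,
                                       String outputField, String fallbackLang,
                                       Detector detector) {
        StringBuilder text = new StringBuilder();
        for (String field : inputFields) {
            String value = doc.get(field);
            if (value != null) {
                text.append(value).append(' ');
            }
        }
        String content = text.toString().trim();
        String lang = (!content.isEmpty() && detector.reasonablyCertain(content))
                ? detector.language(content)
                : fallbackLang;
        Map<String, String> out = new LinkedHashMap<>(doc);
        out.put(outputField, lang);
        return out;
    }

    public static void main(String[] args) {
        // Toy detector: "certain" only for texts of at least 20 characters.
        Detector toy = new Detector() {
            public String language(String text) { return text.contains(" the ") ? "en" : "no"; }
            public boolean reasonablyCertain(String text) { return text.length() >= 20; }
        };
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "Searching the index");
        doc.put("body", "Solr stores the documents it is given.");
        System.out.println(process(doc, Arrays.asList("title", "teaser", "body"),
                "language", "unknown", toy));
    }
}
```

Missing input fields (here "teaser") are simply skipped, and a document with too 
little text ends up with the fallback value rather than a guessed language.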

I plan on putting the patch in contrib/extraction, since it depends on Tika. If 
I put it in the main source tree, Solr will not compile unless you put the Tika 
jar in lib. Agree?

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Priority: Minor
>
> We need the ability to detect the language of arbitrary text in order to act 
> upon it, such as indexing the content into language-aware fields. Another 
> use case is to be able to filter/facet on language for arbitrary unstructured 
> content.
> To do this, we should wrap the [Nutch 
> LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html] 
> in an UpdateProcessor. The processor should be configured like this:
> {code:xml} 
>   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">title,teaser,body</str>
>     <str name="isoOutputField">language</str>
>     <str name="fullOutputField">language_display</str>
>   </processor>  
> {code} 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


