[ 
https://issues.apache.org/jira/browse/SOLR-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16028252#comment-16028252
 ] 

Jan Rasehorn edited comment on SOLR-6492 at 6/28/17 8:00 AM:
-------------------------------------------------------------

Hi guys, this sounds like a solution for indexing a whole document when the 
document language is known upfront. 
But what if the language is not known upfront, or if a document contains 
paragraphs in different languages, as is often the case in support tickets?

Since I did not like the approach of using separate fields, I did it the 
following way:
1. I wrote a tokenizer that detects paragraphs based on a given regexp (a 
result of cleaning up the support ticket text).
2. The tokenizer detects each paragraph's language at runtime (using the Solr 
built-in language detector).
3. The tokenizer runs OpenNLP POS tagging according to the language it 
identified and saves the POS tag in the type attribute of each token. 
    The language is stored as the payload of each token.
4. I developed a "delegating filter", which delegates the "incrementToken" 
call to the wrapped filter (stemmer) only if the payload value matches the 
value configured for that filter. 
This way I can configure in schema.xml which stemmer to use for which language.
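
To make the flow concrete, here is a minimal plain-Java sketch of the steps above. It is not the actual implementation and deliberately does not use the Lucene TokenStream API: the class name, the Token type, the toy language detector and the toy stemmers are all hypothetical stand-ins, and the POS tagging from step 3 is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Hypothetical sketch: models the idea only, without any Solr/Lucene classes.
class PerParagraphAnalysisSketch {

    // A term plus the language of its paragraph (stands in for the payload).
    static final class Token {
        final String term;
        final String lang;
        Token(String term, String lang) { this.term = term; this.lang = lang; }
        @Override public String toString() { return term + "|" + lang; }
    }

    // Steps 1-3: split on a paragraph regexp, detect the language once per
    // paragraph, and tag every token of that paragraph with it.
    static List<Token> tokenize(String text, Pattern paragraphBreak,
                                Function<String, String> detectLanguage) {
        List<Token> tokens = new ArrayList<>();
        for (String paragraph : paragraphBreak.split(text)) {
            if (paragraph.trim().isEmpty()) continue;
            String lang = detectLanguage.apply(paragraph);
            for (String word : paragraph.trim().split("\\s+")) {
                tokens.add(new Token(word.toLowerCase(Locale.ROOT), lang));
            }
        }
        return tokens;
    }

    // Step 4: apply the stemmer only to tokens whose language tag matches;
    // all other tokens pass through unchanged.
    static List<Token> delegate(List<Token> in, String lang, UnaryOperator<String> stemmer) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            out.add(t.lang.equals(lang) ? new Token(stemmer.apply(t.term), t.lang) : t);
        }
        return out;
    }

    // Toy detector (assumption): a few German function words mean "de".
    static final Set<String> GERMAN_HINTS =
            new HashSet<>(Arrays.asList("der", "die", "das", "und", "nicht"));

    static String toyDetect(String paragraph) {
        for (String w : paragraph.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (GERMAN_HINTS.contains(w)) return "de";
        }
        return "en";
    }

    // Toy stemmers (assumption): strip one common suffix each.
    static String toyEnglishStem(String term) {
        return term.length() > 3 && term.endsWith("s")
                ? term.substring(0, term.length() - 1) : term;
    }

    static String toyGermanStem(String term) {
        return term.length() > 4 && term.endsWith("en")
                ? term.substring(0, term.length() - 2) : term;
    }

    // Chains the tokenizer and one delegating step per language.
    static List<Token> analyze(String text) {
        List<Token> tokens = tokenize(text, Pattern.compile("\\n\\s*\\n"),
                PerParagraphAnalysisSketch::toyDetect);
        tokens = delegate(tokens, "en", PerParagraphAnalysisSketch::toyEnglishStem);
        tokens = delegate(tokens, "de", PerParagraphAnalysisSketch::toyGermanStem);
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The printers are broken\n\nDie Drucker drucken nicht"));
    }
}
```

In this sketch the English word "broken" keeps its "en" ending because the German toy stemmer only sees tokens tagged "de", which is the point of delegating per token instead of per field.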

With this approach I do not depend on knowing the document language upfront.
What do you think?



> Solr field type that supports multiple, dynamic analyzers
> ---------------------------------------------------------
>
>                 Key: SOLR-6492
>                 URL: https://issues.apache.org/jira/browse/SOLR-6492
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Trey Grainger
>             Fix For: 5.0
>
>
> A common request - particularly for multilingual search - is to be able to 
> support one or more dynamically-selected analyzers for a field. For example, 
> someone may have a "content" field and pass in a document in Greek (using an 
> Analyzer with Tokenizer/Filters for Greek), a separate document in English 
> (using an English Analyzer), and possibly even a field with mixed-language 
> content in Greek and English. This latter case could pass the content 
> separately through both an analyzer defined for Greek and another Analyzer 
> defined for English, stacking or concatenating the token streams based upon 
> the use-case.
> There are some distinct advantages in terms of index size and query 
> performance which can be obtained by stacking terms from multiple analyzers 
> in the same field instead of duplicating content in separate fields and 
> searching across multiple fields. 
> Other non-multilingual use cases may include things like switching to a 
> different analyzer for the same field to remove a feature (e.g. turning 
> on/off query-time synonyms against the same field on a per-query basis).
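
As an illustration of the difference between stacking and concatenating token streams mentioned in the description, here is a small plain-Java sketch. It is only a model under stated assumptions: it uses no Solr/Lucene classes, both toy "analyzers" emit exactly one term per input word, and all names are hypothetical. Each token is shown as term@position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical sketch: positions are modelled as plain integers.
class StackVsConcatSketch {

    // A toy "analyzer": lowercase, split on whitespace, apply one stemmer.
    static List<String> analyze(String text, UnaryOperator<String> stem) {
        List<String> terms = new ArrayList<>();
        for (String w : text.trim().split("\\s+")) {
            terms.add(stem.apply(w.toLowerCase(Locale.ROOT)));
        }
        return terms;
    }

    // Stacking: terms from both analyzers share the same position, and a
    // term that both analyzers agree on is emitted only once.
    static List<String> stack(List<String> a, List<String> b) {
        if (a.size() != b.size()) {
            throw new IllegalArgumentException("sketch assumes aligned streams");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < a.size(); i++) {
            out.add(a.get(i) + "@" + i);
            if (!b.get(i).equals(a.get(i))) {
                out.add(b.get(i) + "@" + i);
            }
        }
        return out;
    }

    // Concatenating: the second stream continues after the first one ends.
    static List<String> concatenate(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        for (String t : a) out.add(t + "@" + pos++);
        for (String t : b) out.add(t + "@" + pos++);
        return out;
    }

    // Toy stemmers (assumption): strip one suffix each.
    static String stripS(String t) {
        return t.length() > 3 && t.endsWith("s") ? t.substring(0, t.length() - 1) : t;
    }

    static String stripEn(String t) {
        return t.length() > 4 && t.endsWith("en") ? t.substring(0, t.length() - 2) : t;
    }

    public static void main(String[] args) {
        List<String> a = analyze("printers drucken", StackVsConcatSketch::stripS);
        List<String> b = analyze("printers drucken", StackVsConcatSketch::stripEn);
        System.out.println(stack(a, b));
        System.out.println(concatenate(a, b));
    }
}
```

Stacking keeps phrase positions intact and avoids repeating identical terms, while concatenation is simpler but doubles the position range of the field.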


