[ 
https://issues.apache.org/jira/browse/SOLR-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126258#comment-14126258
 ] 

Trey Grainger commented on SOLR-6492:
-------------------------------------

I previously implemented this field type when writing chapter 14 of _Solr in 
Action_, but I would like to make some improvements and then submit the code 
back to Solr to (hopefully) be committed. The current code from _Solr in 
Action_ can be found here:
[https://github.com/treygrainger/solr-in-action/tree/first-edition/src/main/java/sia/ch14]

To use the current version, you would do the following:
1) Add the following to schema.xml:
  <fieldType name="multiText"
        class="sia.ch14.MultiTextField" sortMissingLast="true"
        defaultFieldType="text_general"
        fieldMappings="en:text_english,
                       es:text_spanish,
                       fr:text_french,
                       de:text_german"/> *

  <field name="someMultiTextField" type="multiText" indexed="true" 
multiValued="true" />

  *Note that "text_spanish", "text_english", "text_french", and "text_german" 
refer to field types that are defined elsewhere in schema.xml.

2) Index a document with a field containing multilingual text using syntax like 
one of the following:
  <field name="someMultiTextField">some text</field> **
  <field name="someMultiTextField">en|some text</field>
  <field name="someMultiTextField">es|some more text</field>
  <field name="someMultiTextField">de,fr|some other text</field>

  **uses the default analyzer

3) Submit a query specifying which language(s) you want to query in:
  /select?q=someMultiTextField:en,de|keyword_goes_here
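
The "lang1,lang2|text" prefix convention used in steps 2 and 3 could be parsed 
roughly as shown below. This is only an illustrative sketch, not the code from 
sia.ch14.MultiTextField: the class name PrefixedValue and its parse method are 
hypothetical, and it assumes that a value without a "|" falls back to the 
default field type and that the first "|" always ends the language list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: splits "en,de|some text" into language keys and text. */
public final class PrefixedValue {
    /** Language keys from the prefix; empty means "use defaultFieldType". */
    public final List<String> languages;
    /** The text that remains after the prefix has been removed. */
    public final String text;

    private PrefixedValue(List<String> languages, String text) {
        this.languages = languages;
        this.text = text;
    }

    public static PrefixedValue parse(String raw) {
        int bar = raw.indexOf('|');
        if (bar < 0) {
            // No prefix: assume the default analyzer applies to the whole value.
            return new PrefixedValue(Collections.<String>emptyList(), raw);
        }
        List<String> langs = new ArrayList<>();
        for (String key : raw.substring(0, bar).split(",")) {
            String trimmed = key.trim();
            if (!trimmed.isEmpty()) {
                langs.add(trimmed);
            }
        }
        return new PrefixedValue(Collections.unmodifiableList(langs),
                                 raw.substring(bar + 1));
    }

    public static void main(String[] args) {
        PrefixedValue v = parse("de,fr|some other text");
        System.out.println(v.languages + " -> " + v.text);
    }
}
```

Each key in the returned list would then be looked up in the fieldMappings 
attribute to pick the analyzer(s) to run over the remaining text.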

--------------------------------------

Improvements to be made before the patch is finalized:
1) Make it possible to specify the field type mappings in the field name 
instead of the field value:
  <field name="someMultiTextField|de,fr">some other text</field>
  /select?q=a bunch of keywords here&df=someMultiTextField|en,de

This makes querying easier, because the languages can be detected before the 
query is parsed. That avoids having to substitute a prefix onto each query 
term, which is cost-prohibitive for most users because it effectively means 
pre-parsing the query before it goes to Solr.

2) Enable support for switching between "stacking" token streams from each 
analyzer (good default because it mostly respects position increments across 
languages and minimizes duplicate tokens in the index) and concatenating token 
streams.

3) Possibly add the ability to switch analyzers in the middle of input text:
<field name="someMultiTextField">de,fr|some other el|text</field>

4) Extensive unit testing
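
To make the difference between "stacking" and concatenating in item 2 concrete, 
here is a small sketch in plain Java. It does not use the Lucene API, and the 
class TokenStreamMerge, its Token type, and the two sample token lists are all 
hypothetical stand-ins for the output of two analyzers run over the same text. 
Stacking overlays the streams on the same positions and drops duplicate terms 
at a position; concatenating appends each stream after the previous one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of two ways to merge analyzer outputs. */
public final class TokenStreamMerge {

    /** A term plus its absolute position in the field. */
    public static final class Token {
        public final String term;
        public final int position;
        public Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
        @Override public String toString() { return term + "@" + position; }
    }

    /** Builds a stream with one term per position, starting at position 0. */
    public static List<Token> stream(String... terms) {
        List<Token> tokens = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            tokens.add(new Token(terms[i], i));
        }
        return tokens;
    }

    /** Stacking: keep original positions, drop duplicate terms at a position. */
    public static List<Token> stack(List<List<Token>> streams) {
        Set<String> seen = new HashSet<>();
        List<Token> merged = new ArrayList<>();
        for (List<Token> s : streams) {
            for (Token t : s) {
                if (seen.add(t.position + ":" + t.term)) {
                    merged.add(t);
                }
            }
        }
        // Stable sort, so terms sharing a position keep stream order.
        merged.sort((a, b) -> Integer.compare(a.position, b.position));
        return merged;
    }

    /** Concatenating: shift each stream so it starts after the previous one. */
    public static List<Token> concatenate(List<List<Token>> streams) {
        List<Token> merged = new ArrayList<>();
        int offset = 0;
        for (List<Token> s : streams) {
            int maxPos = -1;
            for (Token t : s) {
                merged.add(new Token(t.term, offset + t.position));
                maxPos = Math.max(maxPos, t.position);
            }
            offset += maxPos + 1;
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hand-written stand-ins for two analyzers' output on the same input.
        List<List<Token>> streams = Arrays.asList(
            stream("the", "run", "dog"),
            stream("the", "running", "dogs"));
        System.out.println("stacked:      " + stack(streams));
        System.out.println("concatenated: " + concatenate(streams));
    }
}
```

In this toy example the stacked result holds five tokens across three 
positions, while the concatenated result holds six tokens across six positions. 
In Lucene terms, a stacked token would typically carry a position increment of 
0 relative to the token it overlays.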

> Solr field type that supports multiple, dynamic analyzers
> ---------------------------------------------------------
>
>                 Key: SOLR-6492
>                 URL: https://issues.apache.org/jira/browse/SOLR-6492
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Trey Grainger
>             Fix For: 4.11
>
>
> A common request - particularly for multilingual search - is to be able to 
> support one or more dynamically-selected analyzers for a field. For example, 
> someone may have a "content" field and pass in a document in Greek (using an 
> Analyzer with Tokenizer/Filters for Greek), a separate document in English 
> (using an English Analyzer), and possibly even a field with mixed-language 
> content in Greek and English. This latter case could pass the content 
> separately through both an analyzer defined for Greek and another Analyzer 
> defined for English, stacking or concatenating the token streams based upon 
> the use-case.
> There are some distinct advantages in terms of index size and query 
> performance which can be obtained by stacking terms from multiple analyzers 
> in the same field instead of duplicating content in separate fields and 
> searching across multiple fields. 
> Other non-multilingual use cases may include things like switching to a 
> different analyzer for the same field to remove a feature (e.g., turning 
> on/off query-time synonyms against the same field on a per-query basis).



