Hi,

we are passing a multivalued field to the LanguageIdentifierUpdateProcessor.
This multivalued field contains values of arbitrary types (Integer, String, Date).
The method LanguageIdentifierUpdateProcessor.concatFields(SolrInputDocument doc,
String[] fields), which by the way does not use its fields parameter, does not
read all values of a multivalued field.
The call "Object content = doc.getFieldValue(fieldName);" does not care what
type the field is and just delegates to SolrInputDocument, which in turn calls
getFirstValue. So only the first value of the field is ever looked at.

So, two issues:
first - if the first value of the multivalued field is not of type String, the 
field is ignored completely.

second - the concat method does not concatenate all values of a multivalued
field, although http://www.mail-archive.com/solr-user@lucene.apache.org/msg90530.html
states:
"The feature is designed to detect exactly one language per field.
In case of multValued, it will concatenate all values before detection."
I don't see how the code could do this.

Is this a bug? Is this a deliberate design decision? Did we miss some
configuration that would allow the language identification to use all values
of a multivalued field?
We are about to write our own LangDetectLanguageIdentifierUpdateProcessorFactory
(why is getInstance hardcoded to return LanguageIdentifierUpdateProcessor?) and
override LanguageIdentifierUpdateProcessor to handle all values of a
multivalued field, ignoring non-String values.
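
For illustration, the behaviour we would expect from concatFields looks roughly
like the sketch below. It is not the Solr code: a plain Map stands in for
SolrInputDocument, and the names ConcatSketch and concatStringValues are made up
for this mail. In a real processor the values would presumably come from
SolrInputDocument.getFieldValues(fieldName) rather than getFieldValue(fieldName).

```java
import java.util.Collection;
import java.util.Map;

// Hypothetical sketch, not part of Solr. The Map is a stand-in for
// SolrInputDocument: field name -> all values of that field.
class ConcatSketch {

    // Concatenates every String value of the given fields, separated by a
    // single space. Non-String values (Integer, Date, ...) are skipped
    // instead of causing the whole field to be ignored.
    static String concatStringValues(Map<String, Collection<Object>> doc,
                                     String[] fields) {
        StringBuilder sb = new StringBuilder();
        for (String fieldName : fields) {
            Collection<Object> values = doc.get(fieldName);
            if (values == null) {
                continue; // field not present in this document
            }
            for (Object value : values) {
                if (value instanceof String) {
                    if (sb.length() > 0) {
                        sb.append(' ');
                    }
                    sb.append((String) value);
                }
            }
        }
        return sb.toString();
    }
}
```

With this approach a textbody field whose first value is an Integer would still
contribute its String values to the text passed to the language detector.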

Please see configuration below.

I hope I was able to make myself clear.

Regards,
Stephan


A little background:
We are using a 3rd-party CMS framework which pulls in some predefined Solr
configuration (namely the textbody field).

The field we are passing is defined as 
    <!--
      The default text search field.
      This field and the field name_tokenized are used as default search fields
      for the /editor and /cmdismax search request handlers in solrconfig.xml.

      For the Content Feeder the text of all indexed fields of
      the CoreMedia document is stored in this field.
      The CAE Feeder by default stores the text of all elements in
      this field.
    -->
    <field name="textbody" type="text_general" stored="false" multiValued="true"/>

As you can see, it is also used as a search field, therefore we want to keep the
actual data types of the values.
The field itself is generated by a processor that runs before the language
identification (see processor chain).


The processor chain:
  <updateRequestProcessorChain>
    <!-- Improve error messages -->
    <processor class="3rdpartypackage.ErrorHandlingProcessorFactory" />
    <!-- Blob extraction -->
    <processor class="3rdpartypackage.BinaryDataProcessorFactory">
    <!-- some comments -->
    </processor>

    <!-- Textbody handling -->
    <processor class="3rdpartypackage.TextBodyProcessorFactory" />
    <!-- Copy content of field name to name_tokenized -->
    <processor class="solr.CloneFieldUpdateProcessorFactory">
      <str name="source">name</str>
      <str name="dest">name_tokenized</str>
    </processor>
    <!--Language detection -->
    <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <str name="langid.fl">textbody,name_tokenized</str>
      <str name="langid.langField">language</str>
      <str name="langid.fallback">en</str>
    </processor>
    <!-- Index into language dependent fields if defined (e.g. textbody_en instead of textbody) -->
    <processor class="3rdpartypackage.solr.update.processor.LanguageDependentFieldsProcessorFactory">
      <str name="languageField">language</str>
      <str name="textFields">textbody,name_tokenized</str>
    </processor>

    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

