Re: Query for German "Special Characters" (i.e., ä, ö, ß)

Tom Hill Fri, 14 Sep 2007 12:22:54 -0700

Hi Marc,

A search looks for an exact match between the query terms (after
analysis) and the indexed terms (after analysis).

So the query term "realli" will not match the indexed term "really".

So you want the same stemmer (probably not the English one, given your
examples) in both the index analyzer and the query analyzer. I've
appended the section from the Solr 1.2 example schema.xml; note that
EnglishPorterFilterFactory is in both sections. That is what you want
to do, with the appropriate stemmer for your application.

Or, you could use no stemmer for BOTH, but I think most people go with
stemming. At least, I do. :-)

Tom

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
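
For German text, a variant might look roughly like the sketch below. It is
only an illustration under a few assumptions: the field type name "text_de"
is made up, and it assumes solr.SnowballPorterFilterFactory with
language="German" is available in your Solr version. The one thing it is
meant to show is that the same stemmer appears in both the index analyzer
and the query analyzer.

```xml
<!-- Hypothetical German field type; the name "text_de" is an assumption. -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumed stemmer choice: Snowball German, used at index time... -->
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- ...and the identical stemmer at query time, so the terms line up. -->
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

After changing the analyzers you need to reindex, because the terms already
in the index were produced by the old analysis chain. The analysis page in
the admin UI is a quick way to confirm that index and query output agree.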

On 9/14/07, Marc Bechler <[EMAIL PROTECTED]> wrote:
>
> Index for "really": 5* really. Query for "really": 5* really, 2* realli
> (from: EnglishPorterFilterFactory {protected=protwords.txt},
> RemoveDuplicatesTokenFilterFactory {})
>
> For "this" everything is completely fine.
>
> Is a complete matching required between index and query or is a partial
> matching also okay?
>
> Thanks for helping me
>
>   marc
>
>
>
>
> Tom Hill wrote:
> > Hi Marc,
> >
> > Are you using the same stemmer on your queries that you use when
> > indexing?
> >
> > Try the analysis function in the admin UI, to see how things are stemmed
> > for indexing vs. querying. If they don't match for really and fünny, and
> > do match for kraßen, then that's your problem.
> >
> > Tom
> >
> >
> > On 9/14/07, Marc Bechler <[EMAIL PROTECTED]> wrote:
> >> Hi,
> >>
> >> oops, the URIEncoding was lost during the update to tomcat 6.0.14.
> >> Thanks for the advice.
> >>
> >> But now I am really curious. After indexing the document from scratch,
> >> I have the effect that queries for "this" and "is" work fine, whereas
> >> queries for "really" and "fünny" do not return the result. Fünnily ;-) ,
> >> after extending my sometext to "This is really fünny kraßen.", queries
> >> for "really" and "fünny" still do not work, but "kraßen" is found.
> >> Now I am somehow confused -- hopefully someone has a good explanation ;-)
> >>
> >> Regards,
> >>
> >>   marc
> >>
> >>> Tom Hill wrote:
> >>>> If you are using Tomcat, try adding URIEncoding="UTF-8" to your
> >>>> Tomcat connector.
> >>>>
> >>>> <Connector port="8080" maxHttpHeaderSize="8192" maxThreads="150"
> >>>> minSpareThreads="25" maxSpareThreads="75" enableLookups="false"
> >>>> redirectPort="8443" acceptCount="100" connectionTimeout="20000"
> >>>> disableUploadTimeout="true" URIEncoding="UTF-8" />
> >>>>
> >>>> Use the analysis page of the admin interface to check what's
> >>>> happening to your queries, too.
> >>>>
> >>>> http://localhost:8080/solr/admin/analysis.jsp?highlight=on  (your
> >>>> port # may vary)
> >>>>
> >>>> Tom
> >>>>
> >>>> On 9/13/07, Marc Bechler < [EMAIL PROTECTED]> wrote:
> >>>>> Hi SOLR kings,
> >>>>>
> >>>>> I'm just playing around with queries, but I was not able to query
> >>>>> for any special characters like the German "Umlaute" ( i.e., ä, ö,
> >>>>> ü). Maybe others might have the same effects and already found a
> >>>>> solution ;-)
> >>>>>
> >>>>> Here is my example: I have one field called "sometext" of type
> >>>>> "text" (the one delivered with the SOLR example). I indexed a few
> >>>>> words similar to
> >>>>>
> >>>>> <field name="sometext"> <![CDATA[ This is really fünny
> >>>>> ]]></field>
> >>>>>
> >>>>> Works fine, and searching for "really" shows the result and fünny
> >>>>> will be displayed correctly. However, the query for "fünny" using
> >>>>> the /solr/admin page is resolved (correctly) to the URL
> >>>>> ...q=f%C3%BCnny... but does not find the document.
> >>>>>
> >>>>> And now the question: Any ideas? ;-)
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> marc
> >>>>>
> >
>
