Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]

Eric Lease Morgan Fri, 20 Feb 2015 09:05:08 -0800

On Feb 16, 2015, at 4:54 PM, Levy, Michael <[email protected]> wrote:

> I think you can accomplish what you want by using ICUFoldingFilterFactory
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory
> 
> which should simply perform ICU (cf http://site.icu-project.org/) based 
> character folding (cf. http://www.unicode.org/reports/tr30/tr30-4.html)
> 
> In schema.xml I generally have in both index and query:
> 
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.ICUFoldingFilterFactory" />



For unknown reasons, I was unable to load the ICUFoldingFilterFactory, but 
my interface nonetheless works as expected. Getting there took a 
combination of things. First, I needed to tell the indexer my content was 
Spanish, and after doing so, Solr parsed things correctly. Second, I needed to 
explicitly tell my Web browser that the search form and returned content were 
using UTF-8. This was done in the HTTP content-type header, the HTML meta tag, and 
even in the HTML form. Geesh! Through this whole process I also learned about 
Solr’s edismax (extended dismax) query parser. Edismax supports free form queries as 
well as Boolean logic.  solr++  But also solr+- because Solr is getting more 
and more and more complicated. —Eric “Lost In Chicago” Morgan
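
For anyone following along, here is a minimal sketch of the query side of the above: a Spanish query sent to Solr as UTF-8 and parsed with edismax. It is not the code from my interface. The host, port, core name (collection1), and the function name build_edismax_url are assumptions made up for illustration; q, defType, and wt are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name; adjust for a real installation.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"


def build_edismax_url(user_query, base=SOLR_SELECT):
    """Return a select URL asking Solr to parse user_query with edismax."""
    params = {
        "q": user_query,       # free-form text or Boolean syntax
        "defType": "edismax",  # select the extended dismax query parser
        "wt": "json",          # ask for a JSON response
    }
    # urlencode percent-encodes non-ASCII characters as UTF-8 by default,
    # so diacritics reach Solr in the same encoding the form declares.
    return base + "?" + urlencode(params)


url = build_edismax_url("niño AND canción")
print(url)
```

The point of the sketch is only that the diacritics travel as UTF-8 bytes in the request; whether they then match depends on how the field was analyzed at index time.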
