First, check which transforms are already handled by Unicode normalization. Put 
this in all of your analyzer chains:

        <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>

You will probably need this in solrconfig.xml:

  <!-- extras for ICU-based Unicode normalization -->
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib/" regex=".*\.jar" />

I really cannot think of a reason to use unnormalized Unicode in Solr. The ICU 
normalizer should be in all the sample files.

For search character matching, yes, all spaces should be normalized. I have too 
many hacks fixing non-breaking spaces spread around the code. When matching, 
there is zero use for stuff like ideographic space (U+3000).
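
A quick way to see which space characters compatibility normalization already covers is to test them outside Solr. The sketch below uses Python's standard unicodedata module with NFKC. That is an assumption standing in for the ICU char filter, whose default form (nfkc_cf, as far as I know) also applies case folding, so treat the result as a hint and confirm it on the Solr analysis screen. The selection of characters is illustrative.

```python
import unicodedata

# A few non-ASCII spaces mentioned in this thread (illustrative selection).
SPACES = {
    "NO-BREAK SPACE": "\u00a0",
    "EM SPACE": "\u2003",
    "THIN SPACE": "\u2009",
    "IDEOGRAPHIC SPACE": "\u3000",
}


def nfkc(text: str) -> str:
    """Apply NFKC compatibility normalization to the given text."""
    return unicodedata.normalize("NFKC", text)


# Report what each space character becomes after normalization.
for name, ch in SPACES.items():
    folded = nfkc(ch)
    codepoints = " ".join(f"U+{ord(c):04X}" for c in folded)
    print(f"U+{ord(ch):04X} {name} -> {codepoints}")
```

Each of these four characters has a compatibility mapping to U+0020, so NFKC folds them to a plain ASCII space.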

I’m not sure if quotes are normalized. I did some searching around without 
success. That might come under character folding. There was a draft, now 
withdrawn, for standard character folding. I’d probably start there for a 
Unicode folding char filter.
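
To make the folding idea concrete, here is a minimal sketch in Python of a quote-folding table. The character list is taken from the original request quoted below; the names QUOTE_VARIANTS, QUOTE_FOLD, fold_quotes, and survives_nfkc are made up for illustration, and this is not the table from the withdrawn draft. The last helper checks whether NFKC alone changes a character, as a way to test whether normalization by itself covers quotes.

```python
import unicodedata

# Double-quote variants from the original request (illustrative, not exhaustive).
QUOTE_VARIANTS = "\u201c\u201d\u201e\u201f\u00ab\u00bb\u275d\u275e\u2e42"

# Hypothetical folding table: every variant maps to ASCII U+0022.
QUOTE_FOLD = str.maketrans({ch: '"' for ch in QUOTE_VARIANTS})


def fold_quotes(text: str) -> str:
    """Replace each listed quote variant with a plain ASCII double quote."""
    return text.translate(QUOTE_FOLD)


def survives_nfkc(ch: str) -> bool:
    """Return True if NFKC normalization leaves the character unchanged."""
    return unicodedata.normalize("NFKC", ch) == ch


query = "\u201csolr dismax\u201d \u00abparser\u00bb"
print(fold_quotes(query))

# List the variants that NFKC does not touch.
untouched = [f"U+{ord(ch):04X}" for ch in QUOTE_VARIANTS if survives_nfkc(ch)]
print("unchanged by NFKC:", ", ".join(untouched))
```

In Solr, a table like this could presumably be expressed as a mapping file for solr.MappingCharFilterFactory; that is untested here.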

Walter Underwood

> On Jan 21, 2019, at 7:43 AM, Michael Sokolov wrote:
> I think this is probably better to discuss on solr-user, or maybe solr-dev, 
> since it is dismax parser you are talking about, which really lives in Solr. 
> However, my 2c  - this seems somewhat dubious. Maybe people want to include 
> those in their terms? Also, it leads to a kind of slippery slope: would you 
> also want to convert all the various white space characters (no-break space, 
> thin space, em space, etc)  as vanilla ascii 32? How about all the other 
> "operator" characters like brackets?
> On Mon, Jan 21, 2019 at 9:50 AM John Ryan wrote:
> I'm looking to create an issue to add support for Unicode Double Quotes to 
> the dismax parser. 
> I want to replace all types of double quotes with standard ones before they 
> get stripped 
> i.e.
>         “ ” „ “ „ « » ‟ ❝ ❞ ⹂ "
> With 
>         "
> I presume this has been discussed before?
> I have a POC here: 
> <>
> Thanks, 
> John