First, check which transforms are already handled by Unicode normalization. Put 
this in all of your analyzer chains:

        <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>

You will probably also need this in solrconfig.xml to load the ICU jars:

  <!-- extras for ICU-based Unicode normalization -->
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib/" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />

I really cannot think of a reason to use unnormalized Unicode in Solr. The 
normalizer should be in all the sample config files.

For search character matching, yes, all spaces should be normalized. I have too 
many hacks fixing non-breaking spaces spread around the code. When matching, 
there is zero use for stuff like ideographic space (U+3000).
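
A quick way to check which space characters normalization already handles is to run a few samples through the JDK's own normalizer. This is only a sketch: it uses java.text.Normalizer with plain NFKC, which may not match the exact mode the ICU char filter is configured with, and the class and method names here are made up for illustration.

```java
import java.text.Normalizer;

public class NormalizationCheck {
    /** Applies NFKC compatibility normalization using the JDK normalizer. */
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /** Formats each code point as U+XXXX so invisible characters are readable. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // No-break space, thin space, ideographic space (sample choice is an assumption).
        String[] samples = {"\u00A0", "\u2009", "\u3000"};
        for (String s : samples) {
            System.out.println(codePoints(s) + " -> " + codePoints(nfkc(s)));
        }
    }
}
```

Any sample that comes back unchanged is one that would still need its own mapping or folding step.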

I’m not sure if quotes are normalized. I did some searching around without 
success. That might come under character folding. There was a draft, now 
withdrawn, for standard character folding. I’d probably start there for a 
Unicode folding char filter.

https://www.unicode.org/reports/tr30/tr30-4.html
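
For the quotes specifically, the folding itself is just a small character mapping. Here is a minimal sketch in plain Java, assuming the set of double-quote variants listed in the original request below. The QuoteFolder class and fold method are hypothetical names, not anything in Lucene or Solr, and where such a mapping would hook into the query parser or an analyzer chain is a separate question.

```java
public class QuoteFolder {
    // Double-quote variants from the list in the original request, written as
    // escapes so the source file stays ASCII: U+201C, U+201D, U+201E, U+201F,
    // U+00AB, U+00BB, U+275D, U+275E, U+2E42.
    private static final String DOUBLE_QUOTES =
        "\u201C\u201D\u201E\u201F\u00AB\u00BB\u275D\u275E\u2E42";

    /** Replaces every character in DOUBLE_QUOTES with the ASCII quotation mark. */
    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            out.append(DOUBLE_QUOTES.indexOf(c) >= 0 ? '"' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: "solr" "lucene"
        System.out.println(fold("\u201Csolr\u201D \u00ABlucene\u00BB"));
    }
}
```

All of the listed characters are in the Basic Multilingual Plane, so iterating by char is enough here. A general folding filter would need to work on code points instead.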

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 21, 2019, at 7:43 AM, Michael Sokolov <msoko...@gmail.com> wrote:
> 
> I think this is probably better to discuss on solr-user, or maybe solr-dev, 
> since it is dismax parser you are talking about, which really lives in Solr. 
> However, my 2c  - this seems somewhat dubious. Maybe people want to include 
> those in their terms? Also, it leads to a kind of slippery slope: would you 
> also want to convert all the various white space characters (no-break space, 
> thin space, em space, etc)  as vanilla ascii 32? How about all the other 
> "operator" characters like brackets?
> 
> On Mon, Jan 21, 2019 at 9:50 AM John Ryan <johnryan_...@yahoo.com.invalid> 
> wrote:
> I'm looking to create an issue to add support for Unicode Double Quotes to 
> the dismax parser. 
> 
> I want to replace all types of double quotes with standard ones before they 
> get stripped 
> 
> i.e.
>         “ ” „ “ „ « » ‟ ❝ ❞ ⹂ "
> 
> With 
>         "
> I presume this has been discussed before?
> 
> I have a POC here: 
> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x 
> 
> Thanks, 
> 
> John