First, check which transforms are already handled by Unicode normalization. Put
this in all of your analyzer chains:
<charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
You will probably also need this in solrconfig.xml:
<!-- extras for ICU-based Unicode normalization -->
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib/"
     regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs"
     regex=".*\.jar" />
I really cannot think of a reason to use unnormalized Unicode in Solr. The ICU
normalizer should be in all the sample files.
For search character matching, yes, all spaces should be normalized. I have too
many hacks fixing non-breaking spaces spread around the code. When matching,
there is zero use for stuff like ideographic space (U+3000).
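
One way to check which characters normalization already handles is to run a few of them through NFKC and compare code points. The sketch below uses the JDK's own java.text.Normalizer rather than ICU, so it is only an approximation of the char filter: ICUNormalizer2CharFilterFactory defaults to nfkc_cf (NFKC plus case folding), and the JDK and ICU may ship different Unicode versions. The class and method names are made up for illustration.

```java
import java.text.Normalizer;

class NormalizationCheck {
    // Apply NFKC using the JDK's built-in normalizer (not ICU).
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Render a string as U+XXXX code points so invisible characters show up.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[] samples = {
            "\u00A0", // no-break space
            "\u2009", // thin space
            "\u2003", // em space
            "\u3000", // ideographic space
            "\u201C", // left double quotation mark
            "\u00AB", // left-pointing double angle quotation mark
        };
        // Print each sample next to its NFKC form for comparison.
        for (String s : samples) {
            System.out.println(codePoints(s) + " -> " + codePoints(nfkc(s)));
        }
    }
}
```

The space variants above all have compatibility mappings to U+0020, so they should come out as a plain space. The curly and angle quotes have no such mapping, so NFKC should leave them unchanged, which suggests quotes need separate folding.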
I’m not sure if quotes are normalized. I did some searching around without
success. That might come under character folding. There was a draft, now
withdrawn, for standard character folding. I’d probably start there for a
Unicode folding char filter.
https://www.unicode.org/reports/tr30/tr30-4.html
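
As a rough illustration of what quote folding could look like, here is a minimal sketch that maps the double-quote variants listed in the original proposal to ASCII ". It is not the TR30 folding and not an existing Solr or Lucene filter; the class name, method name, and the character set are assumptions taken from that list, and a real char filter would also have to track offsets.

```java
class QuoteFolding {
    // Double-quote variants from the proposal's list:
    // U+201C, U+201D, U+201E, U+201F, U+00AB, U+00BB, U+275D, U+275E, U+2E42.
    static final String DOUBLE_QUOTES =
        "\u201C\u201D\u201E\u201F\u00AB\u00BB\u275D\u275E\u2E42";

    // Replace every listed variant with an ASCII double quote.
    // All listed characters are in the BMP, so char-by-char iteration is safe;
    // surrogate pairs pass through untouched.
    static String foldDoubleQuotes(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            out.append(DOUBLE_QUOTES.indexOf(c) >= 0 ? '"' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Query text with curly and angle quotes around two terms.
        System.out.println(foldDoubleQuotes("\u201Csolr\u201D \u00ABlucene\u00BB"));
    }
}
```

For phrase queries the folding would have to run on the raw query string before the parser looks for quote operators. On the analysis side, a mapping-based char filter could carry the same table.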
wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/ (my blog)
> On Jan 21, 2019, at 7:43 AM, Michael Sokolov <[email protected]> wrote:
>
> I think this is probably better to discuss on solr-user, or maybe solr-dev,
> since it is dismax parser you are talking about, which really lives in Solr.
> However, my 2c - this seems somewhat dubious. Maybe people want to include
> those in their terms? Also, it leads to a kind of slippery slope: would you
> also want to convert all the various white space characters (no-break space,
> thin space, em space, etc) as vanilla ascii 32? How about all the other
> "operator" characters like brackets?
>
> On Mon, Jan 21, 2019 at 9:50 AM John Ryan <[email protected]>
> wrote:
> I'm looking to create an issue to add support for Unicode Double Quotes to
> the dismax parser.
>
> I want to replace all types of double quotes with standard ones before they
> get stripped
>
> i.e.
> “ ” „ “ „ « » ‟ ❝ ❞ ⹂ "
>
> With
> "
> I presume this has been discussed before?
>
> I have a POC here:
> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x
>
> Thanks,
>
> John