First, check which transforms are already handled by Unicode normalization. Put this in all of your analyzer chains:

<charFilter class="solr.ICUNormalizer2CharFilterFactory"/>

Probably need this in solrconfig.xml:

<!-- extras for ICU-based Unicode normalization -->
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib/" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />

I really cannot think of a reason to use unnormalized Unicode in Solr. That should be in all the sample files.

For search character matching, yes, all spaces should be normalized. I have too many hacks fixing non-breaking spaces spread around the code. When matching, there is zero use for stuff like ideographic space (U+3000).

I’m not sure if quotes are normalized. I did some searching around without success. That might come under character folding. There was a draft, now withdrawn, for standard character folding. I’d probably start there for a Unicode folding char filter.

https://www.unicode.org/reports/tr30/tr30-4.html

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 21, 2019, at 7:43 AM, Michael Sokolov <msoko...@gmail.com> wrote:
>
> I think this is probably better to discuss on solr-user, or maybe solr-dev,
> since it is dismax parser you are talking about, which really lives in Solr.
> However, my 2c - this seems somewhat dubious. Maybe people want to include
> those in their terms? Also, it leads to a kind of slippery slope: would you
> also want to convert all the various white space characters (no-break space,
> thin space, em space, etc) as vanilla ascii 32? How about all the other
> "operator" characters like brackets?
>
> On Mon, Jan 21, 2019 at 9:50 AM John Ryan <johnryan_...@yahoo.com.invalid>
> wrote:
>
> I'm looking to create an issue to add support for Unicode Double Quotes to
> the dismax parser.
>
> I want to replace all types of double quotes with standard ones before they
> get stripped
>
> i.e.
> “ ” „ “ „ « » ‟ ❝ ❞ ⹂ "
>
> With
> "
>
> I presume this has been discussed before?
>
> I have a POC here:
> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x
>
> Thanks,
>
> John
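
A quick way to probe the two open questions in this thread (are the odd spaces normalized, and are the quotes?) is to run the characters through NFKC outside of Solr. The sketch below uses Python's standard unicodedata module, not ICU, so it only approximates what the char filter does: as far as I know the ICU filter defaults to nfkc_cf, which also case folds, and results depend on the Unicode version bundled with your Python. The names survivors, QUOTE_FOLD, and fold_quotes are made up for illustration, and the fullwidth quotation mark is my addition as a contrast case; it is not in John's list.

```python
import unicodedata

# Spaces named in the thread: no-break, thin, em, and ideographic space.
SPACES = "\u00a0\u2009\u2003\u3000"

# Double-quote variants from John's list, written as code points.
QUOTES = "\u201c\u201d\u201e\u00ab\u00bb\u201f\u275d\u275e\u2e42"

# Not in the thread: FULLWIDTH QUOTATION MARK, added as a contrast case
# because it has a compatibility mapping to U+0022.
FULLWIDTH_QUOTE = "\uff02"


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization (no case folding)."""
    return unicodedata.normalize("NFKC", text)


def survivors(chars: str) -> str:
    """Return the characters that NFKC leaves unchanged."""
    return "".join(ch for ch in chars if nfkc(ch) == ch)


# Hypothetical folding table: every quote variant above becomes ASCII ".
QUOTE_FOLD = {ord(ch): '"' for ch in QUOTES + FULLWIDTH_QUOTE}


def fold_quotes(text: str) -> str:
    """Replace the listed double-quote variants with U+0022."""
    return text.translate(QUOTE_FOLD)


if __name__ == "__main__":
    # Show what NFKC does to each character, one per line.
    for ch in SPACES + QUOTES + FULLWIDTH_QUOTE:
        name = unicodedata.name(ch, "UNKNOWN")
        print(f"U+{ord(ch):04X} {name}: NFKC -> U+{ord(nfkc(ch)):04X}")
    print(fold_quotes("\u201cexact phrase\u201d"))
```

With Python's tables, the four spaces all come out as U+0020, while the curly, low, angle, and ornament quotes pass through NFKC untouched. That suggests normalization alone covers the whitespace case but not the quotes, so a separate folding step (the fold_quotes idea here, or a mapping char filter in the analyzer chain) would still be needed for those.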