Thanks Walter,

My testing and research with solr.ICUNormalizer2CharFilterFactory lead me to 
believe that quotes are not normalised.
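
To make that concrete, here is a minimal sketch of the kind of check I mean. It uses Python's unicodedata module as a stand-in for ICU, which is an assumption on my part: I believe the filter defaults to nfkc_cf, and plain NFKC should agree with it for these code points, but this is not the Solr code path. The names SAMPLES, nfkc and report are mine, purely for illustration.

```python
import unicodedata

# A few quote characters from my first mail, plus two space variants and the
# fullwidth quotation mark for comparison.
SAMPLES = {
    "NO-BREAK SPACE": "\u00a0",
    "IDEOGRAPHIC SPACE": "\u3000",
    "FULLWIDTH QUOTATION MARK": "\uff02",
    "LEFT DOUBLE QUOTATION MARK": "\u201c",
    "RIGHT DOUBLE QUOTATION MARK": "\u201d",
    "DOUBLE LOW-9 QUOTATION MARK": "\u201e",
    "LEFT-POINTING DOUBLE ANGLE QUOTATION MARK": "\u00ab",
    "RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK": "\u00bb",
}


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization (stand-in for the ICU char filter)."""
    return unicodedata.normalize("NFKC", text)


def report() -> dict:
    """Map each sample name to True if NFKC changed the character."""
    return {name: nfkc(ch) != ch for name, ch in SAMPLES.items()}


if __name__ == "__main__":
    for name, changed in report().items():
        print(f"{name}: {'normalized' if changed else 'left as-is'}")
```

The spaces and the fullwidth mark carry compatibility decompositions and so are rewritten, while the curly, low-9 and angle quotes have none and pass through untouched.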

I attempted to do this with character folding. There are many implementations 
out there, but none of them actually seems to work.
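
What I am after is roughly the following. This is only a sketch in Python, not the code in my POC, and the names DOUBLE_QUOTE_VARIANTS and fold_double_quotes are made up for illustration. It maps each of the quote characters from my original list to the ASCII double quote and leaves everything else alone.

```python
# Double-quote variants from the original list, written as escapes so the
# code points are unambiguous:
# U+201C, U+201D, U+201E, U+201F, U+00AB, U+00BB, U+275D, U+275E, U+2E42
DOUBLE_QUOTE_VARIANTS = "\u201c\u201d\u201e\u201f\u00ab\u00bb\u275d\u275e\u2e42"

# Translation table: every variant maps to the ASCII quotation mark (U+0022).
QUOTE_TABLE = str.maketrans({ch: '"' for ch in DOUBLE_QUOTE_VARIANTS})


def fold_double_quotes(query: str) -> str:
    """Replace Unicode double-quote variants with the ASCII double quote."""
    return query.translate(QUOTE_TABLE)


if __name__ == "__main__":
    print(fold_double_quotes("\u201cnew york\u201d pizza"))
```

In Solr, I assume this would have to run on the raw query string before the dismax parser looks for phrase quotes; where best to hook that in is the open question.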

I’ll look into the draft.
        
Thanks
--
John  

> On 21 Jan 2019, at 17:09, Walter Underwood <wun...@wunderwood.org> wrote:
> 
> First, check which transforms are already handled by Unicode normalization. 
> Put this in all of your analyzer chains:
> 
>         <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
> 
> Probably need this in solrconfig.xml:
> 
>   <!-- extras for ICU-based Unicode normalization -->
>   <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib/" regex=".*\.jar" />
>   <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
> 
> I really cannot think of a reason to use unnormalized Unicode in Solr. This 
> char filter should be in all the sample files.
> 
> For search character matching, yes, all spaces should be normalized. I have 
> too many hacks fixing non-breaking spaces spread around the code. When 
> matching, there is zero use for stuff like ideographic space (U+3000).
> 
> I’m not sure if quotes are normalized. I did some searching around without 
> success. That might come under character folding. There was a draft, now 
> withdrawn, for standard character folding. I’d probably start there for a 
> Unicode folding char filter.
> 
> https://www.unicode.org/reports/tr30/tr30-4.html
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Jan 21, 2019, at 7:43 AM, Michael Sokolov <msoko...@gmail.com> wrote:
>> 
>> I think this is probably better to discuss on solr-user, or maybe solr-dev, 
>> since it is the dismax parser you are talking about, which really lives in 
>> Solr. However, my 2c: this seems somewhat dubious. Maybe people want to 
>> include those in their terms? Also, it leads to a kind of slippery slope: 
>> would you also want to convert all the various white space characters 
>> (no-break space, thin space, em space, etc.) to vanilla ASCII 32? How about 
>> all the other "operator" characters like brackets?
>> 
>> On Mon, Jan 21, 2019 at 9:50 AM John Ryan <johnryan_...@yahoo.com.invalid> wrote:
>> I'm looking to create an issue to add support for Unicode Double Quotes to 
>> the dismax parser. 
>> 
>> I want to replace all types of double quotes with standard ones before they 
>> get stripped 
>> 
>> i.e.
>>         “ ” „ “ „ « » ‟ ❝ ❞ ⹂ "
>> 
>> With 
>>         "
>> I presume this has been discussed before?
>> 
>> I have a POC here: 
>> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x
>> 
>> Thanks, 
>> 
>> John
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>> 
> 
