Re: Unicode Quotes in query parser

John Ryan Mon, 28 Jan 2019 06:00:02 -0800

Thanks Michael,

The dismax handler does indeed run and escape all non-standard characters 
before handing the query off to the analysers and tokenisers. This fix looks 
like it belongs in the handler rather than in the parser. I wrote a 
SearchComponent handler to do the same thing at that level and can drop it in 
as a plugin - although it seems like something that should be available 
out of the box.
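
For illustration, the replacement step itself can be sketched in plain Java. 
This is a minimal sketch, not the code from the POC branch and not a Solr 
SearchComponent: the class name QuoteNormalizer and its normalize method are 
hypothetical, and the character set is simply the list of double-quote 
variants from the original proposal. A real component would presumably apply 
something like this to the query string before the parser sees it.

```java
// Hypothetical helper: maps Unicode double-quote variants to the ASCII quote.
final class QuoteNormalizer {

    // Variants from the original proposal:
    // U+201C, U+201D, U+201E, U+201F, U+00AB, U+00BB, U+275D, U+275E, U+2E42, U+FF02
    private static final String DOUBLE_QUOTE_VARIANTS =
            "\u201C\u201D\u201E\u201F\u00AB\u00BB\u275D\u275E\u2E42\uFF02";

    private QuoteNormalizer() {}

    // Returns a copy of the query with every listed variant replaced by '"'.
    static String normalize(String query) {
        if (query == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(query.length());
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            // All listed variants are single UTF-16 chars, so a per-char scan is enough.
            out.append(DOUBLE_QUOTE_VARIANTS.indexOf(c) >= 0 ? '"' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A curly-quoted phrase becomes an ASCII-quoted phrase.
        System.out.println(normalize("\u201Cunicode quotes\u201D parser"));
    }
}
```

Whether guillemets and the other variants should really be treated as phrase 
operators is exactly the open question raised further down the thread, so the 
list above is an assumption rather than a recommendation.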


I think I’ll close it and reconsider the implementation.

John

> On 22 Jan 2019, at 16:20, Michael Sokolov <msoko...@gmail.com> wrote:
> 
> Right - QueryParsers generally do a first pass, parsing incoming Strings 
> using their operator characters to tokenize the input, and only after that do 
> they pass the tokens (or phrases) to an Analyzer. I haven't checked Dismax - 
> not sure how it does its parsing exactly, but I doubt you can just "turn on 
> the right Analyzer" to get it to recognize curly quotes as phrase operators, 
> e.g.
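
A toy illustration of that two-stage behaviour, in plain Java. This is not 
how dismax or any Lucene QueryParser is actually implemented; the class 
FirstPassSketch and its firstPass method are hypothetical and only show why a 
first pass keyed on the ASCII quote never sees a curly-quoted phrase as a 
phrase, whatever the Analyzer does afterwards.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical first pass: whitespace separates terms, ASCII '"' delimits phrases.
final class FirstPassSketch {

    static List<String> firstPass(String query) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inPhrase = false;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '"') {
                // Only the ASCII quote is an operator: it closes the current chunk
                // and toggles phrase mode.
                flush(parts, current);
                inPhrase = !inPhrase;
            } else if (Character.isWhitespace(c) && !inPhrase) {
                flush(parts, current);
            } else {
                // Anything else, including curly quotes, is ordinary term text.
                current.append(c);
            }
        }
        flush(parts, current);
        return parts;
    }

    private static void flush(List<String> parts, StringBuilder current) {
        if (current.length() > 0) {
            parts.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstPass("\"solr query\" parser"));
        System.out.println(firstPass("\u201Csolr query\u201D parser"));
    }
}
```

With ASCII quotes the phrase survives as one unit; with curly quotes the same 
input falls apart into separate terms that still carry the quote characters, 
and those are what would be handed to analysis.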
> 
> On Tue, Jan 22, 2019 at 10:39 AM Mikhail Khludnev <m...@apache.org> wrote:
> My impression is that these quotes are ones which are part of the dismax 
> query syntax, i.e. they should be handled before the analysis happens. 
> 
> On Mon, Jan 21, 2019 at 8:09 PM Walter Underwood <wun...@wunderwood.org> wrote:
> First, check which transforms are already handled by Unicode normalization. 
> Put this in all of your analyzer chains:
> 
>         <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
> 
> Probably need this in solrconfig.xml:
> 
>   <!-- extras for ICU-based Unicode normalization -->
>   <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib/" regex=".*\.jar" />
>   <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
> 
> I really cannot think of a reason to use unnormalized Unicode in Solr. That 
> should be in all the sample files.
> 
> For search character matching, yes, all spaces should be normalized. I have 
> too many hacks fixing non-breaking spaces spread around the code. When 
> matching, there is zero use for stuff like ideographic space (U+3000).
> 
> I’m not sure if quotes are normalized. I did some searching around without 
> success. That might come under character folding. There was a draft, now 
> withdrawn, for standard character folding. I’d probably start there for a 
> Unicode folding char filter.
> 
> https://www.unicode.org/reports/tr30/tr30-4.html
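
One quick way to check which of these characters compatibility normalization 
already handles is to run them through NFKC. The sketch below uses the JDK's 
java.text.Normalizer as a stand-in; it is not the ICU char filter itself, 
whose configured mode may differ, and the class name NormalizationCheck is 
hypothetical.

```java
import java.text.Normalizer;

// Hypothetical check: print what NFKC does to a few single-character samples.
final class NormalizationCheck {

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // No-break space, ideographic space, left curly quote, fullwidth quote.
        String[] samples = {"\u00A0", "\u3000", "\u201C", "\uFF02"};
        for (String s : samples) {
            String n = nfkc(s);
            System.out.printf("U+%04X -> U+%04X%n", (int) s.charAt(0), (int) n.charAt(0));
        }
    }
}
```

Under NFKC the no-break space and ideographic space both become U+0020 and 
the fullwidth quotation mark becomes U+0022, but the curly quote U+201C has 
no compatibility mapping and comes back unchanged. So normalization covers 
the whitespace cases, while the curly quotes would need a separate folding 
step.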
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Jan 21, 2019, at 7:43 AM, Michael Sokolov <msoko...@gmail.com> wrote:
>> 
>> I think this is probably better to discuss on solr-user, or maybe solr-dev, 
>> since it is dismax parser you are talking about, which really lives in Solr. 
>> However, my 2c - this seems somewhat dubious. Maybe people want to include 
>> those in their terms? Also, it leads to a kind of slippery slope: would you 
>> also want to convert all the various white space characters (no-break space, 
>> thin space, em space, etc.) to vanilla ASCII 32? How about all the other 
>> "operator" characters like brackets?
>> 
>> On Mon, Jan 21, 2019 at 9:50 AM John Ryan <johnryan_...@yahoo.com.invalid> wrote:
>> I'm looking to create an issue to add support for Unicode Double Quotes to 
>> the dismax parser. 
>> 
>> I want to replace all types of double quotes with standard ones before they 
>> get stripped 
>> 
>> i.e.
>>         “ ” „ “ „ « » ‟ ❝ ❞ ⹂ ＂
>> 
>> With 
>>         "
>> I presume this has been discussed before?
>> 
>> I have a POC here: 
>> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x
>> 
>> Thanks, 
>> 
>> John
>> 
> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev
