Re: analyzing wildcard queries ...

mkamm78 Wed, 19 Nov 2014 04:31:20 -0800

hi jörg
 
thank you for your quick response!
 
glad to hear that you agree wildcard analysis could be improved further 
(the prefix support is already great!).
i had already started looking around for other solutions, like writing a 
plugin that uses a custom query parser. but, provided i'm not misinterpreting 
your answer, 
improving the getPossiblyAnalyzedWildcardQuery method does not 
sound completely absurd to you, and it is not the wrong place or approach 
(you could also have told me that i need to write a plugin, or somehow 
register some kind of query parser subclass, or given other reasons why 
this method is written the way it is).
 
so for the moment i will stick with "my" improved 
getPossiblyAnalyzedWildcardQuery method and do further testing with more 
data and larger indices to see how it performs (as i mentioned initially, 
i need to "generate" even more wildcards, including leading ones, to 
produce the desired results/matches) ...
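
a rough sketch of the splitting idea, in python rather than java so it runs without lucene. this is not the actual patch: toy_analyze is only a stand-in for the field analyzer, and split_wildcard_term / to_query_string are names made up for illustration.

```python
import re


def toy_analyze(text):
    """Toy stand-in for the field analyzer (NOT Lucene's StandardAnalyzer).

    Lowercases and keeps runs of letters/digits joined by inner dots.
    """
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())


def split_wildcard_term(term, leading=False):
    """Turn one wildcard term into a list of per-token wildcard patterns."""
    # The capturing group makes re.split keep the wildcard runs, so literal
    # chunks sit at even indices and wildcard runs at odd indices.
    parts = re.split(r"([*?]+)", term)
    patterns = []
    for i in range(0, len(parts), 2):
        tokens = toy_analyze(parts[i])
        if not tokens:
            continue
        before = parts[i - 1] if i > 0 else ""
        after = parts[i + 1] if i + 1 < len(parts) else ""
        if leading and before:
            # optional: also generate a leading wildcard on the first token
            tokens[0] = before + tokens[0]
        if after:
            # the wildcard that followed the chunk goes on its last token
            tokens[-1] = tokens[-1] + after
        patterns.extend(tokens)
    return patterns


def to_query_string(field, patterns, operator="AND"):
    """Join the patterns into a query_string-style expression."""
    return f" {operator} ".join(f"{field}:{p}" for p in patterns)


print(to_query_string("contact_email", split_wildcard_term("marco.*@brain.net")))
# prints: contact_email:marco* AND contact_email:brain.net
```

with leading=True the second clause becomes *brain.net, i.e. the "leading wildcard" variant mentioned above.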
 
as soon as i'm convinced of the "improvement" i'll clean up the code and 
try to do a fork so you can have a look at it 
(PS: i need to familiarize myself a bit more with git first, since i'm still 
one of the old-school svn guys ;-), but i think i will somehow manage to do 
a fork / commit) ...
 
i would like to help further improve such a great product as 
elasticsearch 
 
cheers marco


On Wednesday, 19 November 2014 09:56:43 UTC+1, [email protected] wrote:

>  hi
>
> i have text/email addresses indexed with the standard analyzer. 
>
> e.g.
>
> "marco.kamm@brain.net" that results in two tokens being in the index:
>
> [marco.kamm] and [brain.net]
>
> i want to search using query_string query and wildcards like:
>
> {
>   "fields": ["contact_email"],
>   "query" : {
>     "query_string" : {
>       "query" : "(contact_email:(marco.*@brain.net))",
>       "default_operator" : "and",
>       "analyze_wildcard" : true
>     }
>   }
> }
>
> from my past work experience with lucene i know that wildcard queries 
> are kind of problematic because they're not analyzed by default.
> (to work around this behaviour i wrote a custom parser that prepares the 
> query string depending on the specific field analyzer before 
> passing it to the lucene query parser)
>
> at first when i noticed the analyze_wildcard parameter/option i thought 
> great/cool! i no longer need my "custom magic parser ,-)", elasticsearch 
> provides built-in support for my problems ... 
>
> when testing the "analyze_wildcard" behaviour with "pure" prefix queries 
> like "marco.kamm@brain.*" it worked like a charm, i.e. it did the same 
> thing i tried to achieve with my 
> custom "pre-parser". the query was "transformed" to sth. like 
> "contact_email:marco.kamm OR contact_email:brain*", which perfectly matches 
> what's in the index ...
>
> but unfortunately testing with "real" wildcard queries like the above 
> "marco.*@brain.net" gives me a query that won't find anything in my 
> situation, because it will be 
> turned into "contact_email:marco*brain.net" and there's no single token 
> in my index that will match (although it gets analyzed). to find some 
> results the query would rather have 
> to be turned into sth. like "contact_email:marco* AND 
> contact_email:brain.net" or "contact_email:marco* AND 
> contact_email:*brain.net" (if the user searches for "marco.*.net") ...
>
> by looking at the source code of 
> org.apache.lucene.queryparser.classic.MapperQueryParser.java (i actually 
> started to dive into the source code by chasing down the "rather small" 
> already mentioned issue
> with the hardcoded boolean.clause OR operator here: 
> https://github.com/elasticsearch/elasticsearch/issues/2183) i realized 
> that there are two different methods for analyzing pure wildcard and prefix 
> queries
> (getPossiblyAnalyzedPrefixQuery and getPossiblyAnalyzedWildcardQuery; i 
> first expected these cases to be handled by the same code) and that's why 
> i'm getting perfect results for prefix queries and, sadly, non-working 
> ones for
> pure wildcard queries ...
>
> i started to experiment/fiddle with the getPossiblyAnalyzedWildcardQuery 
> method by rewriting it to work more like the 
> getPossiblyAnalyzedPrefixQuery method, i.e. 
> instead of generating only a single wildcardquery object with the 
> analyzed string, it builds a boolean query including several wildcardquery 
> objects (splitting on */?) ...
>
> my first tests showed that this would work quite well! ...
>
>  
>
> now my questions:
>
> what do you think about this "approach"? 
>
> do you see any serious drawbacks besides performance?
> i know that using even more wildcards will drastically reduce the search 
> performance, 
> but isn't it better to finally serve some results after quite a long time 
> than to find nothing at all?
>
> (i also know that lucene is not built/optimized for wildcard queries and 
> some cases could be resolved using different analyzers (ngram, reverse), 
> multiple fields etc., 
> but users are used to them, and there could be use cases where such wildcard 
> queries make sense, 
> or where it's not practicable to use keyword analyzers, which won't suffer 
> from such problems, e.g. for longer text etc.)
>
> do you plan to further enhance the getPossiblyAnalyzedWildcardQuery method 
> (although it is stated in the docs that this method makes a best effort)?
>
> (btw. do you also plan to fix the OR operator issue? it could be rather 
> simple: just use the specified parameter)
>
> if my approach is legit, and given that i don't like having to modify the 
> elasticsearch "core" code and rebuild/adapt it with every new release, 
> how/where else
> could i implement such an extension? do i have to write a custom 
> queryparser (maybe one that extends MapperQueryParser) and build my own 
> plugin / rest endpoint?
>
> (i recently found out that there's also a lucene class called 
> AnalyzingQueryParser. maybe i should have used this one instead of writing 
> my own magic-parser. could it be used somehow in elasticsearch?)
>
> is there a possibility to / should i write a feature request for even more 
> best effort on analyzing wildcard queries? PS: i know the wildcard handling 
> issue can be a pain in the a**, and maybe it can only be solved on a best 
> effort basis. but i'm somehow forced to mess around with this because i have 
> to (want to!) port my old lucene stuff to elasticsearch (apart from this 
> issue i think elasticsearch is a great product and i like working with it. 
> this problem lies in the nature of inverted indices and wildcards, or rather 
> analyzers) 
>
>
> sorry for the long, maybe confusing mail, but i need your expert 
> thoughts/advice about this wildcard issue
>
> thank you 
> regards marco
>  
>
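
the mismatch described in the quoted message can be reproduced without lucene, assuming (as stated there) that the index holds the two terms marco.kamm and brain.net and that a wildcard pattern is matched against each indexed term on its own. python's fnmatch is only used here to imitate */? matching; matching_terms is a made-up helper.

```python
from fnmatch import fnmatchcase

# terms assumed to be in the index for the contact_email field
INDEX_TERMS = ["marco.kamm", "brain.net"]


def matching_terms(pattern, terms=INDEX_TERMS):
    """Return the indexed terms that a single wildcard pattern matches."""
    return [t for t in terms if fnmatchcase(t, pattern)]


# one wildcard pattern spanning both tokens matches no single term
print(matching_terms("marco*brain.net"))   # prints: []

# one pattern per token finds a term each
print(matching_terms("marco*"))            # prints: ['marco.kamm']
print(matching_terms("*brain.net"))        # prints: ['brain.net']
```

this is why combining the per-token patterns in a boolean query can return hits where the single pattern cannot.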
