hi jörg
 
thank you for your quick response!
 
glad to hear that you agree with me that wildcard analysis could 
be further improved. (prefix support is already great!)
i had already started to look around for other solutions, like writing a 
plugin that uses a custom queryparser or sth. similar. but, provided i'm not 
misinterpreting your answer, improving the getPossiblyAnalyzedWildcardQuery 
method does not sound completely absurd to you, and it's neither the wrong 
place nor the wrong approach 
(otherwise you could have told me that i need to write a plugin or somehow 
register some kind of queryparser subclass, or given me other reasons why 
this method is written the way it is).
 
so for the moment i will stick with "my" improved 
getPossiblyAnalyzedWildcardQuery method and do further testing with more 
data, larger indices etc. to see how it performs (as i initially 
mentioned, i need to "generate" even more wildcards, including leading ones, 
to produce the desired results/matches) ...
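
to make the idea a bit more concrete, here is a rough sketch of the 
splitting step. note that this is python for illustration only, not the java 
change in MapperQueryParser; the function names are made up, and toy_analyze 
is just a crude stand-in for the standard analyzer (an assumption), so the 
real thing may well behave differently:

```python
import re


def toy_analyze(text):
    """Crude stand-in for the standard analyzer (assumption, not the real one):
    lowercase, split on '@' and whitespace, strip dots at the token edges."""
    tokens = (t.strip(".") for t in re.split(r"[@\s]+", text.lower()))
    return [t for t in tokens if t]


def split_wildcard_term(term):
    """Turn one wildcard term into several per-token wildcard patterns."""
    # The capturing group makes re.split keep the wildcard characters, e.g.
    # "marco.*@brain.net" -> ["marco.", "*", "@brain.net"]
    parts = re.split(r"([*?])", term)
    clauses = []
    for i in range(0, len(parts), 2):  # even indices hold the literal chunks
        tokens = toy_analyze(parts[i])
        if not tokens:
            continue
        # A wildcard directly before the chunk sticks to its first token,
        # a wildcard directly after the chunk sticks to its last token.
        if i > 0:
            tokens[0] = parts[i - 1] + tokens[0]
        if i + 1 < len(parts):
            tokens[-1] = tokens[-1] + parts[i + 1]
        clauses.extend(tokens)
    return clauses


def to_query_string(field, term, operator="AND"):
    """Join the per-token patterns into one boolean query string."""
    clauses = split_wildcard_term(term)
    return f" {operator} ".join(f"{field}:{c}" for c in clauses)


print(to_query_string("contact_email", "marco.*@brain.net"))
```

with this sketch "marco.*@brain.net" becomes 
"contact_email:marco* AND contact_email:*brain.net", i.e. the variant that 
always keeps the leading wildcard. deciding when the leading wildcard can be 
dropped is exactly the part that needs more testing.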
 
as soon as i'm convinced of the "improvement" i'll clean up the code and 
try to create a fork so you can have a look at it 
(PS. i need to familiarize myself a bit more with git first, since i'm still 
one of the oldschool svn guys ;-), but i think i will somehow manage to 
fork and commit) ...
 
i would like to help further improve such a great piece of software as 
elasticsearch 
 
cheers marco
 

On Wednesday, 19 November 2014 09:56:43 UTC+1, [email protected] wrote:

>  hi
>
> i have text/email addresses indexed with the standard analyzer. 
>
> e.g.
>
> "marco.kamm@brain.net", which results in two tokens in the index:
>
> [marco.kamm] and [brain.net]
>
> i want to search using query_string query and wildcards like:
>
> {
>   "fields" : ["contact_email"],
>   "query" : {
>     "query_string" : {
>       "query" : "(contact_email:(marco.*@brain.net))",
>       "default_operator" : "and",
>       "analyze_wildcard" : true
>     }
>   }
> }
>
> from my past experience with lucene i know that wildcard queries 
> are kind of problematic because they're not analyzed by default.
> (to work around this behaviour i wrote a custom parser that prepares the 
> query string according to the specific field analyzer before 
> passing it to the lucene query parser)
>
> at first when i noticed the analyze_wildcard parameter/option i thought 
> great/cool! i no longer need my "custom magic parser ,-)", elasticsearch 
> provides built-in support for my problems ... 
>
> when testing the "analyze_wildcard" behaviour with "pure" prefix queries 
> like "marco.kamm@brain.*" it worked like a charm, i.e. it did the same 
> thing i tried to achieve with my 
> custom "pre-parser". the query was "transformed" to sth. like 
> "contact_email:marco.kamm OR contact_email:brain*", which perfectly matches 
> what's in the index ...
>
> but unfortunately, testing with "real" wildcard queries like the above 
> "marco.*@brain.net" gives me a query that won't find anything in my 
> situation, because it is 
> turned into "contact_email:marco*brain.net" and there's no single token 
> in my index that matches it (although it gets analyzed). to find some 
> results the query would rather have 
> to be turned into sth. like "contact_email:marco* AND 
> contact_email:brain.net" or "contact_email:marco* AND contact_email:*brain.net" 
> (if the user searches for "marco.*.net") ...
>
> by looking at the source code of 
> org.apache.lucene.queryparser.classic.MapperQueryParser.java (i actually 
> started to dive into the source code while chasing down the "rather small", 
> already mentioned issue 
> with the hardcoded boolean clause OR operator here: 
> https://github.com/elasticsearch/elasticsearch/issues/2183) i realized 
> that there are two different methods for analyzing pure wildcard and prefix 
> queries 
> (getPossiblyAnalyzedPrefixQuery and getPossiblyAnalyzedWildcardQuery; i 
> first expected these cases to be handled by the same code), and that's why 
> i'm getting perfect results for prefix queries and sadly non-working 
> ones for 
> pure wildcard queries ...
>
> i started to experiment/fiddle with the getPossiblyAnalyzedWildcardQuery 
> method by rewriting it to work more like the 
> getPossiblyAnalyzedPrefixQuery method, i.e. 
> instead of generating only a single wildcardquery object from the 
> analyzed string, it builds a boolean query containing several wildcardquery 
> objects (splitting on */?) ...
>
> my first tests showed that this would work quite well! ...
>
>  
>
> now my questions:
>
> what do you think about this "approach"? 
>
> do you see any serious drawbacks besides performance? 
> i know that using even more wildcards will drastically reduce search 
> performance, 
> but isn't it better to finally serve some results after quite a long time 
> than to find nothing at all?
>
> (i also know that lucene is not built/optimized for wildcard queries and 
> that some cases could be resolved using different analyzers (ngram, reverse), 
> multiple fields etc., 
> but users are used to them, and there could be use cases where such wildcard 
> queries make sense, 
> or where it's not practicable to use keyword analyzers, which don't suffer 
> from such problems, e.g. for longer text etc.) 
>
> do you plan to further enhance the getPossiblyAnalyzedWildcardQuery method 
> (although the docs state that this method only makes a best effort)?
>
> (btw. do you also plan to fix the OR operator issue? it could be rather 
> simple: just use the specified parameter)
>
> if my approach is legit, and given that i don't like having to modify the 
> elasticsearch "core" code and rebuild/adapt it with every new release, 
> how/where else 
> could i implement such an extension? do i have to write a custom 
> queryparser (maybe one that extends MapperQueryParser) and build my own 
> plugin / rest endpoint?
>
> (i recently found out that there's also a lucene class called 
> AnalyzingQueryParser. maybe i should have used this one instead of writing 
> my own magic-parser. can it be used somehow in elasticsearch?)
>
> is there a possibility to / should i write a feature request for even more 
> best effort in analyzing wildcard queries? PS. i know the wildcard handling 
> issue can be a pain in the a**, and maybe it can only be solved on a best 
> effort basis. but i'm somehow forced to mess around with this because i have 
> to (want to!) port my old lucene stuff to elasticsearch. (apart from this 
> issue i think elasticsearch is a great product and i like working with it. 
> this problem lies in the nature of inverted indices and wildcards and 
> analyzers.) 
>
>
> sorry for the long, maybe confusing mail, but i need your expert 
> thoughts/advice about this wildcard issue
>
> thank you 
> regards marco
>  
>
