hi jörg

thank you for your quick response! glad to hear that you agree that wildcard analysis could be further improved (the prefix support is already great!).

i had already started to look around for other solutions, like writing a plugin that uses a custom query parser. but, provided i'm not misinterpreting your answer, improving the getPossiblyAnalyzedWildcardQuery method does not sound completely absurd to you, and it is not the wrong place or the wrong approach. (you could also have told me that i need to write a plugin, or somehow register a query parser subclass, or given other reasons why this method is written the way it is.)

so for the moment i will stick with "my" improved getPossiblyAnalyzedWildcardQuery method and do further testing with more data and larger indices to see how it performs. (as i mentioned initially, i need to "generate" even more wildcards, including leading ones, to produce the desired matches.)

as soon as i'm convinced of the "improvement" i'll clean up the code and try to create a fork so you can have a look at it. (PS: i need to familiarize myself a bit more with git first, since i'm still one of the old-school svn guys ;-), but i think i will manage a fork and a commit somehow.)

i would like to help further improve such a great product as elasticsearch.

cheers
marco

On Wednesday, 19 November 2014 09:56:43 UTC+1, [email protected] wrote:
> hi
>
> i have text/email addresses indexed with the standard analyzer.
>
> e.g. "[email protected]" results in two tokens in the index:
>
> [marco.kamm] and [brain.net]
>
> i want to search using the query_string query and wildcards like:
>
> {
>   "fields": ["contact_email"],
>   "query": {
>     "query_string": {
>       "query": "(contact_email:(marco.*@brain.net))",
>       "default_operator": "and",
>       "analyze_wildcard": true
>     }
>   }
> }
>
> from my past experience with lucene i know that wildcard queries are kind of problematic because they're not analyzed by default. (to work around this behaviour i wrote a custom parser that prepares the query string depending on the specific field analyzer before passing it to the lucene query parser.)
>
> when i first noticed the analyze_wildcard option i thought: great! i no longer need my "custom magic parser" ;-), elasticsearch provides built-in support for my problem ...
>
> when testing the "analyze_wildcard" behaviour with "pure" prefix queries like "marco.kamm@brain.*" it worked like a charm, i.e. it did the same thing i tried to achieve with my custom "pre-parser". the query was "transformed" to sth. like "contact_email:marco.kamm OR contact_email:brain*", which perfectly matches what's in the index ...
>
> but unfortunately, testing with "real" wildcard queries like the above "marco.*@brain.net" gives me a query that won't find anything in my situation, because it is turned into "contact_email:marco*brain.net", and there's not a single token in my index that matches (although it gets analyzed). to find some results the query would rather have to be turned into sth. like "contact_email:marco* AND contact_email:brain.net", or "contact_email:marco* AND contact_email:*brain.net" (if the user searches for "marco.*.net") ...
>
> by looking at the source code of org.apache.lucene.queryparser.classic.MapperQueryParser.java (i actually started to dive into the source code by chasing down the "rather small" already mentioned issue with the hardcoded boolean.clause OR operator here: https://github.com/elasticsearch/elasticsearch/issues/2183) i realized that there are two different methods for analyzing pure wildcard and prefix queries (getPossiblyAnalyzedPrefixQuery and getPossiblyAnalyzedWildcardQuery; i first expected these cases to be handled by the same code), and that's why i'm getting perfect results for prefix queries and, sadly, non-working ones for pure wildcard queries ...
>
> i started to experiment with the getPossiblyAnalyzedWildcardQuery method by rewriting it to work more like the getPossiblyAnalyzedPrefixQuery method, i.e. instead of generating only a single wildcard query object with the analyzed string, it builds a boolean query including several wildcard query objects (splitting on */?) ...
>
> my first tests showed that this would work quite well! ...
>
> now my questions:
>
> what do you think about this "approach"?
>
> do you see any serious drawbacks besides performance? i know that using even more wildcards will drastically reduce search performance, but isn't it better to finally serve some results after quite a long time than to find nothing at all?
>
> (i also know that lucene is not built/optimized for wildcard queries and that some cases could be solved using different analyzers (ngram, reverse), multiple fields etc. but users are used to them, and there could be use cases where such wildcard queries make sense, or where it's not practicable to use keyword analyzers that don't suffer from such problems, e.g. for longer text.)
>
> do you plan to further enhance the getPossiblyAnalyzedWildcardQuery method (although the docs state that this method makes a best effort)?
>
> (btw.
> do you also plan to fix the OR operator issue? it could be rather simple: just use the specified parameter.)
>
> if my approach is legit, and given that i don't like having to modify the elasticsearch "core" code and rebuild/adapt it with every new release: how/where else could i implement such an extension? do i have to write a custom query parser (maybe one that extends MapperQueryParser) and build my own plugin / rest endpoint ...
>
> (i recently found out that there's also a lucene class called AnalyzingQueryParser. maybe i should have used this one instead of writing my own magic parser. could this be used somehow in elasticsearch?)
>
> is there a possibility to / should i write a feature request for even more best effort on analyzing wildcard queries? PS: i know the wildcard handling issue could be a pain in the a**, and maybe it can only be solved on a best-effort basis. but i'm somehow forced to mess around with this because i have to (want to!) port my old lucene stuff to elasticsearch. (except for this issue i think elasticsearch is a great product and i like working with it. this problem lies in the nature of inverted indices and wildcards and analyzers.)
>
> sorry for the long, maybe confusing mail, but i need your expert thoughts/advice on this wildcard issue
>
> thank you
> regards marco
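
The splitting idea from the quoted message (one wildcard clause per fragment instead of a single wildcard over the whole analyzed string) can be sketched in plain Java without Lucene. This is only a minimal sketch under stated assumptions, not the modified getPossiblyAnalyzedWildcardQuery: the names toyAnalyze, isBoundary, and rewrite are hypothetical, the toy analyzer merely approximates what the standard analyzer does to an email address, and the result is a query string rather than Lucene query objects.

```java
import java.util.ArrayList;
import java.util.List;

class WildcardSplitSketch {

    // Toy stand-in for an analyzer (an assumption, NOT the real standard analyzer):
    // lowercases, splits on '@' and whitespace, and trims dots at token edges.
    static List<String> toyAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase().split("[@\\s]+")) {
            String token = raw.replaceAll("^\\.+|\\.+$", "");
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Characters the toy analyzer treats as token separators.
    static boolean isBoundary(char c) {
        return c == '@' || Character.isWhitespace(c);
    }

    // Splits the term on '*' and '?', analyzes every fragment on its own, and
    // re-attaches each wildcard to the neighbouring token unless a token
    // boundary sits between them. Clauses are joined with AND.
    static String rewrite(String field, String term) {
        List<String> fragments = new ArrayList<>();
        List<Character> wildcards = new ArrayList<>(); // wildcards.get(i) follows fragments.get(i)
        StringBuilder current = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (c == '*' || c == '?') {
                fragments.add(current.toString());
                wildcards.add(c);
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fragments.add(current.toString());

        List<String> clauses = new ArrayList<>();
        for (int i = 0; i < fragments.size(); i++) {
            String fragment = fragments.get(i);
            List<String> tokens = toyAnalyze(fragment);
            if (tokens.isEmpty()) {
                continue; // nothing left after analysis, e.g. an empty fragment
            }
            // Wildcard in front of this fragment becomes a leading wildcard.
            if (i > 0 && !isBoundary(fragment.charAt(0))) {
                tokens.set(0, wildcards.get(i - 1) + tokens.get(0));
            }
            // Wildcard behind this fragment becomes a trailing wildcard.
            if (i < wildcards.size() && !isBoundary(fragment.charAt(fragment.length() - 1))) {
                int last = tokens.size() - 1;
                tokens.set(last, tokens.get(last) + wildcards.get(i));
            }
            for (String token : tokens) {
                clauses.add(field + ":" + token);
            }
        }
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(rewrite("contact_email", "marco.*@brain.net"));
    }
}
```

With these toy rules, "marco.*@brain.net" becomes "contact_email:marco* AND contact_email:brain.net", because the '@' after the wildcard is a token boundary and no leading wildcard is needed. A term such as "marco.*.net" has no boundary there, so the second clause gets a leading wildcard, which is the expensive case mentioned in the thread.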
