Re: analyzing wildcard queries ...

[email protected] Wed, 19 Nov 2014 05:27:06 -0800

Regarding GitHub, you can follow the help at

http://www.elasticsearch.org/contributing-to-elasticsearch/


but if you feel more comfortable, you can also just post a diff/patch
somewhere (preferably against HEAD) with your changes/additions.

This would be enough at least for me to have a first look.

Jörg


On Wed, Nov 19, 2014 at 1:30 PM, <[email protected]> wrote:

> hi jörg
>
> thank you for your quick response!
>
> glad to hear from you that you agree with me that wildcard analysis could
> be further improved. (concerning prefix support it's already great!)
> i already started to look around for other solutions, like writing a plugin
> that uses a custom query parser, but provided i'm not misinterpreting your
> answer, improving the getPossiblyAnalyzedWildcardQuery method does not
> sound completely absurd to you, nor like the wrong place or the wrong
> approach
> (you could also have told me that i need to write a plugin or somehow
> register a query parser subclass, or given some other reason why
> this method is written the way it is)
>
> so for the moment i will stick with "my" improved
> getPossiblyAnalyzedWildcardQuery method and do further testing with more
> data and larger indices to see how it performs (as i initially
> mentioned, i need to "generate" even more wildcards, including leading
> ones, to produce the desired matches) ...
>
> as soon as i'm convinced of the "improvement" i'll clean up the code and
> try to do a fork so you could have a look at it
> (PS: i need to familiarize myself a bit more with git first, since i'm
> still one of the old-school svn guys ;-), but i think i will somehow be
> able to do a fork / commit) ...
>
> i would like to help further improve such a great product as
> elasticsearch
>
> cheers marco
>
>
> On Wednesday, 19 November 2014 09:56:43 UTC+1, [email protected] wrote:
>
>>  hi
>>
>> i have text/email addresses indexed with the standard analyzer.
>>
>> e.g.
>>
>> "[email protected]" that results in two tokens being in the index:
>>
>> [marco.kamm] and [brain.net]
>>
>> i want to search using query_string query and wildcards like:
>>
>> {
>>   "fields" : ["contact_email"],
>>   "query" : {
>>     "query_string" : {
>>       "query" : "(contact_email:(marco.*@brain.net))",
>>       "default_operator" : "and",
>>       "analyze_wildcard" : true
>>     }
>>   }
>> }
>>
>> from my past work with lucene i know that wildcard queries are
>> problematic because they're not analyzed by default.
>> (to work around this behaviour i wrote a custom parser that prepares the
>> query string according to the specific field analyzer before passing it
>> to the lucene query parser)
>>
>> when i first noticed the analyze_wildcard option i thought: great! i no
>> longer need my "custom magic parser" ;-), elasticsearch provides built-in
>> support for my problem ...
>>
>> when testing the "analyze_wildcard" behaviour with "pure" prefix queries
>> like "marco.kamm@brain.*" it worked like a charm, i.e. it did the same
>> thing i tried to achieve with my custom "pre-parser". the query was
>> "transformed" into something like
>> "contact_email:marco.kamm OR contact_email:brain*", which perfectly
>> matches what's in the index ...
>>
>> but unfortunately, testing with "real" wildcard queries like the above
>> "marco.*@brain.net" gives me a query that won't find anything in my
>> situation, because it is turned into "contact_email:marco*brain.net" and
>> there is not a single token in my index that matches it (although it gets
>> analyzed). to find some results, the query would rather have to be turned
>> into something like "contact_email:marco* AND contact_email:brain.net" or
>> "contact_email:marco* AND contact_email:*brain.net" (if the user searches
>> for "marco.*.net") ...
>>
>> by looking at the source code of
>> org.apache.lucene.queryparser.classic.MapperQueryParser.java
>> (i actually started to dive into the source code while chasing down the
>> "rather small" already mentioned issue with the hardcoded boolean clause
>> OR operator here:
>> https://github.com/elasticsearch/elasticsearch/issues/2183)
>> i realized that there are two different methods for analyzing prefix and
>> pure wildcard queries (getPossiblyAnalyzedPrefixQuery and
>> getPossiblyAnalyzedWildcardQuery; i first expected these cases to be
>> handled by the same code), and that's why i'm getting perfect results for
>> prefix queries and, sadly, non-working ones for pure wildcard queries ...
>>
>> i started to experiment with the getPossiblyAnalyzedWildcardQuery method
>> by rewriting it to work more like the getPossiblyAnalyzedPrefixQuery
>> method, i.e. instead of generating only a single wildcard query object
>> from the analyzed string, it builds a boolean query containing several
>> wildcard query objects (splitting on */?) ...
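
As a rough illustration of the splitting idea described above, here is a minimal, self-contained Java sketch. It is not the poster's patch and does not use Lucene or Elasticsearch classes. The class name `WildcardSplitSketch`, the methods `analyze` and `toClauses`, and the toy tokenization rules are all assumptions made for this example; a real implementation would run the field's actual analyzer. The sketch also assumes that each wildcard is attached to the tokens on both sides of it, which corresponds to the more permissive "marco* AND *brain.net" variant mentioned earlier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WildcardSplitSketch {

    // Toy stand-in for a field analyzer (an assumption, NOT Lucene's
    // StandardAnalyzer): lowercases, splits on anything that is not a
    // letter, digit or '.', and trims dots at the edges of each token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}.]+")) {
            String token = raw.replaceAll("^\\.+|\\.+$", "");
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Splits the pattern on '*' and '?', analyzes every literal chunk and
    // re-attaches each wildcard to the tokens directly before and after it.
    // Each returned string stands for one clause of a boolean query.
    static List<String> toClauses(String pattern) {
        List<String> clauses = new ArrayList<>();
        StringBuilder literal = new StringBuilder();
        String pendingPrefix = ""; // wildcard(s) seen right before the current chunk
        for (int i = 0; i <= pattern.length(); i++) {
            boolean atEnd = (i == pattern.length());
            char c = atEnd ? '\0' : pattern.charAt(i);
            boolean wildcard = !atEnd && (c == '*' || c == '?');
            if (!atEnd && !wildcard) {
                literal.append(c);
                continue;
            }
            // A wildcard or the end of the pattern closes the current chunk.
            List<String> tokens = analyze(literal.toString());
            literal.setLength(0);
            for (int t = 0; t < tokens.size(); t++) {
                String clause = tokens.get(t);
                if (t == 0) {
                    clause = pendingPrefix + clause; // wildcard before the chunk
                }
                if (wildcard && t == tokens.size() - 1) {
                    clause = clause + c;             // wildcard after the chunk
                }
                clauses.add(clause);
            }
            if (wildcard) {
                // Consecutive wildcards without a token in between accumulate.
                pendingPrefix = tokens.isEmpty() ? pendingPrefix + c : String.valueOf(c);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(toClauses("marco.*@brain.net"));  // [marco*, *brain.net]
        System.out.println(toClauses("marco.kamm@brain.*")); // [marco.kamm, brain*]
    }
}
```

In a real query parser, each clause string would presumably become a term query or a wildcard query (Lucene's `TermQuery` / `WildcardQuery`) inside a `BooleanQuery`, combined with the configured default operator. Note that a pattern consisting only of wildcards yields no clauses in this sketch, and that the leading wildcards it generates are exactly the expensive kind discussed in this thread.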
>>
>> my first tests showed that this would work quite well! ...
>>
>>
>>
>> now my questions:
>>
>> what do you think about this "approach"?
>>
>> do you see any serious drawbacks besides performance?
>> i know that using even more wildcards will drastically reduce search
>> performance, but isn't it better to finally serve some results after quite
>> a long time than to find nothing at all?
>>
>> (i also know that lucene is not built/optimized for wildcard queries and
>> that some cases could be resolved using different analyzers (ngram,
>> reverse), multiple fields etc., but users are used to wildcards, and there
>> could be use cases where such wildcard queries make sense, or where it's
>> not practicable to use keyword analyzers, which don't suffer from such
>> problems, e.g. for longer text)
>>
>> do you plan to further enhance the getPossiblyAnalyzedWildcardQuery
>> method (although the docs state that this method works on a best-effort
>> basis)?
>>
>> (btw. do you also plan to fix the OR operator issue? it could be rather
>> simple: just use the specified parameter)
>>
>> if my approach is legit, and given that i don't like having to modify the
>> elasticsearch "core" code and rebuild/adapt it with every new release,
>> how/where else could i implement such an extension? do i have to write a
>> custom query parser (maybe one that extends MapperQueryParser) and build
>> my own plugin / rest endpoint?
>>
>> (i recently found out that there's also a lucene class called
>> AnalyzingQueryParser. maybe i should have used this one instead of writing
>> my own magic parser. could it be used somehow in elasticsearch?)
>>
>> is there a possibility to / should i write a feature request for even
>> more best effort on analyzing wildcard queries? PS: i know that wildcard
>> handling can be a pain in the a**, and maybe it can only be solved on a
>> best-effort basis. but i'm somehow forced to mess around with this
>> because i have to (and want to!) port my old lucene stuff to elasticsearch
>> (apart from this issue i think elasticsearch is a great product and i like
>> to work with it. this problem lies in the nature of inverted indices,
>> wildcards and analyzers)
>>
>>
>> sorry for the long, maybe confusing mail, but i need your expert
>> thoughts/advice about this wildcard issue
>>
>> thank you
>> regards marco
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFfXx_P8B2XrYw3WFXGMHDQ9N9bDYTZaEi504YnNUoEBw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
