In Elasticsearch, you can extend the set of built-in queries and filters from a plugin, using addQuery/addFilter on IndexQueryParserModule.
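The core idea discussed in this thread - exposing a precomputed bit set of Lucene doc IDs as a filter - can be illustrated in plain Java. This is only a sketch of the concept: it uses java.util.BitSet and hand-rolled methods as stand-ins for Lucene's FixedBitSet/DocIdSet/DocIdSetIterator, so the class and method names below are illustrative, not actual Lucene or Elasticsearch API.

```java
import java.util.BitSet;

// Illustration only: mimics the contract of Lucene's DocIdSet/DocIdSetIterator
// using java.util.BitSet as a stand-in for a precomputed filter bit set.
public class BitsetDocIdSet {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final BitSet bits;

    public BitsetDocIdSet(BitSet bits) {
        this.bits = bits;
    }

    // Analogous to DocIdSetIterator.advance(target): returns the first
    // set doc id >= target, or NO_MORE_DOCS if none remains.
    public int advance(int target) {
        int next = bits.nextSetBit(target);
        return next < 0 ? NO_MORE_DOCS : next;
    }

    // What a Filter's getDocIdSet ultimately enables: intersecting the
    // candidate docs of a query with the precomputed bit set. acceptDocs
    // plays the role of Lucene's Bits acceptDocs (null = accept all).
    public BitSet intersect(BitSet candidateDocs, BitSet acceptDocs) {
        BitSet result = (BitSet) candidateDocs.clone();
        result.and(bits);
        if (acceptDocs != null) {
            result.and(acceptDocs);
        }
        return result;
    }

    public static void main(String[] args) {
        BitSet filter = new BitSet();
        filter.set(3);
        filter.set(7);
        filter.set(42);

        BitsetDocIdSet set = new BitsetDocIdSet(filter);
        System.out.println(set.advance(0));  // 3
        System.out.println(set.advance(8));  // 42
        System.out.println(set.advance(43)); // 2147483647 (NO_MORE_DOCS)
    }
}
```

A real implementation would return such an iterator from the getDocIdSet method mentioned below, with the BitSet built from the externally maintained Field1 values.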
Each query or filter comes as a pair of classes: a builder and a parser. A filter builder manages the syntax and the content serialization, with the help of the XContent classes, for the inner/outer representation of the filter specification. A filter parser parses such a structure and turns it into a Lucene Filter for internal processing. So one approach would be to look at your bit set implementation and see how it can be turned into a Lucene Filter.

An instructive example to start from is org.elasticsearch.index.query.TermsFilterParser / TermsFilterBuilder.

An example where terms from the field data cache are read and turned into a filter is org.elasticsearch.index.search.FielddataTermsFilter. The key method is

public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException

An example of caching filters is org.elasticsearch.indices.cache.filter.terms.IndicesTermsFilterCache (filter caching in ES is done with Guava's cache classes). It could also be helpful to study the helper classes in the package org.elasticsearch.common.lucene.docset.

I am not aware of an existing filter plugin, but I could possibly sketch a demo filter plugin on GitHub.

Jörg

On Mon, Jul 7, 2014 at 3:49 PM, Sandeep Ramesh Khanzode <[email protected]> wrote:

> Hi,
>
> A little clarification:
>
> Assume a sample data set of 50M documents. The documents need to be filtered
> by a field, Field1. However, at indexing time, this field is NOT written to
> the document in Lucene through ES. Field1 is a frequently changing field,
> and hence we would like to maintain it outside.
>
> (The following paragraph can be skipped.)
> Now assume that there are a few such fields, Field1, ..., FieldN. For
> every document in the corpus, the value for Field1 may come from a pool of
> 100-odd values.
> Thus, for example, at the high end, Field1 can hold 1M documents that
> correspond to one of the 100-odd values, and at the low end, a value can
> probably correspond to as few as 10 documents.
>
> (Continue reading.) :-)
> I would, at system startup time, make sure that I have loaded all relevant
> BitSets that I plan to use for any Filters in memory, so that my cache
> framework is warm and I can look up the relevant filter values for a
> particular query from this cache at query run time. The mechanisms for this
> loading are still unknown, but please assume that this BitSet will be
> readily available at query time.
>
> This BitSet will correspond to the doc IDs in Lucene for a particular value
> of Field1 that I want to filter on. I plan to create a custom Lucene Filter
> class that will accept this DocIdSet.
>
> What I am unable to understand is how I can achieve this in ES. I have
> been exploring the different mail threads on this forum, and it seems
> that certain plugins can achieve this. Please see the list below that I
> could find on this forum.
>
> Can you please tell me how an IndexQueryParserModule will serve my use
> case? If you can provide some pointers on writing a plugin that can
> leverage a custom Filter, that would be immensely helpful. Thanks.
>
> 1. https://groups.google.com/forum/?fromgroups=#!searchin/elasticsearch/IndexQueryParserModule$20Plugin/elasticsearch/5Gqxx3UvN2s/FL4Lb2RxQt0J
> 2. https://groups.google.com/forum/#!topic/elasticsearch/1jiHl4kngJo
> 3. https://github.com/elasticsearch/elasticsearch/issues/208
> 4. http://elasticsearch-users.115913.n3.nabble.com/custom-filter-handler-plugin-td4051973.html
>
> Thanks,
> Sandeep
>
> On Mon, Jul 7, 2014 at 2:17 AM, [email protected] <[email protected]> wrote:
>
>> Thanks for being so patient with me :)
>>
>> I understand now the following: there are 50M documents in an external
>> DB, from which up to 1M are to be exported in the form of document
>> identifiers to work as a filter in ES.
>> The idea is to use internal mechanisms like bit sets. There is no API for
>> manipulating filters in ES at that level; ES receives the terms and passes
>> them into the Lucene TermFilter class according to the type of the filter.
>>
>> What is a bit unclear to me: how is the filter set constructed? I assume
>> it is a select statement on the database?
>>
>> Next, if you have this large set of document identifiers selected, I do
>> not understand what the base query is that you want to apply the filter
>> to. Is there a user-given query for ES? What does such a query look like?
>> Is it assumed there are other documents in ES that are somehow related to
>> the 50M documents? An illustrative example of the steps in the scenario
>> would really help in understanding the data model.
>>
>> Just some food for thought: it is close to impossible to filter in ES on
>> 1M unique terms in a single step - the default maximum number of clauses
>> in a Lucene query is, for good reason, limited to 1024. A workaround
>> would be to iterate over the 1M terms, execute ~1000 filter queries, and
>> add up the results. This takes a long time and may not be the desired
>> solution.
>>
>> Fortunately, in most situations it is possible to find a more concise
>> grouping that reduces the 1M document identifiers to fewer terms for more
>> efficient filtering.
>>
>> Jörg
>>
>> On Sun, Jul 6, 2014 at 9:39 PM, 'Sandeep Ramesh Khanzode' via
>> elasticsearch <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Appreciate your continued assistance. :) Thanks.
>>>
>>> Disclaimer: I am yet to sufficiently understand the ES sources to
>>> depict my scenario completely. Some info below may be conjecture.
>>>
>>> I would have a corpus of 50M docs (actually a lot more, but this is for
>>> testing now), out of which I would have, say, up to 1M doc IDs to be used
>>> as a filter.
>>> This set of 1M docs can be different for different use cases; the point
>>> being, up to 1M doc IDs can form one logical set of documents for
>>> filtering results. If I use a simple IdsFilter from the ES Java API, I
>>> would have to keep adding these 1M docs to the internal List
>>> implementation, and I have a feeling it may not scale very well, as the
>>> IDs may change per use case, and per some combinations internal to a
>>> single use case as well.
>>>
>>> As I debug the code, the IdsFilter is converted to a Lucene filter.
>>> Lucene filters, on the other hand, operate on a doc-ID bit set type. That
>>> gels very well with my requirement, since I can scale with BitSets (I
>>> assume).
>>>
>>> If I can find a way to directly plug this BitSet in as a Lucene Filter in
>>> the Lucene search() call, bypassing the ES filters using, I don't know,
>>> maybe some sort of a plugin, I believe that may support my cause. I
>>> assume I may not get to use the filter cache from ES, but I can probably
>>> cache these BitSets for subsequent use.
>>>
>>> Please let me know. And thanks!
>>>
>>> Thanks,
>>> Sandeep
>>>
>>> On Saturday, 5 July 2014 01:40:55 UTC+5:30, Jörg Prante wrote:
>>>
>>>> What I understand is that a TermsFilter is required
>>>>
>>>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-terms-filter.html
>>>>
>>>> and the source of the terms is a DB. That is no problem. The plan is:
>>>> fetch the terms from the DB, build the query (either Java API or JSON),
>>>> and execute it.
>>>>
>>>> What I don't understand is the part with the "quick mapping", Lucene,
>>>> and the doc IDs. Lucene doc IDs are not reliable and are not exposed by
>>>> Elasticsearch. Elasticsearch uses its own document identifiers, which are
>>>> stable and augmented with info about the index type they belong to, in
>>>> order to make them unique. But I do not understand why this is important
>>>> in this context.
>>>>
>>>> The Elasticsearch API uses query builders and filter builders to build
>>>> search requests. A "quick mapping" would just be fetching the terms from
>>>> the DB as a string array before this API is called.
>>>>
>>>> I also do not understand the role of the number "1M": is this the
>>>> number of fields, or the number of terms? Is it a total number or a
>>>> number per query?
>>>>
>>>> Did I misunderstand anything else? I am not really sure what the
>>>> challenge is...
>>>>
>>>> Jörg
>>>>
>>>> On Fri, Jul 4, 2014 at 8:55 PM, 'Sandeep Ramesh Khanzode' via
>>>> elasticsearch <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Just to give some background: I will have a large-ish corpus of more
>>>>> than 100M documents indexed. The filters that I want to apply will be on
>>>>> a field that is not indexed. I mean, I prefer not to have them indexed in
>>>>> ES/Lucene, since they will be frequently changing. So I will be
>>>>> maintaining them elsewhere, like in a DB, etc.
>>>>>
>>>>> Every time I have a query, I would want to filter the results by those
>>>>> fields that are not indexed in Lucene. And I am guessing that number may
>>>>> well be more than 1M. In that case, I think, since we will maintain some
>>>>> sort of TermsFilter, it may not scale linearly. What I would want to do,
>>>>> preferably, is to have a hook inside the ES query, so that I can, at
>>>>> query time, inject the required filter values. Since the filter values
>>>>> have to be recognized by Lucene, and I will not be indexing them, I will
>>>>> need to do some quick mapping to get those fields and map them quickly to
>>>>> some field in Lucene that I can save in the filter. I am not sure whether
>>>>> we can access and set Lucene doc IDs in the filter, or whether they are
>>>>> even exposed in ES.
>>>>>
>>>>> Please assist with this query.
>>>>>
>>>>> Thanks,
>>>>> Sandeep
>>>>>
>>>>> On Thursday, 3 July 2014 21:33:45 UTC+5:30, Jörg Prante wrote:
>>>>>
>>>>>> Maybe I do not fully understand, but in a client, can't you fetch the
>>>>>> required filter terms from any external source before the JSON query
>>>>>> is constructed?
>>>>>>
>>>>>> Can you give an example of what you want to achieve?
>>>>>>
>>>>>> Jörg
>>>>>>
>>>>>> On Thu, Jul 3, 2014 at 3:34 PM, 'Sandeep Ramesh Khanzode' via
>>>>>> elasticsearch <[email protected]> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I am new to ES, and I have the following requirement:
>>>>>>> I need to specify a list of strings as a filter that applies to a
>>>>>>> specific field in the document. Like what a filter does, but instead
>>>>>>> of sending them in the query, I would like them to be populated from
>>>>>>> an external source, like a DB or something. Can you please guide me to
>>>>>>> the relevant examples or references to achieve this on v1.1.2?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sandeep
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "elasticsearch" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/elasticsearch/0093d97d-0f47-48e9-ba19-85b0850eda89%40googlegroups.com.
>>>>>>> For more options, visit https://groups.google.com/d/optout.
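Jörg's workaround above - iterating over the 1M terms in batches below Lucene's default 1024-clause limit, executing a filter query per batch, and adding up the results - can be sketched in plain Java. This is a sketch under assumptions: `runTermsQuery` is a hypothetical placeholder for whatever executes one terms filter against ES and returns matching document IDs, not a real ES API call.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Illustration of the batching workaround: split a large term list into
// chunks below Lucene's default 1024-clause limit, run one filter query
// per chunk, and union the hits. `runTermsQuery` stands in for whatever
// executes a terms filter and returns the matching document ids.
public class BatchedTermsFilter {
    static final int MAX_CLAUSES = 1024;

    public static Set<String> filterInBatches(
            List<String> terms,
            Function<List<String>, Set<String>> runTermsQuery) {
        Set<String> hits = new LinkedHashSet<>();
        for (int i = 0; i < terms.size(); i += MAX_CLAUSES) {
            List<String> batch =
                terms.subList(i, Math.min(i + MAX_CLAUSES, terms.size()));
            hits.addAll(runTermsQuery.apply(batch)); // add up the results
        }
        return hits;
    }
}
```

As Jörg notes, for 1M terms this means roughly a thousand round trips, which is why reducing the identifiers to a more concise grouping (or pushing the bit set down into a custom filter) is usually preferable.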
