In Elasticsearch, you can extend the existing queries and filters with a
plugin, using addQuery/addFilter on IndexQueryParserModule.
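
For example, a plugin class can register a custom filter parser. The
following is a minimal sketch from memory of the 1.x plugin API - the class
names and the filter name "my_bitset" are invented, and the exact
registration method (addFilter vs. addFilterParser) should be checked
against the IndexQueryParserModule of your version:

import org.elasticsearch.index.query.IndexQueryParserModule;
import org.elasticsearch.plugins.AbstractPlugin;

// sketch: plugin that registers a custom filter parser under the name "my_bitset"
public class MyBitsetFilterPlugin extends AbstractPlugin {

    @Override
    public String name() {
        return "my-bitset-filter-plugin";
    }

    @Override
    public String description() {
        return "demo plugin that registers a custom bitset filter";
    }

    // ES calls onModule(...) for each module it creates; register the parser here
    public void onModule(IndexQueryParserModule module) {
        module.addFilterParser("my_bitset", MyBitsetFilterParser.class);
    }
}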

Each query or filter comes in a pair of classes, a builder and a parser.

A filter builder manages the syntax: it serializes the filter specification
with the help of the XContent classes into its inner/outer representation.
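
A very condensed sketch of such a builder - the class name, the filter name
"my_bitset" and its parameters are invented for illustration:

import java.io.IOException;

import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.index.query.BaseFilterBuilder;

// sketch: builder that produces { "my_bitset" : { "field" : "...", "set_id" : "..." } }
public class MyBitsetFilterBuilder extends BaseFilterBuilder {

    private final String field;
    private final String setId;   // key of a bit set maintained outside of ES

    public MyBitsetFilterBuilder(String field, String setId) {
        this.field = field;
        this.setId = setId;
    }

    @Override
    protected void doXContent(XContentBuilder builder, Params params) throws IOException {
        builder.startObject("my_bitset");
        builder.field("field", field);
        builder.field("set_id", setId);
        builder.endObject();
    }
}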

A filter parser parses such a structure and turns it into a Lucene Filter
for internal processing.
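
The matching parser skeleton could look roughly like this - again a sketch:
MyBitsetStore stands for however the application looks up its precomputed
bit sets, and MyBitsetFilter is the Lucene Filter wrapper; both are invented
names and sketched further below:

import java.io.IOException;

import org.apache.lucene.search.Filter;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.index.query.FilterParser;
import org.elasticsearch.index.query.QueryParseContext;
import org.elasticsearch.index.query.QueryParsingException;

// sketch: parser that reads { "field" : "...", "set_id" : "..." } and builds a Lucene Filter
public class MyBitsetFilterParser implements FilterParser {

    @Override
    public String[] names() {
        return new String[]{"my_bitset"};
    }

    @Override
    public Filter parse(QueryParseContext parseContext) throws IOException, QueryParsingException {
        XContentParser parser = parseContext.parser();
        String field = null;
        String setId = null;
        String currentFieldName = null;
        XContentParser.Token token;
        while ((token = parser.nextToken()) != XContentParser.Token.END_OBJECT) {
            if (token == XContentParser.Token.FIELD_NAME) {
                currentFieldName = parser.currentName();
            } else if (token.isValue()) {
                if ("field".equals(currentFieldName)) {
                    field = parser.text();
                } else if ("set_id".equals(currentFieldName)) {
                    setId = parser.text();
                }
            }
        }
        // hypothetical lookup of the externally maintained bit sets,
        // wrapped into a Lucene Filter (both sketched further below)
        return new MyBitsetFilter(MyBitsetStore.get(field, setId));
    }
}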

So one approach would be to look at your bit set implementation and how it
can be turned into a Lucene Filter. An instructive place to start is
org.elasticsearch.index.query.TermsFilterParser/TermsFilterBuilder.

An example where terms are read from the fielddata cache and turned into a
filter is org.elasticsearch.index.search.FielddataTermsFilter.

The key method is

public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException
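
Continuing the sketch from above, the invented MyBitsetFilter could override
it like this - assuming the application keeps one FixedBitSet per segment,
keyed by the segment reader's core cache key (how these bit sets are built
and loaded is left open):

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.BitsFilteredDocIdSet;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.FixedBitSet;

// sketch: Lucene Filter backed by precomputed per-segment bit sets
public class MyBitsetFilter extends Filter {

    private final Map<Object, FixedBitSet> bitsPerSegment;

    public MyBitsetFilter(Map<Object, FixedBitSet> bitsPerSegment) {
        this.bitsPerSegment = bitsPerSegment;
    }

    @Override
    public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException {
        FixedBitSet bits = bitsPerSegment.get(context.reader().getCoreCacheKey());
        if (bits == null) {
            return null; // no matching documents in this segment
        }
        // FixedBitSet is itself a DocIdSet in Lucene 4.x; respect deleted/filtered docs
        return BitsFilteredDocIdSet.wrap(bits, acceptDocs);
    }
}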

An example of caching filters is
org.elasticsearch.indices.cache.filter.terms.IndicesTermsFilterCache
(filter caching in ES is done with Guava's cache classes).
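
Along those lines, the invented MyBitsetStore from the parser sketch above
could be a simple Guava cache - the key scheme and sizing here are arbitrary
illustration values:

import java.util.Map;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.util.FixedBitSet;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

// sketch: application-side store of precomputed bit sets, held in a Guava cache;
// a value is one FixedBitSet per segment, keyed by the segment's core cache key
public class MyBitsetStore {

    private static final Cache<String, Map<Object, FixedBitSet>> CACHE = CacheBuilder.newBuilder()
            .maximumSize(1000)                        // bound the number of cached sets
            .expireAfterAccess(30, TimeUnit.MINUTES)  // drop sets that are no longer used
            .build();

    public static Map<Object, FixedBitSet> get(String field, String setId) {
        return CACHE.getIfPresent(field + ":" + setId);
    }

    public static void put(String field, String setId, Map<Object, FixedBitSet> bits) {
        CACHE.put(field + ":" + setId, bits);
    }
}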

Also, it could be helpful to study the helper classes in this context, for
example in the package org.elasticsearch.common.lucene.docset.

I am not aware of an existing filter plugin for this yet, but it is
possible that I could sketch a demo filter plugin on GitHub.

Jörg




On Mon, Jul 7, 2014 at 3:49 PM, Sandeep Ramesh Khanzode <
[email protected]> wrote:

> Hi,
>
> A little clarification:
>
> Assume a sample data set of 50M documents. The documents need to be filtered
> by a field, Field1. However, at indexing time, this field is NOT written to
> the document in Lucene through ES. Field1 is a frequently changing field,
> and hence we would like to maintain it outside.
>
> (The following paragraph can be skipped.)
> Now assume that there are a few such fields, Field1, ..., FieldN. For
> every document in the corpus, the value of Field1 may come from a pool of
> 100-odd values. Thus, for example, one of those 100-odd values may at most
> correspond to 1M documents, and at the other extreme may correspond to
> only 10 documents.
>
>
> (Continue reading) :-)
> I would, at system startup time, make sure that I have loaded all relevant
> BitSets that I plan to use for any filters into memory, so that my cache
> framework is warm and I can look up the relevant filter values for a
> particular query from this cache at query run time. The mechanisms for this
> loading are still unknown, but please assume that this BitSet will be
> readily available at query time.
>
> This BitSet will correspond to the Lucene DocIDs for a particular value
> of Field1 that I want to filter on. I plan to create a custom Lucene Filter
> class that will accept this DocIdSet.
>
> What I am unable to understand is how I can achieve this in ES. I have
> been exploring the different mail threads on this forum, and it seems
> that certain plugins can achieve this. Please see the list below of the
> threads that I could find.
>
> Can you please tell me how an IndexQueryParserModule would serve my use
> case? If you can provide some pointers on writing a plugin that can
> leverage a custom Filter, that would be immensely helpful. Thanks,
>
> 1.
> https://groups.google.com/forum/?fromgroups=#!searchin/elasticsearch/IndexQueryParserModule$20Plugin/elasticsearch/5Gqxx3UvN2s/FL4Lb2RxQt0J
> 2. https://groups.google.com/forum/#!topic/elasticsearch/1jiHl4kngJo
> 3. https://github.com/elasticsearch/elasticsearch/issues/208
> 4.
> http://elasticsearch-users.115913.n3.nabble.com/custom-filter-handler-plugin-td4051973.html
>
> Thanks,
> Sandeep
>
> On Mon, Jul 7, 2014 at 2:17 AM, [email protected] <
> [email protected]> wrote:
>
>> Thanks for being so patient with me :)
>>
>> I understand now the following: there are 50M documents in an external
>> DB, from which up to 1M are to be exported in the form of document
>> identifiers to act as a filter in ES. The idea is to use internal
>> mechanisms like bit sets. There is no API for manipulating filters in ES
>> on that level; ES receives the terms and passes them into the Lucene
>> TermFilter class according to the type of the filter.
>>
>> What is a bit unclear to me: how is the filter set constructed? I assume
>> it should be a select statement on the database?
>>
>> Next, if you have this large set of document identifiers selected, I do
>> not understand what the base query is that you want to apply the filter
>> to. Is there a user-given query for ES? What does such a query look like?
>> Is it assumed there are other documents in ES that are somehow related to
>> the 50M documents? An illustrative example of the steps in the scenario
>> would really help me understand the data model.
>>
>> Just some food for thought: it is close to impossible to filter in ES on
>> 1M unique terms in a single step - the default maximum clause count of a
>> Lucene query is limited to 1024 for good reason. A workaround would be to
>> iterate over the 1M terms, execute about 1000 filter queries, and add up
>> the results. This takes a long time and may not be the desired solution.
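>>
>> With the Java API, such an iteration could look roughly like this - a
>> sketch only; the index and field names come from the caller, and
>> Lists.partition is Guava:
>>
>> import java.util.List;
>>
>> import org.elasticsearch.action.search.SearchResponse;
>> import org.elasticsearch.client.Client;
>> import org.elasticsearch.index.query.FilterBuilders;
>>
>> import com.google.common.collect.Lists;
>>
>> // sketch: one filtered count per chunk of at most 1000 terms, adding up the hits
>> public static long countInChunks(Client client, String index, String field, List<String> terms) {
>>     long total = 0;
>>     for (List<String> chunk : Lists.partition(terms, 1000)) {
>>         SearchResponse response = client.prepareSearch(index)
>>                 .setSize(0) // only counting here, no hits returned
>>                 .setPostFilter(FilterBuilders.termsFilter(field, chunk.toArray(new String[chunk.size()])))
>>                 .execute().actionGet();
>>         total += response.getHits().getTotalHits();
>>     }
>>     return total;
>> }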
>>
>> Fortunately, in most situations, it is possible to find a more concise
>> grouping that reduces the 1M document identifiers to fewer terms for more
>> efficient filtering.
>>
>> Jörg
>>
>>
>>
>> On Sun, Jul 6, 2014 at 9:39 PM, 'Sandeep Ramesh Khanzode' via
>> elasticsearch <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Appreciate your continued assistance. :) Thanks,
>>>
>>> Disclaimer: I have yet to understand the ES sources well enough to
>>> describe my scenario completely. Some of the information below may be conjecture.
>>>
>>> I would have a corpus of 50M docs (actually a lot more, but this is for
>>> testing now), out of which up to 1M DocIds would be used as a filter.
>>> This set of 1M docs can be different for different use cases; the point
>>> being, up to 1M DocIds can form one logical set of documents for filtering
>>> results. If I use a simple IdsFilter from the ES Java API, I would have to
>>> keep adding these 1M docs to the internal List implementation, and I have
>>> a feeling it may not scale very well, as the ids may change per use case
>>> and per combinations within a single use case as well.
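>>>
>>> To illustrate what I mean by the simple approach (FilterBuilders is from
>>> the ES Java API, the type name "mydoc" is made up):
>>>
>>> // one huge ids filter; all ~1M ids end up in a single list
>>> FilterBuilder filter = FilterBuilders.idsFilter("mydoc")
>>>         .addIds("id-1", "id-2" /* ... up to ~1M ids ... */);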
>>>
>>> As I debug the code, I see that the IdsFilter is converted to a Lucene
>>> filter. Lucene filters, on the other hand, operate on a doc ID bit set.
>>> That gels very well with my requirement, since I can scale with BitSets
>>> (I assume).
>>>
>>> If I can find a way to plug this BitSet directly into the Lucene search()
>>> call as a Lucene Filter, bypassing the ES filters using, I don't know,
>>> maybe some sort of plugin, I believe that may support my cause. I assume I
>>> may not get to use the filter cache from ES, but I can probably cache
>>> these BitSets for subsequent use.
>>>
>>> Please let me know. And thanks!
>>>
>>> Thanks,
>>> Sandeep
>>>
>>>
>>> On Saturday, 5 July 2014 01:40:55 UTC+5:30, Jörg Prante wrote:
>>>
>>>> What I understand is that a TermsFilter is required:
>>>>
>>>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-terms-filter.html
>>>>
>>>> and the source of the terms is a DB. That is no problem. The plan is:
>>>> fetch the terms from the DB, build the query (either Java API or JSON) and
>>>> execute it.
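>>>>
>>>> With the Java API, that would be roughly the following - a sketch; the
>>>> index and field names are invented:
>>>>
>>>> import org.elasticsearch.action.search.SearchResponse;
>>>> import org.elasticsearch.client.Client;
>>>> import org.elasticsearch.index.query.FilterBuilders;
>>>> import org.elasticsearch.index.query.QueryBuilders;
>>>>
>>>> // sketch: fetch the terms from the DB as a string array first,
>>>> // then filter a match_all query by them
>>>> public static SearchResponse searchWithDbTerms(Client client, String[] termsFromDb) {
>>>>     return client.prepareSearch("myindex")
>>>>             .setQuery(QueryBuilders.filteredQuery(
>>>>                     QueryBuilders.matchAllQuery(),
>>>>                     FilterBuilders.termsFilter("field1", termsFromDb)))
>>>>             .execute().actionGet();
>>>> }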
>>>>
>>>> What I don't understand is the part with the "quick mapping", Lucene,
>>>> and the doc ids. Lucene doc IDs are not reliable and are not exposed by
>>>> Elasticsearch. Elasticsearch uses its own document identifiers, which are
>>>> stable and augmented with information about the index type they belong
>>>> to, in order to make them unique. But I do not understand why this is
>>>> important in this context.
>>>>
>>>> The Elasticsearch API uses query builders and filter builders to build
>>>> search requests. A "quick mapping" is just fetching the terms from the DB
>>>> as a string array before this API is called.
>>>>
>>>> I also do not understand the role of the number "1M": is this the
>>>> number of fields, or the number of terms? Is it a total number or a
>>>> number per query?
>>>>
>>>> Did I misunderstand anything else? I am not really sure what the
>>>> challenge is...
>>>>
>>>> Jörg
>>>>
>>>>
>>>>
>>>> On Fri, Jul 4, 2014 at 8:55 PM, 'Sandeep Ramesh Khanzode' via
>>>> elasticsearch <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Just to give some background: I will have a large-ish corpus of more
>>>>> than 100M documents indexed. The filters that I want to apply will be on
>>>>> a field that is not indexed. I mean, I prefer not to have it indexed in
>>>>> ES/Lucene since it will be changing frequently. So, for that, I will be
>>>>> maintaining it elsewhere, like in a DB.
>>>>>
>>>>> Every time I have a query, I would want to filter the results by those
>>>>> fields that are not indexed in Lucene. And I am guessing that number may
>>>>> well be more than 1M. In that case, I think that since we will maintain
>>>>> some sort of TermsFilter, it may not scale linearly. What I would want to
>>>>> do, preferably, is to have a hook inside the ES query, so that I can, at
>>>>> query time, inject the required filter values. Since the filter values
>>>>> have to be recognized by Lucene, and I will not be indexing them, I will
>>>>> need to do some quick mapping to get those fields and map them quickly to
>>>>> some field in Lucene that I can save in the filter. I am not sure whether
>>>>> we can access and set Lucene DocIDs in the filter, or whether they are
>>>>> even exposed in ES.
>>>>>
>>>>> Please assist with this query. Thanks,
>>>>>
>>>>> Thanks,
>>>>> Sandeep
>>>>>
>>>>>
>>>>> On Thursday, 3 July 2014 21:33:45 UTC+5:30, Jörg Prante wrote:
>>>>>
>>>>>> Maybe I do not fully understand, but in a client, you can fetch the
>>>>>> required filter terms from any external source before a JSON query is
>>>>>> constructed?
>>>>>>
>>>>>> Can you give an example what you want to achieve?
>>>>>>
>>>>>> Jörg
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 3, 2014 at 3:34 PM, 'Sandeep Ramesh Khanzode' via
>>>>>> elasticsearch <[email protected]> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I am new to ES and I have the following requirement:
>>>>>>> I need to specify a list of strings as a filter that applies to a
>>>>>>> specific field in the document. Like what a filter does, but instead of
>>>>>>> sending them in the query, I would like them to be populated from an
>>>>>>> external source, like a DB or something. Can you please guide me to the
>>>>>>> relevant examples or references to achieve this on v1.1.2?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sandeep
>>>>>>>