Thanks for being so patient with me :)

I understand now the following: there are 50m documents in an external
DB, from which up to 1m are to be exported in the form of document
identifiers to work as a filter in ES. The idea is to use internal
mechanisms like bit sets. There is no API for manipulating filters in ES at
that level; ES receives the terms and passes them into the Lucene TermFilter
class according to the type of the filter.

What is a bit unclear to me: how is the filter set constructed? I assume it
should be a select statement on the database?
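
To make the question concrete, here is a minimal sketch of what I assume
the construction could look like: a select statement yields the
identifiers, and the client wraps them in an ids filter around the user
query. The table doc_filter, its columns, the use case "audit" and the
helper names are all hypothetical, and sqlite3 only stands in for the
external DB.

```python
import json
import sqlite3


def fetch_filter_ids(conn, use_case):
    """Run the select that defines the filter set (hypothetical schema)."""
    rows = conn.execute(
        "SELECT doc_id FROM doc_filter WHERE use_case = ? ORDER BY doc_id",
        (use_case,),
    )
    return [row[0] for row in rows]


def build_ids_filter_query(ids, user_query):
    """Wrap a user query in an ES 1.x 'filtered' query with an 'ids' filter."""
    return {
        "query": {
            "filtered": {
                "query": user_query,
                "filter": {"ids": {"values": ids}},
            }
        }
    }


# In-memory stand-in for the external DB (assumption, for illustration only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_filter (doc_id TEXT, use_case TEXT)")
conn.executemany(
    "INSERT INTO doc_filter VALUES (?, ?)",
    [("a1", "audit"), ("a2", "audit"), ("b1", "billing")],
)

ids = fetch_filter_ids(conn, "audit")
body = build_ids_filter_query(ids, {"match": {"title": "invoice"}})
print(json.dumps(body, indent=2))
```

The printed JSON is what a client would send as the search request body;
this works for a handful of identifiers, not for 1m of them.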

Next, if you have this large set of document identifiers selected, I do not
understand what base query you want to apply the filter to. Is there
a user-given query for ES? What does such a query look like? Is it assumed
that there are other documents in ES that are somehow related to the 50m
documents? An illustrative example of the steps in the scenario would
really help to understand the data model.

Just some food for thought: it is close to impossible to filter in ES on 1m
unique terms in a single step. The default maximum number of clauses in a
Lucene boolean query is, for good reason, limited to 1024. A workaround
would be to split the 1m terms into batches, execute about 1000 filter
queries, and add up the results. This takes a long time and may not be the
desired solution.
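
A rough sketch of that workaround, under the assumption that some callable
executes one filtered query per batch and returns the matching ids. The
names chunked, filtered_search_in_batches and run_query are made up for
illustration, and fake_run_query merely simulates ES.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def filtered_search_in_batches(ids, run_query, batch_size=1000):
    """Run one filter query per batch and union the matching ids.

    `run_query` is a hypothetical callable: it takes a list of ids,
    executes the filtered query, and returns the ids that matched.
    """
    hits = set()
    queries = 0
    for batch in chunked(ids, batch_size):
        hits.update(run_query(batch))
        queries += 1
    return hits, queries


# Stand-in for ES: pretend only even-numbered documents match the user query
def fake_run_query(batch):
    return [doc_id for doc_id in batch if doc_id % 2 == 0]


all_ids = list(range(1_000_000))
hits, queries = filtered_search_in_batches(all_ids, fake_run_query)
print(queries, len(hits))
```

Every batch is a full round trip, so the total time grows with the number
of batches, and scoring or paging across batches is left to the client.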

Fortunately, in most situations, it is possible to find a more concise
grouping that reduces the 1m document identifiers to far fewer values for
more efficient filtering.
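
One possible grouping, purely as an illustration: if the identifiers happen
to be numeric and mostly contiguous (an assumption, I do not know your
data), runs of consecutive ids can be collapsed into a few range filters.
The field name doc_number and both helpers are hypothetical.

```python
def collapse_to_ranges(ids):
    """Collapse integer ids into sorted, inclusive (low, high) runs."""
    ranges = []
    for doc_id in sorted(set(ids)):
        if ranges and doc_id == ranges[-1][1] + 1:
            # Extend the current run by one
            ranges[-1] = (ranges[-1][0], doc_id)
        else:
            # Start a new run
            ranges.append((doc_id, doc_id))
    return ranges


def ranges_to_filter(field, ranges):
    """Build a 'bool' filter whose 'should' clauses are 'range' filters."""
    return {
        "bool": {
            "should": [
                {"range": {field: {"gte": low, "lte": high}}}
                for low, high in ranges
            ]
        }
    }


ids = [1, 2, 3, 4, 10, 11, 12, 42]
ranges = collapse_to_ranges(ids)
flt = ranges_to_filter("doc_number", ranges)
print(ranges)
```

Here eight identifiers shrink to three clauses. Whether real data
compresses this well depends entirely on how the identifiers are assigned;
a category or owner field indexed with each document would be another way
to get the same effect.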

Jörg



On Sun, Jul 6, 2014 at 9:39 PM, 'Sandeep Ramesh Khanzode' via elasticsearch
<[email protected]> wrote:

> Hi,
>
> Appreciate your continued assistance. :) Thanks,
>
> Disclaimer: I have yet to understand the ES sources well enough to depict
> my scenario completely. Some info below may be conjecture.
>
> I would have a corpus of 50M docs (actually a lot more, but this is for
> testing now), out of which I would have, say, up to 1M DocIds to be used as
> a filter. This set of 1M docs can be different for different use cases, the
> point being that up to 1M docIds can form one logical set of documents for
> filtering results.
> If I use a simple IdsFilter from ES Java API, I would have to keep adding
> these 1M docs to the List implementation internally, and I have a feeling
> it may not scale very well as they may change per use case and per some
> combinations internal to a single use case also.
>
> As I debug the code, the IdsFilter will be converted to a Lucene filter.
> Lucene filters, on the other hand, operate on a docId bitset type. That
> gels very well with my requirement, since I can scale with BitSets (I
> assume).
>
> If I can find a way to directly plug this BitSet as a Lucene Filter into
> the Lucene search() call, bypassing the ES filters using, I don't know,
> maybe some sort of plugin, I believe that may support my cause. I assume I
> may not get to use the Filter cache from ES, but probably I can cache these
> BitSets for subsequent use.
>
> Please let me know. And thanks!
>
> Thanks,
> Sandeep
>
>
> On Saturday, 5 July 2014 01:40:55 UTC+5:30, Jörg Prante wrote:
>
>> What I understand is a TermsFilter is required
>>
>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-terms-filter.html
>>
>> and the source of the terms is a DB. That is no problem. The plan is:
>> fetch the terms from the DB, build the query (either Java API or JSON) and
>> execute it.
>>
>> What I don't understand is the part with the "quick mapping", Lucene, and
>> the doc ids. Lucene doc IDs are not reliable and are not exposed by
>> Elasticsearch. Elasticsearch uses its own document identifiers, which are
>> stable and augmented with info about the index type they belong to, in
>> order to make them unique. But I do not understand why this is important in
>> this context.
>>
>> The Elasticsearch API uses query builders and filter builders to build
>> search requests. A "quick mapping" is just fetching the terms from the DB
>> as a string array before this API is called.
>>
>> I also do not understand the role of the number "1M", is this the number
>> of fields, or the number of terms? Is it a total number or a number per
>> query?
>>
>> Did I misunderstand anything more? I am not really sure what is the
>> challenge...
>>
>> Jörg
>>
>>
>>
>> On Fri, Jul 4, 2014 at 8:55 PM, 'Sandeep Ramesh Khanzode' via
>> elasticsearch <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Just to give some background. I will have a large-ish corpus of more
>>> than 100M documents indexed. The filters that I want to apply will be on a
>>> field that is not indexed. I mean, I prefer to not have them indexed in
>>> ES/Lucene since they will be frequently changing. So, for that, I will be
>>> maintaining them elsewhere, like a DB etc.
>>>
>>> Every time I have a query, I would want to filter the results by those
>>> fields that are not indexed in Lucene. And I am guessing that number may
>>> well be more than 1M. In that case, I think, since we will maintain some
>>> sort of TermsFilter, it may not scale linearly. What I would want to do,
>>> preferably, is to have a hook inside the ES query, so that I can, at query
>>> time, inject the required filter values. Since the filter values have to be
>>> recognized by Lucene, and I will not be indexing them, I will need to do
>>> some quick mapping to get those fields and map them quickly to some field
>>> in Lucene that I can save in the filter. I am not sure whether we can
>>> access and set Lucene DocIDs in the filter or whether they are even exposed
>>> in ES.
>>>
>>> Please assist with this query. Thanks,
>>>
>>> Thanks,
>>> Sandeep
>>>
>>>
>>> On Thursday, 3 July 2014 21:33:45 UTC+5:30, Jörg Prante wrote:
>>>
>>>> Maybe I do not fully understand, but in a client, you can fetch the
>>>> required filter terms from any external source before a JSON query is
>>>> constructed?
>>>>
>>>> Can you give an example what you want to achieve?
>>>>
>>>> Jörg
>>>>
>>>>
>>>> On Thu, Jul 3, 2014 at 3:34 PM, 'Sandeep Ramesh Khanzode' via
>>>> elasticsearch <[email protected]> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I am new to ES and I have the following requirement:
>>>>> I need to specify a list of strings as a filter that applies to a
>>>>> specific field in the document. Like what a filter does, but instead of
>>>>> sending them with the query, I would like them to be populated from an
>>>>> external source, like a DB or something. Can you please guide me to the
>>>>> relevant examples or references to achieve this on v1.1.2?
>>>>>
>>>>> Thanks,
>>>>> Sandeep
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "elasticsearch" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>>
>>>>> To view this discussion on the web visit https://groups.google.com/d/
>>>>> msgid/elasticsearch/0093d97d-0f47-48e9-ba19-85b0850eda89%40goo
>>>>> glegroups.com
>>>>> <https://groups.google.com/d/msgid/elasticsearch/0093d97d-0f47-48e9-ba19-85b0850eda89%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>

