Re: Getting multi-values to use in filter?

Rob Audenaerde Tue, 29 Apr 2014 00:06:43 -0700

Hi Shai,

I read the article on your blog, thanks for it! Handling multi-values like this
seems a natural fit, and it is helpful indeed. For my specific problem, the
number of values per document is not fixed: it can be anywhere from 0 to 10.
I think the best way to solve this is to encode the number of values as the
first entry in the BDV. That is not hard, so I will take this road.
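
A minimal sketch of such a count-prefixed layout, using only the JDK. The class
name CountPrefixedValues, the fixed-width layout (a 4-byte count followed by
8-byte longs) and the helper names are assumptions made for illustration; they
are not taken from the blog post or from Lucene. The offset/length parameters
of decode are meant to line up with the bytes, offset and length fields of a
BytesRef read from a BinaryDocValues field.

```java
import java.nio.ByteBuffer;

// Hypothetical helper: packs a variable number of longs into one byte[]
// so they can be stored as a single binary doc value per document.
final class CountPrefixedValues {

    // Assumed layout: [int count][long v0][long v1]...[long v(count-1)]
    static byte[] encode(long... values) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + values.length * Long.BYTES);
        buf.putInt(values.length);          // the count goes first
        for (long v : values) {
            buf.putLong(v);
        }
        return buf.array();
    }

    // Reads the count, then exactly that many values, starting at offset.
    static long[] decode(byte[] bytes, int offset, int length) {
        ByteBuffer buf = ByteBuffer.wrap(bytes, offset, length);
        int count = buf.getInt();
        long[] values = new long[count];
        for (int i = 0; i < count; i++) {
            values[i] = buf.getLong();
        }
        return values;
    }

    // In-document aggregate: sum of all values (0 for an empty document).
    static long sum(long[] values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    // In-document aggregate: maximum, or the given fallback when there are no values.
    static long max(long[] values, long missing) {
        if (values.length == 0) {
            return missing;
        }
        long best = values[0];
        for (long v : values) {
            best = Math.max(best, v);
        }
        return best;
    }
}
```

With this layout, a document with the values 60 and 40 encodes to 20 bytes, and
a document without values encodes to just the 4-byte count, so the decoder
never has to guess how many entries follow. A variable-length encoding would be
more compact; the fixed-width form is used here only to keep the sketch short.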


-Rob


> On 27 Apr 2014, at 21:27, Shai Erera <ser...@gmail.com> wrote:
> 
> Hi Rob,
> 
> Your question got me interested, so I wrote a quick prototype of what I
> think solves your problem (and if not, I hope it solves someone else's!
> :)). The idea is to write a special ValueSource, e.g. MaxValueSource which
> reads a BinaryDocValues, decodes the values and returns the maximum one. It
> can then be embedded in an expression quite easily.
> 
> I published a post on Lucene expressions and included some prototype code
> which demonstrates how to do it. Hope it's still helpful to you:
> http://shaierera.blogspot.com/2014/04/expressions-with-lucene.html.
> 
> Shai
> 
> 
>> On Thu, Apr 24, 2014 at 1:20 PM, Shai Erera <ser...@gmail.com> wrote:
>> 
>> I don't think that you should use the facet module. If all you want is to
>> encode a bunch of numbers under a 'foo' field, you can encode them into a
>> byte[] and index them as a BDV. Then at search time you get the BDV and
>> decode the numbers back. The facet module adds complexity here: yes, you
>> get the encoding/decoding for free, but at the cost of adding mock
>> categories to the taxonomy, or using associations, for no good reason IMO.
>> 
>> Once you do that, you need to figure out how to extend the expressions
>> module to support a function like maxValues(fieldName) (cannot use 'max'
>> since it's reserved). I read about it some, and still haven't figured out
>> exactly how to do it. The JavascriptCompiler can take custom functions to
>> compile expressions, but the methods should take only double values. So I
>> think it should be some sort of binding, but I'm not sure yet how to do it.
>> Perhaps it should be a name like max_fieldName, which you add a custom
>> Expression to as a binding ... I will try to look into it later.
>> 
>> Shai
>> 
>> 
>> On Wed, Apr 23, 2014 at 6:49 PM, Rob Audenaerde <rob.audenae...@gmail.com> wrote:
>> 
>>> Thanks for all the questions, gives me an opportunity to clarify it :)
>>> 
>>> I want the user to be able to give a (simple) formula (so I don't know it
>>> beforehand) and use that formula in the search. The Javascript
>>> expressions are really powerful in this use case, but have the
>>> single-value limitation. Ideally, I would like to make it really flexible
>>> by, for example, allowing (in-document aggregating) expressions like:
>>> max(fieldA) - fieldB > fieldC.
>>> 
>>> Currently, using single values, I can handle expressions in the form of
>>> "fieldA - fieldB - fieldC > 0" and evaluate the long-value that I receive
>>> from the FunctionValues and the ValueSource. I also optimize the query by
>>> ensuring the field exists and has a value, etc., to keep the search fast
>>> enough. This works well, but for single values only.
>>> 
>>> I also looked into the facets Association Fields, as they look somewhat
>>> like what I want. In the faceting module, however, all ordinals and
>>> values are stored in one field, so there is no easy way to extract the
>>> fields that are used in the expression.
>>> 
>>> I like the solution you suggested: adding the numeric values as an
>>> encoded byte[] like the facets do, but on a per-field basis, so that
>>> each numeric field has a BDV field that contains all the values of
>>> that field for that document.
>>> 
>>> Now that I am typing this, I think there is another way. I could use the
>>> faceting module and add a different facet field ($facetFIELDA,
>>> $facetFIELDB) in the FacetsConfig for each field. That way it would be
>>> relatively straightforward to get all the values for a field, as they are
>>> exactly all the values in the BDV for that document's facet field. Only
>>> aggregating all facets will be harder, as the TaxonomyFacetSum*Associations
>>> would need to do this for all fields that I need facet counts/sums for.
>>> 
>>> What do you think?
>>> 
>>> -Rob
>>> 
>>> 
>>>> On Wed, Apr 23, 2014 at 5:13 PM, Shai Erera <ser...@gmail.com> wrote:
>>>> 
>>>> A NumericDocValues field can only hold one value. Have you thought about
>>>> encoding the values in a BinaryDocValues field? Or are you talking about
>>>> multiple fields (different names), each has its own single value, and at
>>>> search time you sum the values from a different set of fields?
>>>> 
>>>> If it's one field, multiple values, then why do you need to separate the
>>>> values? Is it because you sometimes sum and sometimes e.g. avg? Do you
>>>> always include all values of a document in the formula, but the formula
>>>> changes between searches, or do you sometimes use only a subset of the
>>>> values?
>>>> 
>>>> If you always use all values, but change the formula between queries, then
>>>> perhaps you can just encode the pre-computed value under different NDV
>>>> fields? If you only use a handful of functions (and they are known in
>>>> advance), it may not be too heavy on the index, and will definitely perform
>>>> better during search.
>>>> 
>>>> Otherwise, I believe I'd consider indexing them as a BDV field. For facets,
>>>> we basically need the same multi-valued numeric field, and given that NDV
>>>> is single valued, we went w/ BDV.
>>>> 
>>>> If I misunderstood the scenario, I'd appreciate it if you clarified it :)
>>>> 
>>>> Shai
>>>> 
>>>> 
>>>> On Wed, Apr 23, 2014 at 5:49 PM, Rob Audenaerde <rob.audenae...@gmail.com> wrote:
>>>> 
>>>>> Hi Shai, all,
>>>>> 
>>>>> I am trying to write that Filter :). But I'm a bit at a loss as to how to
>>>>> efficiently grab the multi-values. I can access the
>>>>> context.reader().document() that accesses the stored fields, but that
>>>>> seems slow.
>>>>> 
>>>>> For single-value fields I use a compiled JavaScript Expression with
>>>>> SimpleBindings as ValueSource, which seems to work quite well. The
>>>>> downside is that I cannot find a way to implement multi-value through
>>>>> that solution.
>>>>> 
>>>>> The bindings create, for example, a LongFieldSource, which uses the
>>>>> FieldCache.LongParser. These parsers only seem to parse one field.
>>>>> 
>>>>> Is there an efficient way to get -all- of the (numeric) values for a
>>>>> field in a document?
>>>>> 
>>>>> 
>>>>>> On Wed, Apr 23, 2014 at 4:38 PM, Shai Erera <ser...@gmail.com> wrote:
>>>>>> 
>>>>>> You can do that by writing a Filter which returns matching documents
>>>>>> based on a sum of the field's values. However I suspect that is going to
>>>>>> be slow, unless you know that you will need several such filters and can
>>>>>> cache them.
>>>>>> 
>>>>>> Another approach would be to write a Collector which serves as a Filter,
>>>>>> but computes the sum only for documents that match the query. Hopefully
>>>>>> that would mean you compute the sum for fewer documents than you would
>>>>>> have w/ the Filter approach.
>>>>>> 
>>>>>> Shai
>>>>>> 
>>>>>> 
>>>>>> On Wed, Apr 23, 2014 at 5:11 PM, Michael Sokolov <
>>>>>> msoko...@safaribooksonline.com> wrote:
>>>>>> 
>>>>>>> This isn't really a good use case for an index like Lucene.  The most
>>>>>>> essential property of an index is that it lets you look up documents
>>>>>>> very quickly based on *precomputed* values.
>>>>>>> 
>>>>>>> -Mike
>>>>>>> 
>>>>>>> 
>>>>>>>> On 04/23/2014 06:56 AM, Rob Audenaerde wrote:
>>>>>>>> 
>>>>>>>> Hi all,
>>>>>>>> 
>>>>>>>> I'm looking for a way to use multi-values in a filter.
>>>>>>>> 
>>>>>>>> I want to be able to search on sum(field)=100, where field has
>>>>>>>> multiple values in one document:
>>>>>>>> 
>>>>>>>> field=60
>>>>>>>> field=40
>>>>>>>> 
>>>>>>>> In this case 'field' is a LongField. I examined the code in the
>>>>>>>> FieldCache,
>>>>>>>> but that seems to focus on single-valued fields only, or
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Is this something that can be done in Lucene? And what would be a
>>>>>>>> good approach?
>>>>>>>> 
>>>>>>>> Thanks in advance,
>>>>>>>> 
>>>>>>>> -Rob
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 

