Re: Multi-valued xxValue / xxValueSource implementations?

2021-10-27 Thread Greg Miller
Thanks Robert for all your thoughts and context!

> I feel that things like facets apis should really try to move to lower-level 
> apis (DoubleValuesSource, SortedSetDocValues, etc)

Yeah I think this direction generally makes sense. All the cases I can
think of where a user might want to provide custom values (e.g.,
filtering, transforming, etc.) could be solved by allowing users to
pass their own xxDocValues instance into faceting implementations. For
example, if a user wanted to provide some filtering or transformation
on long values before counting them with LongValueFacetCounts, they
could do so by creating their own SortedNumericDocValues /
NumericDocValues implementations and passing them in if the faceting
implementations supported this.
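
A minimal sketch of that idea, in plain Java with no Lucene dependency. MultiLongIterator is a hypothetical stand-in for a SortedNumericDocValues-style iterator (it is not a Lucene class), and ArrayMultiLongIterator, FilteredMultiLongIterator and IteratorFacetSketch are likewise made-up names. The wrapper filters values before the counting loop sees them; because it owns iteration, it also has to skip documents whose values are all rejected.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.LongPredicate;

/** Hypothetical multi-valued doc-values iterator; NOT a Lucene class. */
interface MultiLongIterator {
  int NO_MORE_DOCS = Integer.MAX_VALUE;
  int nextDoc();     // advance to the next doc that has values, or NO_MORE_DOCS
  int valueCount();  // number of values for the current doc
  long nextValue();  // next value for the current doc
}

/** Toy in-memory source: valuesByDoc[doc] holds that doc's values (empty = none). */
final class ArrayMultiLongIterator implements MultiLongIterator {
  private final long[][] valuesByDoc;
  private int doc = -1;
  private int upto;

  ArrayMultiLongIterator(long[][] valuesByDoc) { this.valuesByDoc = valuesByDoc; }

  public int nextDoc() {
    if (doc == NO_MORE_DOCS) return doc;
    doc++;
    while (doc < valuesByDoc.length && valuesByDoc[doc].length == 0) doc++;
    upto = 0;
    if (doc >= valuesByDoc.length) doc = NO_MORE_DOCS;
    return doc;
  }
  public int valueCount() { return valuesByDoc[doc].length; }
  public long nextValue() { return valuesByDoc[doc][upto++]; }
}

/** User-supplied wrapper that drops values rejected by the predicate. */
final class FilteredMultiLongIterator implements MultiLongIterator {
  private final MultiLongIterator in;
  private final LongPredicate keep;
  private long[] buffer = new long[0];
  private int count, upto;

  FilteredMultiLongIterator(MultiLongIterator in, LongPredicate keep) {
    this.in = in;
    this.keep = keep;
  }

  public int nextDoc() {
    int doc;
    // The wrapper leads iteration, so it must skip docs left with no values.
    while ((doc = in.nextDoc()) != NO_MORE_DOCS) {
      int n = in.valueCount();
      if (buffer.length < n) buffer = new long[n];
      count = 0;
      upto = 0;
      for (int i = 0; i < n; i++) {
        long v = in.nextValue();
        if (keep.test(v)) buffer[count++] = v;
      }
      if (count > 0) return doc;
    }
    return NO_MORE_DOCS;
  }
  public int valueCount() { return count; }
  public long nextValue() { return buffer[upto++]; }
}

final class IteratorFacetSketch {
  /** Counts every value of every doc the iterator visits. */
  static Map<Long, Integer> countAll(MultiLongIterator values) {
    Map<Long, Integer> counts = new TreeMap<>();
    while (values.nextDoc() != MultiLongIterator.NO_MORE_DOCS) {
      int n = values.valueCount();
      for (int i = 0; i < n; i++) counts.merge(values.nextValue(), 1, Integer::sum);
    }
    return counts;
  }
}
```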

The only possible gap I see here is that implementing xxDocValues
requires the ability to provide iteration over the documents
themselves, whereas xxValuesSource doesn't. So if there was some case
where a user wanted to provide multi-valued data but couldn't provide
document iteration, that might be an issue. It's a bit of a funny
limitation since faceting doesn't need the value source to lead
iteration, so I could see a multi-valued version of something like
LongValuesSource maybe being a better fit.
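
The non-iterating shape could look roughly like the sketch below. MultiLongValues is a hypothetical interface, not an existing Lucene API; its advanceExact method is modeled on the contract of the single-valued LongValues class, and all other names are invented for illustration. The faceting loop leads iteration over the hits and only asks the source for the values of a given document, so a filtering wrapper never has to find the "next" document itself.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.LongPredicate;

/** Hypothetical multi-valued, per-document source; NOT an existing Lucene API. */
interface MultiLongValues {
  boolean advanceExact(int doc); // position on doc; false if it has no values
  int valueCount();              // number of values for the current doc
  long nextValue();              // next value for the current doc
}

final class MultiLongValuesSketch {

  /** Toy in-memory source: valuesByDoc[doc] holds that doc's values. */
  static MultiLongValues fromArrays(long[][] valuesByDoc) {
    return new MultiLongValues() {
      private long[] current = new long[0];
      private int upto;
      public boolean advanceExact(int doc) {
        current = valuesByDoc[doc];
        upto = 0;
        return current.length > 0;
      }
      public int valueCount() { return current.length; }
      public long nextValue() { return current[upto++]; }
    };
  }

  /** Filtering wrapper: no document iteration needed, only per-doc lookups. */
  static MultiLongValues filter(MultiLongValues in, LongPredicate keep) {
    return new MultiLongValues() {
      private long[] buffer = new long[0];
      private int kept, upto;
      public boolean advanceExact(int doc) {
        kept = 0;
        upto = 0;
        if (!in.advanceExact(doc)) return false;
        int n = in.valueCount();
        if (buffer.length < n) buffer = new long[n];
        for (int i = 0; i < n; i++) {
          long v = in.nextValue();
          if (keep.test(v)) buffer[kept++] = v;
        }
        return kept > 0;
      }
      public int valueCount() { return kept; }
      public long nextValue() { return buffer[upto++]; }
    };
  }

  /** The faceting side leads iteration over the matching doc ids. */
  static Map<Long, Integer> countHits(int[] hits, MultiLongValues values) {
    Map<Long, Integer> counts = new TreeMap<>();
    for (int doc : hits) {
      if (values.advanceExact(doc)) {
        int n = values.valueCount();
        for (int i = 0; i < n; i++) counts.merge(values.nextValue(), 1, Integer::sum);
      }
    }
    return counts;
  }
}
```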

Cheers,
-Greg

On Tue, Oct 26, 2021 at 8:03 PM Robert Muir wrote:
>
> On Tue, Oct 26, 2021 at 8:01 PM Robert Muir wrote:
> >
> > Hi Greg, I think the general issue is one of the API, the ValueSource
> > seems really geared at returning values from single-valued fields.
>
> I think this is really the core issue. This ValueSource thing was
> created before the days of docvalues, and in a lot of cases it will do
> inefficient things depending on how you hold it.
>
> I feel that things like facets apis should really try to move to
> lower-level apis (DoubleValuesSource, SortedSetDocValues, etc)
>
> Turn the problem around from push to pull: now if you want to
> give "computed field" or similar inputs to faceting (e.g. some kind of
> filtering-on-the-fly), you have the chance to implement it
> efficiently.
> The expressions module switched away from this ValueSource to a
> DoubleValues/DoubleValuesSource already, though I didn't follow
> specific reasons why.
> Maybe similar approaches apply to all the numerics.
>
> As far as the strings, personally, I'm not sure what a ValueSource API
> that "filters/transforms" terms should look like. Seems slow no matter
> how you do it. But maybe fresh ideas are needed.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Glove dictionary?

2021-10-27 Thread Michael Sokolov
Yes, I copied some data from those GloVe files into the
knn-token-vectors in the demo module.

On Wed, Oct 27, 2021 at 2:38 PM Dawid Weiss wrote:
>
> I'm looking at licenses/pddl-10.txt, trying to figure out what it
> applies to. I see this comment:
>
> * The vector dictionary used in the demo is taken from the GloVe
> project hosted at
> * https://nlp.stanford.edu/projects/glove, whose data is in the public
> domain, as described by
> * http://opendatacommons.org/licenses/pddl/1.0, available in the
> Lucene distribution as
> * lucene/licenses/pddl-10.txt.
>
> But I don't think we're using the data anywhere? Is this the test
> resource knn-token-vectors that this applies to? Mike (Sokolov) - do
> you know?
>
> Dawid



Glove dictionary?

2021-10-27 Thread Dawid Weiss
I'm looking at licenses/pddl-10.txt, trying to figure out what it
applies to. I see this comment:

* The vector dictionary used in the demo is taken from the GloVe
project hosted at
* https://nlp.stanford.edu/projects/glove, whose data is in the public
domain, as described by
* http://opendatacommons.org/licenses/pddl/1.0, available in the
Lucene distribution as
* lucene/licenses/pddl-10.txt.

But I don't think we're using the data anywhere? Is this the test
resource knn-token-vectors that this applies to? Mike (Sokolov) - do
you know?

Dawid




Re: Is it Time to Deprecate the Legacy Facets API

2021-10-27 Thread Ishan Chattopadhyaya
> Personally I'd love to see us stop maintaining the duplicated code of
> the underlying implementations.  I wouldn't mind losing the legacy
> syntax as well - I'll take a clear, verbose API over a less-clear,
> concise one any day.  But I'm probably a minority there.

+1, agree with Jason here, fully.


Re: Is it Time to Deprecate the Legacy Facets API

2021-10-27 Thread Ishan Chattopadhyaya
Should we deprecate classic faceting in 9x now?

> It's worth investigating deprecating the stats component also. I believe
> JSON facets covers that functionality as well. It will be painful for users
> though to switch over unfortunately.

+1, let's deprecate the stats component too.


On Thu, Jan 28, 2021 at 5:22 AM Joel Bernstein wrote:

> It's worth investigating deprecating the stats component also. I believe
> JSON facets covers that functionality as well. It will be painful for users
> though to switch over unfortunately.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Fri, Jan 22, 2021 at 1:14 PM Jason Gerlowski wrote:
>
>> Personally I'd love to see us stop maintaining the duplicated code of
>> the underlying implementations.  I wouldn't mind losing the legacy
>> syntax as well - I'll take a clear, verbose API over a less-clear,
>> concise one any day.  But I'm probably a minority there.
>>
>> Either way I agree with Michael when he said above that the first step
>> would have to be a parity investigation for features and performance.
>>
>> Best,
>>
>> Jason
>>
>> On Fri, Jan 22, 2021 at 10:05 AM Michael Gibney wrote:
>> >
>> > I agree it would make long-term sense to consolidate the backend
>> > implementation. I think leaving the "classic" user-facing facet API (with
>> > the JSON Facet module as a backend) would be a good idea. Either way, I think a
>> > first step would be checking for parity between existing backend
>> > implementations -- possibly in terms of features [1], but certainly in
>> > terms of performance for common use cases [2].
>> >
>> > I think removal of the "classic" user-facing API would cause a lot of
>> > consternation in the user community. I can even see a
>> > non-backward-compatibility argument for preserving the "classic"
>> > user-facing API: it's simpler for simple use cases. _If_ the ultimate goal
>> > is removal of the "classic" user-facing API (not presuming that it is),
>> > that approach could be facilitated in the short term by enticing users
>> > towards "JSON Facet" API ... basically with a "feature freeze" on the
>> > legacy implementation. No new features [3], no new optimizations [4] for
>> > "classic"; concentrate such efforts on JSON Facet. This seems to already be
>> > the de facto case, but it could be a more intentional decision -- e.g. in
>> > [3] it's straightforward to extend the proposed "facet cache" to the
>> > "classic" impl ... but I could see an argument for intentionally not doing
>> > so.
>> >
>> > Robert, I think your concerns about UninvertedField could be addressed
>> > by the `uninvertible="false"` property (currently defaults to "true" for
>> > backward compatibility iiuc; but could default to "false", or at least
>> > provide the ability to set the default for all fields to "false" at node
>> > level solr.xml? -- I know I've wished for the latter!). Also fwiw I'm not
>> > aware of any JSON Facet processors that work with string values in RAM ...
>> > I do think all JSON Facet processors use OrdinalMap now, where relevant.
>> >
>> > [1] https://issues.apache.org/jira/browse/SOLR-14921
>> > [2] https://issues.apache.org/jira/browse/SOLR-14764
>> > [3] https://issues.apache.org/jira/browse/SOLR-13807
>> > [4] https://issues.apache.org/jira/browse/SOLR-10732
>> >
>> > On Fri, Jan 22, 2021 at 12:46 AM Robert Muir wrote:
>> >>
>> >> Do these two options conflate concerns of input format vs. actual
>> >> algorithm? That was always my disappointment.
>> >>
>> >> I feel like the java apis are off here at the lower level, and it
>> >> hurts the user.
>> >> I'm not talking about the input format from the user; I mean the
>> >> execution of the faceting query.
>> >>
>> >> IMO: building top-level caches (e.g. uninvertedfield) or
>> >> on-the-fly-caches (e.g. fieldcache) is totally trappy already.
>> >> But with the uninvertedfield of json facets it does its own thing,
>> >> even if you went thru the trouble to enable docvalues at index time:
>> >> that's sad.
>> >>
>> >> the code by default should not give the user jvm
>> >> heap/garbage-collector hell. If you want to do that to yourself, for a
>> >> totally static index, IMO that should be opt-in.
>> >>
>> >> But for the record, it is no longer just two shitty choices like
>> >> "top-level vs per-segment". There are different field types, e.g.
>> >> numeric types where the per-segment approach works efficiently.
>> >> Then you have the strings, but there is a newish middle ground for
>> >> Strings: OrdinalMap (lucene Multi* interfaces do it) which builds
>> >> top-level integer structures to speed up string-faceting, but doesn't
>> >> need *string values* in ram.
>> >> It is just integers and mostly compresses as deltas. Adrien compresses
>> >> the shit out of it.
>> >>
>> >> So I'd hate for the user to lose the option here of using docvalues to
>> >> keep faceting out of heap memory, which should not be hassling them
>> >> already in 2021.
>> >> Maybe better to refactor the code such that all these concerns aren't
>>