Re: Multi-valued xxValue / xxValueSource implementations?
Thanks Robert for all your thoughts and context!

> I feel that things like facets apis should really try to move to lower-level
> apis (DoubleValuesSource, SortedSetDocValues, etc)

Yeah I think this direction generally makes sense. All the cases I can think of where a user might want to provide custom values (e.g., filtering, transforming, etc.) could be solved by allowing users to pass their own xxDocValues instance into faceting implementations. For example, if a user wanted to provide some filtering or transformation on long values before counting them with LongValueFacetCounts, they could do so by creating their own SortedNumericDocValues / NumericDocValues implementations and passing them in if the faceting implementations supported this.

The only possible gap I see here is that implementing xxDocValues requires the ability to provide iteration over the documents themselves, whereas xxValuesSource doesn't. So if there was some case where a user wanted to provide multi-valued data but couldn't provide document iteration, that might be an issue. It's a bit of a funny limitation since faceting doesn't need the value source to lead iteration, so I could see a multi-valued version of something like LongValuesSource maybe being a better fit.

Cheers,
-Greg

On Tue, Oct 26, 2021 at 8:03 PM Robert Muir wrote:
>
> On Tue, Oct 26, 2021 at 8:01 PM Robert Muir wrote:
> >
> > Hi Greg, I think the general issue is one of the API, the ValueSource
> > seems really geared at returning values from single-valued fields.
>
> I think really, this is the core issue. This ValueSource thing was
> created before the days of docvalues, in a lot of cases will do
> inefficient things depending on how you hold it.
>
> I feel that things like facets apis should really try to move to
> lower-level apis (DoubleValuesSource, SortedSetDocValues, etc)
>
> Reverse the problem around from push to a pull, now if you want to
> give "computed field" or similar inputs to faceting (e.g. some kind of
> filtering-on-the-fly), you have the chance to implement it
> efficiently.
> The expressions module switched away from this ValueSource to a
> DoubleValues/DoubleValuesSource already, though I didn't follow
> specific reasons why.
> Maybe similar approaches apply to all the numerics.
>
> As far as the strings, personally, I'm not sure what a ValueSource API
> that "filters/transforms" terms should look like. Seems slow no matter
> how you do it. But maybe fresh ideas are needed.

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
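To make the "multi-valued version of something like LongValuesSource" idea a bit more concrete, here is a minimal sketch of a pull-style, multi-valued source that the caller positions per document and that never has to lead iteration. Everything in it is an assumption made for illustration: `MultiLongValues`, `ArrayMultiLongValues`, `FilteredMultiLongValues` and `ToyLongFacetCounts` are hypothetical names, not Lucene classes, and the counting loop is a toy stand-in rather than what LongValueFacetCounts actually does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.LongPredicate;

// Hypothetical interface, NOT part of Lucene: a multi-valued source that is
// positioned by the caller and never needs to iterate over documents itself.
interface MultiLongValues {
  /** Positions on the given doc; returns false if the doc has no values. */
  boolean advanceExact(int doc);

  /** Number of values available for the current doc. */
  int valueCount();

  /** Next value for the current doc; call at most valueCount() times. */
  long nextValue();
}

/** Toy in-memory source backed by an array of per-document values. */
final class ArrayMultiLongValues implements MultiLongValues {
  private final long[][] valuesByDoc;
  private long[] current = new long[0];
  private int upto;

  ArrayMultiLongValues(long[][] valuesByDoc) {
    this.valuesByDoc = valuesByDoc;
  }

  @Override public boolean advanceExact(int doc) {
    current = valuesByDoc[doc];
    upto = 0;
    return current.length > 0;
  }

  @Override public int valueCount() { return current.length; }

  @Override public long nextValue() { return current[upto++]; }
}

/** Wraps another source and keeps only the values accepted by a predicate. */
final class FilteredMultiLongValues implements MultiLongValues {
  private final MultiLongValues in;
  private final LongPredicate keep;
  private final List<Long> buffer = new ArrayList<>();
  private int upto;

  FilteredMultiLongValues(MultiLongValues in, LongPredicate keep) {
    this.in = in;
    this.keep = keep;
  }

  @Override public boolean advanceExact(int doc) {
    buffer.clear();
    upto = 0;
    if (in.advanceExact(doc)) {
      int n = in.valueCount();
      for (int i = 0; i < n; i++) {
        long v = in.nextValue();
        if (keep.test(v)) {
          buffer.add(v);
        }
      }
    }
    // A doc whose values were all filtered out behaves like a doc without values.
    return !buffer.isEmpty();
  }

  @Override public int valueCount() { return buffer.size(); }

  @Override public long nextValue() { return buffer.get(upto++); }
}

/** Toy counter: the matching docs drive iteration, the value source only follows. */
final class ToyLongFacetCounts {
  static Map<Long, Integer> count(MultiLongValues values, int[] matchingDocs) {
    Map<Long, Integer> counts = new TreeMap<>();
    for (int doc : matchingDocs) {
      if (values.advanceExact(doc)) {
        int n = values.valueCount();
        for (int i = 0; i < n; i++) {
          counts.merge(values.nextValue(), 1, Integer::sum);
        }
      }
    }
    return counts;
  }
}
```

The point of the sketch is only that the filtering wrapper needs nothing beyond "position on this doc, hand me its values", so a user who cannot provide document iteration could still plug in computed or filtered multi-valued data.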
Re: Glove dictionary?
Yes, I copied some data from those GloVe files into the knn-token-vectors in the demo module.

On Wed, Oct 27, 2021 at 2:38 PM Dawid Weiss wrote:
>
> I'm looking at licenses/pddl-10.txt, trying to figure out what it
> applies to. I see this comment:
>
> * The vector dictionary used in the demo is taken from the GloVe project hosted at
> * https://nlp.stanford.edu/projects/glove, whose data is in the public domain, as described by
> * http://opendatacommons.org/licenses/pddl/1.0, available in the Lucene distribution as
> * lucene/licenses/pddl-10.txt.
>
> But I don't think we're using the data anywhere? Is this the test
> resource knn-token-vectors that this applies to? Mike (Sokolov) - do
> you know?
>
> Dawid
Glove dictionary?
I'm looking at licenses/pddl-10.txt, trying to figure out what it applies to. I see this comment:

 * The vector dictionary used in the demo is taken from the GloVe project hosted at
 * https://nlp.stanford.edu/projects/glove, whose data is in the public domain, as described by
 * http://opendatacommons.org/licenses/pddl/1.0, available in the Lucene distribution as
 * lucene/licenses/pddl-10.txt.

But I don't think we're using the data anywhere? Is this the test resource knn-token-vectors that this applies to? Mike (Sokolov) - do you know?

Dawid
Re: Is it Time to Deprecate the Legacy Facets API
> Personally I'd love to see us stop maintaining the duplicated code of
> the underlying implementations. I wouldn't mind losing the legacy
> syntax as well - I'll take a clear, verbose API over a less-clear,
> concise one any day. But I'm probably a minority there.

+1, agree with Jason here, fully.

On Wed, Oct 27, 2021 at 8:37 PM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:

> Should we deprecate classic faceting in 9x now?
>
> > It's worth investigating deprecating the stats component also. I believe
> > JSON facets covers that functionality as well. It will be painful for users
> > though to switch over unfortunately.
>
> +1, lets deprecate stats component too.
>
> On Thu, Jan 28, 2021 at 5:22 AM Joel Bernstein wrote:
>
>> It's worth investigating deprecating the stats component also. I believe
>> JSON facets covers that functionality as well. It will be painful for users
>> though to switch over unfortunately.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Fri, Jan 22, 2021 at 1:14 PM Jason Gerlowski wrote:
>>
>>> Personally I'd love to see us stop maintaining the duplicated code of
>>> the underlying implementations. I wouldn't mind losing the legacy
>>> syntax as well - I'll take a clear, verbose API over a less-clear,
>>> concise one any day. But I'm probably a minority there.
>>>
>>> Either way I agree with Michael when he said above that the first step
>>> would have to be a parity investigation for features and performance.
>>>
>>> Best,
>>>
>>> Jason
>>>
>>> On Fri, Jan 22, 2021 at 10:05 AM Michael Gibney wrote:
>>> >
>>> > I agree it would make long-term sense to consolidate the backend
>>> > implementation. I think leaving the "classic" user-facing facet API (with
>>> > JSON Facet module as a backend) would be a good idea.
Re: Is it Time to Deprecate the Legacy Facets API
Should we deprecate classic faceting in 9x now?

> It's worth investigating deprecating the stats component also. I believe JSON facets covers that functionality as well. It will be painful for users though to switch over unfortunately.

+1, lets deprecate stats component too.

On Thu, Jan 28, 2021 at 5:22 AM Joel Bernstein wrote:

> It's worth investigating deprecating the stats component also. I believe
> JSON facets covers that functionality as well. It will be painful for users
> though to switch over unfortunately.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Jan 22, 2021 at 1:14 PM Jason Gerlowski wrote:
>
>> Personally I'd love to see us stop maintaining the duplicated code of
>> the underlying implementations. I wouldn't mind losing the legacy
>> syntax as well - I'll take a clear, verbose API over a less-clear,
>> concise one any day. But I'm probably a minority there.
>>
>> Either way I agree with Michael when he said above that the first step
>> would have to be a parity investigation for features and performance.
>>
>> Best,
>>
>> Jason
>>
>> On Fri, Jan 22, 2021 at 10:05 AM Michael Gibney wrote:
>> >
>> > I agree it would make long-term sense to consolidate the backend implementation. I think leaving the "classic" user-facing facet API (with JSON Facet module as a backend) would be a good idea. Either way, I think a first step would be checking for parity between existing backend implementations -- possibly in terms of features [1], but certainly in terms of performance for common use cases [2].
>> >
>> > I think removal of the "classic" user-facing API would cause a lot of consternation in the user community. I can even see a non-backward-compatibility argument for preserving the "classic" user-facing API: it's simpler for simple use cases. _If_ the ultimate goal is removal of the "classic" user-facing API (not presuming that it is), that approach could be facilitated in the short term by enticing users towards "JSON Facet" API ... basically with a "feature freeze" on the legacy implementation. No new features [3], no new optimizations [4] for "classic"; concentrate such efforts on JSON Facet. This seems to already be the de facto case, but it could be a more intentional decision -- e.g. in [3] it's straightforward to extend the proposed "facet cache" to the "classic" impl ... but I could see an argument for intentionally not doing so.
>> >
>> > Robert, I think your concerns about UninvertedField could be addressed by the `uninvertible="false"` property (currently defaults to "true" for backward compatibility iiuc; but could default to "false", or at least provide the ability to set the default for all fields to "false" at node level solr.xml? -- I know I've wished for the latter!). Also fwiw I'm not aware of any JSON Facet processors that work with string values in RAM ... I do think all JSON Facet processors use OrdinalMap now, where relevant.
>> >
>> > [1] https://issues.apache.org/jira/browse/SOLR-14921
>> > [2] https://issues.apache.org/jira/browse/SOLR-14764
>> > [3] https://issues.apache.org/jira/browse/SOLR-13807
>> > [4] https://issues.apache.org/jira/browse/SOLR-10732
>> >
>> > On Fri, Jan 22, 2021 at 12:46 AM Robert Muir wrote:
>> >>
>> >> Do these two options conflate concerns of input format vs. actual
>> >> algorithm? That was always my disappointment.
>> >>
>> >> I feel like the java apis are off here at the lower level, and it
>> >> hurts the user.
>> >> I don't talk about the input format from the user, instead I mean the
>> >> execution of the faceting query.
>> >>
>> >> IMO: building top-level caches (e.g. uninvertedfield) or
>> >> on-the-fly-caches (e.g. fieldcache) is totally trappy already.
>> >> But with the uninvertedfield of json facets it does its own thing,
>> >> even if you went thru the trouble to enable docvalues at index time:
>> >> that's sad.
>> >>
>> >> the code by default should not give the user jvm
>> >> heap/garbage-collector hell. If you want to do that to yourself, for a
>> >> totally static index, IMO that should be opt-in.
>> >>
>> >> But for the record, it is no longer just two shitty choices like
>> >> "top-level vs per-segment". There are different field types, e.g.
>> >> numeric types where the per-segment approach works efficiently.
>> >> Then you have the strings, but there is a newish middle ground for
>> >> Strings: OrdinalMap (lucene Multi* interfaces do it) which builds
>> >> top-level integers structures to speed up string-faceting, but doesnt
>> >> need *string values* in ram.
>> >> It is just integers and mostly compresses as deltas. Adrien compresses
>> >> the shit out of it.
>> >>
>> >> So I'd hate for the user to lose the option here of using docvalues to
>> >> keep faceting out of heap memory, which should not be hassling them
>> >> already in 2021.
>> >> Maybe better to refactor the code such that all these concerns aren't