Atri, in the abstract it sounds like a great idea, but in practice it will only be as good as the data that drives it. To make this work, I think it would be a good idea to write up a proposal targeting open source projects (or commercial ones, although I doubt you would get many of those) that use Lucene-based search, asking them to contribute their data.
Also, can we learn anything from the previous attempt? What did they try? How can this effort avoid the same pitfalls? Even with document and query data, you still need some kind of relevance ground truth, and this is notoriously difficult to get. Click-through stats are probably the most generic proxy for that. So, as a thought experiment, maybe contact Wikipedia and ask whether they would be willing to share a sample of queries and logs. Or did you have another idea for how to drive this? Then, with one pilot participant, you could maybe get others to join. I think that once you have some commitments, or at least serious expressions of interest, from data providers, you can start to think about what to actually do with the data, but I would start there.

On Mon, Jun 10, 2019, 2:54 AM Atri Sharma <a...@apache.org> wrote:

> Any thoughts on this? I am envisioning applications to machine
> learning systems, where the training dataset might be a small sample
> of the entire dataset, and the user wants scoring to be done only on
> samples of the dataset.
>
> On Fri, Jun 7, 2019 at 5:45 PM Atri Sharma <a...@apache.org> wrote:
> >
> > Hi All,
> >
> > While working on a new Query type, I was inclined to think of a couple
> > of use cases where the documents being scored need not be all of the
> > data set, but a sample of them. This can be useful for very large
> > datasets, where a query is only interested in getting the "feel" of
> > the data, and other queries where the data is being aggregated over
> > time, so a wide enough sample of the data is good enough for the user
> > at the tradeoff of improved performance. Faceting already has sampling
> > mechanisms, so there are ideas to be borrowed from that part.
> >
> > I have some ideas on introducing a new query type and associated
> > semantics to allow this functionality to be present from ground up.
> > Specifically, a query type which wraps another query and "feeds"
> > offsets to the inner query, along with a limit of collection of hits.
> > I can go in more detail, but wanted to get some thoughts and feedback
> > before delving deeper.
> >
> > Atri
>
> --
> Regards,
>
> Atri
> Apache Concerted