Hello Shai,

Thank you for the feedback! I'll try to answer each of the questions.

> will it change the API in non-backward compatible way, or impact faceted 
> search performance for the common case?

The new API could overload FacetsConfig.build or provide a new method in
TaxonomyWriter to plug in ordinal data. It doesn't have to change the
functionality that already exists. A taxonomy index in the common case would be
indistinguishable before and after this change.
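
To make the backward-compatibility point concrete, here is a rough sketch of
the overload idea. Everything in it is hypothetical: SketchTaxonomyWriter, the
two-argument addCategory and dataFor are illustrative names in a self-contained
stand-in, not Lucene's actual TaxonomyWriter API. The only point is that the
existing call keeps its behaviour and the new overload is opt-in.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a taxonomy writer; NOT Lucene's real API. */
final class SketchTaxonomyWriter {
    private final Map<String, Integer> ordinals = new HashMap<>();
    private final List<Map<String, Long>> ordinalData = new ArrayList<>();

    /** Existing-style entry point: same behaviour as before, no ordinal data. */
    int addCategory(String path) {
        return addCategory(path, Map.of());
    }

    /** Assumed new overload: same ordinal assignment, plus optional numeric data. */
    int addCategory(String path, Map<String, Long> data) {
        Integer existing = ordinals.get(path);
        if (existing != null) {
            return existing; // category already known; its ordinal does not change
        }
        int ordinal = ordinals.size();
        ordinals.put(path, ordinal);
        ordinalData.add(Map.copyOf(data));
        return ordinal;
    }

    /** Data stored for an ordinal; empty when the caller supplied none. */
    Map<String, Long> dataFor(int ordinal) {
        return ordinalData.get(ordinal);
    }

    public static void main(String[] args) {
        SketchTaxonomyWriter writer = new SketchTaxonomyWriter();
        int plain = writer.addCategory("Author/Alice");
        int withData = writer.addCategory("Author/Bob", Map.of("citations", 120L));
        System.out.println(plain + " " + writer.dataFor(plain));
        System.out.println(withData + " " + writer.dataFor(withData));
    }
}
```

Callers that never pass ordinal data go through the one-argument method and
see the same ordinals they would have seen before.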

> Do you intend to support arbitrary signals, or only numeric ones?

This is a crucial question. I'd like to take one small step forward and leave
room for us to make improvements later. There are two approaches we could take
initially, which I think you've already identified in your email:

1. Allow only updatable DocValues as ordinal data. This could become limiting at
some point, but maybe it's a good first solution.

2. Disallow updating ordinal data. New ordinal data can only come in when a new
taxonomy gets built.

For the Amazon product search use case, option 2 is slightly better. We would
build new indexes more often than we would get ordinal data updates. But I'm
not sure what the better option is in the general case. This is where I'd like
feedback from other users. Maybe there's also some other approach I haven't
thought of.

> Have you considered an alternative implementation of pulling that info from 
> another source during retrieval?

Yes, we've considered things like a local database or a separate index.
I haven't done a performance test, but my guess is that having the ordinal
data in the taxonomy is as fast as it gets for use-cases like the faceting
aggregation example in my previous email. Even if that isn't the case, the
taxonomy solution is more convenient and less burdensome from an operational
standpoint.
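
For reference, the aggregation I have in mind is roughly the following. This
is a sketch under assumptions, not the code from my attempt: it uses plain
arrays indexed by ordinal instead of Lucene classes, and AuthorImpactSketch and
impact are made-up names. It assumes the facet counts and the per-ordinal
citation totals are both available as ordinal-indexed arrays at aggregation
time.

```java
/** Sketch of the author-impact aggregation over ordinal-indexed arrays. */
final class AuthorImpactSketch {
    /**
     * @param paperCounts per-ordinal facet counts (papers matched for each author)
     * @param citations   per-ordinal citation totals (the assumed ordinal data)
     * @return citations per paper for each ordinal; 0.0 where an author has no papers
     */
    static double[] impact(int[] paperCounts, long[] citations) {
        if (paperCounts.length != citations.length) {
            throw new IllegalArgumentException("arrays must cover the same ordinals");
        }
        double[] result = new double[paperCounts.length];
        for (int ord = 0; ord < paperCounts.length; ord++) {
            // Avoid dividing by zero for authors with no matching papers.
            result[ord] = paperCounts[ord] == 0
                ? 0.0
                : (double) citations[ord] / paperCounts[ord];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] paperCounts = {4, 0, 5};
        long[] citations = {100L, 7L, 50L};
        for (double value : impact(paperCounts, citations)) {
            System.out.println(value);
        }
    }
}
```

With the ordinal data in the taxonomy, the citations lookup is an array or
DocValues read keyed by the same ordinal the facet counts already use, which
is why I expect it to be hard to beat with an external source.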


I hope that's useful. Thanks again for the feedback,

Stefan

On Thu, 11 May 2023 at 16:53, Shai Erera <ser...@gmail.com> wrote:
>
> Hi Stefan,
>
> This sounds interesting and useful. It's like static scores for Lucene 
> documents, only that we will apply them to ordinals. Since I assume it's not 
> a very common use case though, do you know if this new functionality affects 
> existing use cases? For example, will it change the API in non-backward 
> compatible way, or impact faceted search performance for the common case?
>
> Do you intend to support arbitrary signals, or only numeric ones? Numeric 
> signals will allow you to efficiently update the taxonomy index's ordinal 
> documents without updating the documents themselves (which will change their 
> ordinal!!). Other signals don't support this sort of update (yet), so you 
> might run into the issue of not being able to update them. And at least for 
> the author-citation-signal, that's definitely something you'll want to update 
> (unless you rebuild the index from time to time, when the signals are 
> updated).
>
> Have you considered an alternative implementation of pulling that info from 
> another source during retrieval? Just curious what would be the performance 
> implications, since an alternative source can give you the flexibility of 
> supporting other signals which are more complicated to update, but won't 
> affect the taxonomy index.
>
> Generally though, I don't see a reason not to support it.
>
> Shai
>
> On Thu, May 11, 2023 at 1:03 PM Stefan Vodita <stefan.vod...@gmail.com> wrote:
>>
>> Hi everyone,
>>
>> I work on the Lucene product search team at Amazon. We’ve been considering
>> indexing scoring signals for ordinals into the taxonomy, which could reduce
>> index size for some use-cases.
>>
>> Example
>>
>> Let's consider a library of research papers, where each paper is represented
>> by a Lucene document and the paper's author is a facet field in that
>> document. For each author we store the total number of citations. We want to
>> compute a measure of each author's impact, the total number of citations
>> divided by the number of articles published.
>>
>> Implementation
>>
>> Each author will be assigned an ordinal in the taxonomy. Lucene doesn't
>> currently support storing data about an ordinal, but the taxonomy is itself a
>> Lucene index, where each ordinal is represented by a document. Right now, the
>> ordinal document has only a few fields allowing it to model the taxonomy
>> structure, but we could conceivably add arbitrary fields to the ordinal
>> documents. We would index the total number of citations an author has as a
>> DocValue in the corresponding ordinal document.
>>
>> Advantages
>>
>> The alternative would be to denormalize data about the authors and have it on
>> each doc that references that author. This leads to duplication. Since Lucene
>> already has a document representation of the author (the ordinal doc), it
>> makes sense conceptually that data about the author should be associated
>> with the ordinal doc.
>>
>>
>> I'm curious if anyone else has tried something like this and if the approach
>> seems reasonable. I’ve made an attempt to code it and I can open a PR if this
>> sounds like a useful feature.
>>
>> Stefan
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
