IDF is a simple measure to calculate. So, if building a separate index for
each user is not an ideal solution, I suggest calculating the statistics
it needs (document count and per-term document frequency) up front. Just
maintain these statistics for each user, then use them at query time.

At search time, you use these stats in your ranking. One possible way
is to write a similarity wrapper that reads the needed information from
a hash map.
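
As a rough sketch of the bookkeeping side only (plain Java, no Lucene
classes; the class name PerUserIdfStats and its methods are made up for
illustration), something like the following keeps document counts and
document frequencies per <User/DOC_TYPE> pair in hash maps. The formula
is my assumption of the classic TF-IDF form, 1 + ln((docCount + 1) /
(docFreq + 1)); swap in whatever your similarity actually uses. How you
wire the lookup into your Similarity is not shown here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: IDF statistics kept per (user, docType) outside the index. */
final class PerUserIdfStats {
    // (user, docType) -> number of documents in that group
    private final Map<List<String>, Long> docCounts = new HashMap<>();
    // (user, docType, term) -> number of documents in that group containing the term
    private final Map<List<String>, Long> docFreqs = new HashMap<>();

    /** Record one document; call this when the document is indexed. */
    void addDocument(String user, String docType, List<String> terms) {
        docCounts.merge(List.of(user, docType), 1L, Long::sum);
        // Document frequency counts each distinct term once per document.
        for (String term : new HashSet<>(terms)) {
            docFreqs.merge(List.of(user, docType, term), 1L, Long::sum);
        }
    }

    /** Assumed classic form: 1 + ln((docCount + 1) / (docFreq + 1)). */
    double idf(String user, String docType, String term) {
        long docCount = docCounts.getOrDefault(List.of(user, docType), 0L);
        long docFreq = docFreqs.getOrDefault(List.of(user, docType, term), 0L);
        return 1.0 + Math.log((docCount + 1.0) / (docFreq + 1.0));
    }
}
```

The same term then gets a different IDF for each user, e.g.
stats.idf("alice", "mail", "lucene") versus stats.idf("bob", "mail",
"lucene"). Deletes and updates would need matching decrements, which I
have left out.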

Regards
Ameer



On Wed, 4 Dec 2019 at 00:55, Ravikumar Govindarajan <
ravikumar.govindara...@gmail.com> wrote:

> >
> > it is enough to give each its own field.
> >
>
> I kind of over-simplified the problem at hand. Apologies.
>
> DOC_TYPE is just one aspect of the problem. The other is that it is
> actually a shared index with multiple users (100-3000 users per
> index). There are many hundreds of such shared indexes in our cluster.
>
> Search happens per-user & it doesn't make sense to have a single IDF. We
> are ideally looking at some lucene extensions/tricks to store & retrieve
> IDF in <User/DOC_TYPE> pairs.
>
> > Is there any reason why you are not storing each DOC_TYPE in its own index?
>
>
> There are some common-fields across all DOC_TYPES (Ex: content/attachment
> et al..)  & to provide unified-search for a user, we colocate them in a
> single index
>
> --
> Ravi
>
> On Tue, Dec 3, 2019 at 6:30 PM Diego Ceccarelli (BLOOMBERG/ LONDON) <
> dceccarel...@bloomberg.net> wrote:
>
> > Hi Ravi,
> > Can you give more details on how you store an entity into lucene? what is
> > a doc type?
> > what fields do you have?
> >
> > Cheers
> >
> > From: java-user@lucene.apache.org At: 12/03/19 12:50:40 To:
> > java-user@lucene.apache.org
> > Subject: Multi-IDF for a single term possible?
> >
> > Hello,
> >
> > We are using TF-IDF for scoring (Yet to migrate to BM25). Different
> > entities (DOC_TYPES) are crunched & stored together in a single index.
> >
> > When it comes to IDF, I find that there is a single value computed across
> > documents & stored as part of TermStats, whereas our documents are not
> > homogeneous. So, a single IDF value doesn't work for us
> >
> > We would like to compute IDF for each <Term/DOC_TYPE> pair, store it &
> > later use the paired-IDF values during query time. Is something like this
> > possible via Codecs or other mechanisms?
> >
> > Any help is much appreciated
> >
> > --
> > Ravi
> >
> >
> >
>
