Re: Converting docid to uid

Marc Davenport Wed, 11 Sep 2024 16:45:42 -0700

Hello,
To close the loop on this: I found that using doc values was on the order
of 10x faster for our purposes.
Here is a snippet that someone might find useful in the future:


        LeafReaderContext context = indexReader.leaves().get(ReaderUtil.subIndex(docId, indexReader.leaves()));
        NumericDocValues uidDocValues = context.reader().getNumericDocValues(LISTING_ID_FIELDNAME);
        if (uidDocValues != null && uidDocValues.advanceExact(docId - context.docBase)) {
            return uidDocValues.longValue();
        }
        // not found, handle error
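
For anyone puzzled by the docId - context.docBase step: doc values are read
per segment, so the index-wide docId has to be mapped to the segment that
contains it and then made segment-local. The following is only a plain-Java
sketch of that arithmetic with no Lucene dependency. The class and method
names are hypothetical, and it assumes the docBase values are sorted and
distinct (no empty segments), which the real ReaderUtil.subIndex does not
require.

```java
import java.util.Arrays;

// Hypothetical illustration of mapping a global docId to a segment-local one.
final class DocBaseSketch {

    // Returns the index of the segment whose docBase is the largest value
    // that does not exceed docId. Assumes docBases is sorted ascending,
    // starts at 0, and contains no duplicates.
    static int subIndex(int docId, int[] docBases) {
        int pos = Arrays.binarySearch(docBases, docId);
        if (pos >= 0) {
            return pos;          // docId is the first document of that segment
        }
        // binarySearch returns -(insertionPoint) - 1 when the key is absent;
        // the owning segment is the one just before the insertion point.
        return -pos - 2;
    }

    // Converts the global docId into the id used inside its segment.
    static int localDocId(int docId, int[] docBases) {
        return docId - docBases[subIndex(docId, docBases)];
    }
}
```

With segments starting at 0, 100 and 250, global docId 260 belongs to the
third segment and becomes local id 10 there.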

Previous approach using stored field for comparison:

        Set<String> listingIdField = ImmutableSet.of(LISTING_ID_FIELDNAME);
        Document d = doc(docId, listingIdField);
        IndexableField value = d.getField(LISTING_ID_FIELDNAME);
        if (value != null && value.numericValue() != null) {
            return value.numericValue().longValue();
        }
        // not found, handle error

I ran the two in parallel for a while to make sure that the resulting value
matched and never had one miss.
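
A side-by-side check like this can be sketched in plain Java as below. This
is not the code that was actually run; the class name is hypothetical, and
the two lookups are passed in as functions so that either retrieval path
(stored field or doc values) could be plugged in.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.IntToLongFunction;

// Hypothetical shadow comparison: serve results from the trusted lookup
// while counting how often the candidate lookup disagrees.
final class ShadowCompare {
    private final IntToLongFunction primary;    // e.g. the stored-field path
    private final IntToLongFunction candidate;  // e.g. the doc-values path
    private final AtomicLong mismatches = new AtomicLong();

    ShadowCompare(IntToLongFunction primary, IntToLongFunction candidate) {
        this.primary = primary;
        this.candidate = candidate;
    }

    // Always returns the primary value; the candidate is only observed.
    long uid(int docId) {
        long expected = primary.applyAsLong(docId);
        long actual = candidate.applyAsLong(docId);
        if (expected != actual) {
            mismatches.incrementAndGet();
        }
        return expected;
    }

    long mismatches() {
        return mismatches.get();
    }
}
```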

We did end up caching the relationship, but the cache is bundled with the
wrapper to the index reader which is switched out at every update/commit.
This has been stable for us.  The cache went in before I was able to switch
retrieving through the numericDocValues. We probably wouldn't have added
the cache if we had already made the switch.
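
For illustration only, tying a cache's lifetime to the reader wrapper could
look roughly like the sketch below. It is plain Java with hypothetical class
names, not our production code; the per-reader lookup is passed in as a
function and stands in for whichever retrieval path is used. Because docIds
are only meaningful for one opened reader, the whole cache is dropped when
the reader is swapped rather than being cleared entry by entry.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.IntToLongFunction;

// Hypothetical wrapper: one instance per opened reader, discarded on reopen.
final class ReaderScopedUidCache {
    private final IntToLongFunction lookup;  // uncached lookup for this reader
    private final ConcurrentHashMap<Integer, Long> uidByDocId = new ConcurrentHashMap<>();

    ReaderScopedUidCache(IntToLongFunction lookup) {
        this.lookup = lookup;
    }

    // Computes the uid once per docId for the lifetime of this reader.
    long uid(int docId) {
        return uidByDocId.computeIfAbsent(docId, id -> lookup.applyAsLong(id));
    }
}

// Hypothetical holder that is updated at every update/commit.
final class CurrentReaderHolder {
    private final AtomicReference<ReaderScopedUidCache> current = new AtomicReference<>();

    // Replacing the wrapper also replaces its cache, so stale docIds vanish.
    void onReaderSwap(IntToLongFunction lookupForNewReader) {
        current.set(new ReaderScopedUidCache(lookupForNewReader));
    }

    long uid(int docId) {
        return current.get().uid(docId);
    }
}
```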

Thanks, Michael et al.!
Marc



On Tue, Aug 6, 2024 at 5:07 PM Michael Sokolov <msoko...@gmail.com> wrote:

> You could switch to DocValues, and it would probably be more efficient
> if you are only retrieving a single stored field but you have a lot of
> other ones in the index since stored fields are stored together and
> have to be decoded together.  As far as visiting every segment on disk
> I'm not sure what you mean -- if you mean Lucene segments this isn't
> really germane since both DocValues and StoredFields are stored in
> segments and the segment visitation pattern is going to be the same
> for both.  Finally re: caching I wouldn't recommend it but the
> approach you describe could be made to work.  To be efficient about it
> you'd want to have a per-segment cache, but this gets into poking
> around in Lucene internals - you might as well just use its data
> structures; they should be pretty efficient and a cache wouldn't
> likely win you very much and just lead to trouble.
>
> On Mon, Aug 5, 2024 at 12:08 PM Marc Davenport
> <madavenp...@cargurus.com.invalid> wrote:
> >
> > Hello,
> > Right now our implementation retrieves our UID for our records from the
> > topdocs by calling IndexSearcher.doc(docid, fieldToLoad) (Deprecated) with
> > the UID as the only field.  I'm looking to replace this with the
> > appropriate call to IndexSearcher.storedFields().  This feels a little
> > inefficient since it could be visiting each segment on disk.  Some of our
> > searches are for an abusive 10k results (I would love to change this, but
> > that's the system as it exists).   If I only want a single field, is there
> > a better way to retrieve these? Should I be retrieving the numeric docvals
> > and advancing the iterator?   I've been looking around the internet for
> > different strategies and I'm kind of muddled with older blog posts vs the
> > current state of the codebase.   Can I cache the relationship between the
> > docId and the UID as long as I clear it whenever I commit?
> > Thank you,
> > Marc
>