Re: Understanding Document ID (Lucene 10.0.0)

Michael Froh Fri, 25 Oct 2024 09:53:00 -0700

Hi Prashant,

For your particular use-case, you probably don't need to join across
multiple indices.

Lucene is able to maintain multiple data structures per field, with the
selection of data structures coming from attributes of the field's type. If
you have a field that you want to return, but that doesn't need to be
searchable (like your HTML report), you can add it as an unindexed string
field that's stored (a StoredField, for example). That will write it to the
stored fields data structure (which is used to populate search results), but
won't build a full-text index for it.
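
A minimal sketch of that idea, assuming Lucene 10 on the classpath. The field names come from the question; the sample values, the in-memory ByteBuffersDirectory, and the class and method names are made up for illustration:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical sketch: "report" is stored (returnable) but not indexed (not searchable).
public class StoredOnlyReportSketch {

    // Indexes one sample document and returns two values:
    // [0] the report fetched through a search on "name",
    // [1] the number of hits when searching the "report" field itself.
    public static String[] run() throws Exception {
        try (Directory dir = new ByteBuffersDirectory()) {
            try (IndexWriter writer =
                    new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                // Searchable and stored.
                doc.add(new TextField("name", "Alice", Field.Store.YES));
                doc.add(new TextField("city", "Pune", Field.Store.YES));
                // Stored only: no full-text index is built for this field.
                doc.add(new StoredField("report", "<html><body>Annual report</body></html>"));
                writer.addDocument(doc);
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // StandardAnalyzer lowercases tokens, so the indexed term is "alice".
                TopDocs hits = searcher.search(new TermQuery(new Term("name", "alice")), 1);
                Document found = searcher.storedFields().document(hits.scoreDocs[0].doc);
                // Searching the unindexed field matches nothing.
                int reportHits = searcher.count(new TermQuery(new Term("report", "annual")));
                return new String[] {found.get("report"), Integer.toString(reportHits)};
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String[] result = run();
        System.out.println("report = " + result[0]);
        System.out.println("hits on report field = " + result[1]);
    }
}
```

The report comes back with the search result even though a query against the report field itself finds nothing.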

The slight downside of that approach is that all stored fields for a
document are compressed and written together, so fetching the small fields
can also mean decompressing the large report. If users mostly just want the
name, age, and city fields (and only rarely care about the report field),
then storing it in a separate index might make sense. In that case,
adding an ID keyword field to both indices is a viable option. Doing a term
query on the secondary index to find the appropriate docs should generally
be quite fast -- while Lucene is not primarily a key-value store, it works
surprisingly well as one.
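
A rough sketch of the two-index variant, again assuming Lucene 10. The "id" field name, the "person-1" value, the sample data, and the class and method names are all assumptions for illustration:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical sketch: a main index plus a secondary "report" index, linked by an ID keyword.
public class TwoIndexLookupSketch {

    // Searches the main index by city, then fetches the report from the
    // secondary index with a term query on the shared "id" field.
    public static String run() throws Exception {
        try (Directory mainDir = new ByteBuffersDirectory();
             Directory reportDir = new ByteBuffersDirectory()) {
            try (IndexWriter writer =
                    new IndexWriter(mainDir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                // StringField indexes the whole value as a single, unanalyzed term.
                doc.add(new StringField("id", "person-1", Field.Store.YES));
                doc.add(new TextField("name", "Alice", Field.Store.YES));
                doc.add(new TextField("city", "Pune", Field.Store.YES));
                writer.addDocument(doc);
            }
            try (IndexWriter writer =
                    new IndexWriter(reportDir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                // Same ID value; only needs to be searchable in this index.
                doc.add(new StringField("id", "person-1", Field.Store.NO));
                doc.add(new StoredField("report", "<html><body>Annual report</body></html>"));
                writer.addDocument(doc);
            }
            try (DirectoryReader mainReader = DirectoryReader.open(mainDir);
                 DirectoryReader reportReader = DirectoryReader.open(reportDir)) {
                IndexSearcher mainSearcher = new IndexSearcher(mainReader);
                IndexSearcher reportSearcher = new IndexSearcher(reportReader);

                // Step 1: ordinary search on the main index.
                TopDocs mainHits = mainSearcher.search(new TermQuery(new Term("city", "pune")), 1);
                String id = mainSearcher.storedFields()
                        .document(mainHits.scoreDocs[0].doc).get("id");

                // Step 2: key-value style lookup on the secondary index.
                TopDocs reportHits = reportSearcher.search(new TermQuery(new Term("id", id)), 1);
                return reportSearcher.storedFields()
                        .document(reportHits.scoreDocs[0].doc).get("report");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```

The lookup goes through the application-level id field rather than the internal doc number that Luke displays, since internal doc numbers can change when segments are merged.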

Hope that helps,
Froh

On Fri, Oct 25, 2024 at 8:28 AM Prashant Saxena <animator...@gmail.com>
wrote:

> I'm new to Lucene and trying to understand the concept of unique document
> id, something like a primary key in databases like sql or sqlite etc.
> While searching, I came across this article:
> https://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html
> which is quite old, but it's been said elsewhere that it still applies to
> the latest version, as things have not changed much internally in Lucene.
>
> What have I done so far?
>
> I have created a simple index where a few text files are written as
> documents. When I open this index in Luke (GUI), on the *Document* tab,
> I see an option along with a spinner control:
>
> *Browse document by Doc # |          0| in 100 docs*
>
> If you change the value, it shows the document at the bottom of the GUI.
> The id seems to be a number under which each document is stored.
>
> *Question:*
> How can one access this id?
>
> Why do I need a unique id?
>
> Let's assume I have created a simple index with three fields: name, age &
> city. There is another field, the associated long html text,
> which I am writing in another index. In a GUI environment, users can
> search by typing a search term in any of the four fields:
>
> name   |                                  |
> age    |                                  |
> city   |                                  |
> report |                                  |
>
> Usually, people are interested in the first three fields. The report field
> is not used as much, but it is still available if somebody is interested.
>
> *Option 1*
> When the user is searching only using name, age, city, I'll open the first
> index, do the search, get the documents and their ids, and then get the
> report field directly from the second index using the id. This way
> no searching is required in the second index.
>
> *Option 2*
> I have recently started learning Lucene and haven't touched the joining
> part yet, but here is the question anyway.
> If a user has given a search term in all four fields, then logically
> you have to search in both indexes and find the doc ids common to both
> sets of search results.
>
> *Question*
> How will this joining happen, so that I get the correct results from both
> indexes? If possible, please point me to some online code examples.
>
> Prashant
>
