Using more than one index

Mile Rosu Mon, 12 Jun 2006 02:22:58 -0700

Hello,

We have an application dealing with historical books. The books have
metadata consisting of event dates and person names, among others.
We kept the FullText, Person and Date indexes separate until we realized
that for a larger number of documents (400K), combining the hits of the
sequential searches took far too long to complete (15 min).
The date index was built using the suggestion found at:
http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing (big
thanks for the hint)


Is there a recommended approach to combining results from different
indexes (with different fields)?

The index structure:
MainIndex:
        Fields:
                @ID@ - keyword (document id)
                @FULLTEXT@ - tokenized (used for full text search)
                Ptitle - tokenized (used for full text publication title search)
                Dtitle - tokenized (used for full text document title search)
                Type - keyword (used for document type)

PersonIndex:
                @ID@ - keyword (document id == [EMAIL PROTECTED]@)
                Person - tokenized (full text person name search)
DateIndex:
                @ID@ - keyword (document id == [EMAIL PROTECTED]@)
                Date - date as YYYYMMDD - keyword
                Type - type of date (document date, birthday, etc.)
                @YYYY@ - year of date
                @YYYYMM@ - year and month of date
                @DDD@ - decade
                @CC@ - century of date
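
To make the DateIndex fields concrete, here is a minimal sketch of how the coarser keyword values could be derived from one YYYYMMDD date. The class name DateKeys and the method derive are hypothetical, and the encodings of @DDD@ (first three digits of the year) and @CC@ (first two digits) are assumptions based only on the field names above; it uses plain Java and no Lucene calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: derives the DateIndex keyword values for one date.
class DateKeys {

    // Returns field name -> keyword value for a date given as YYYYMMDD.
    static Map<String, String> derive(String yyyymmdd) {
        if (yyyymmdd == null || !yyyymmdd.matches("\\d{8}")) {
            throw new IllegalArgumentException("expected YYYYMMDD, got: " + yyyymmdd);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Date", yyyymmdd);
        fields.put("@YYYY@", yyyymmdd.substring(0, 4));   // year
        fields.put("@YYYYMM@", yyyymmdd.substring(0, 6)); // year and month
        // Assumption: the decade is stored as the first three digits of the year.
        fields.put("@DDD@", yyyymmdd.substring(0, 3));
        // Assumption: the century is stored as the first two digits of the year.
        fields.put("@CC@", yyyymmdd.substring(0, 2));
        return fields;
    }

    public static void main(String[] args) {
        // prints {Date=20050612, @YYYY@=2005, @YYYYMM@=200506, @DDD@=200, @CC@=20}
        System.out.println(derive("20050612"));
    }
}
```

With values like these, a query for a whole decade or century can match a single keyword term instead of expanding a range over every individual date.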


E.g.:
If I want to search for documents that contain person "John", full text
"book" and a date before 06/12/2005:
Step 1: search PersonIndex for "John" - retrieve all @ID@ values from
the hit list
Step 2: search DateIndex for documents that have dates before
06/12/2005 - retrieve all @ID@ values from the hit list
Step 3: search MainIndex for "book" - retrieve all @ID@ values
Step 4: combine all the lists
Step 5: search MainIndex for documents with the @ID@ values from the
combined ID list
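
Step 4 can be sketched as follows. This is a simplified illustration, not our exact code: it assumes "combine" means intersecting the ID lists (all three conditions must hold), and the class name IdListCombiner and the sample IDs are made up for the example. It uses plain Java collections only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for step 4: keeps only the IDs present in every hit list.
class IdListCombiner {

    static Set<String> intersect(List<List<String>> idLists) {
        if (idLists.isEmpty()) {
            return new TreeSet<>();
        }
        // Start from the shortest list so the working set stays as small as possible.
        List<List<String>> bySize = new ArrayList<>(idLists);
        bySize.sort(Comparator.comparingInt(List::size));

        Set<String> result = new TreeSet<>(bySize.get(0));
        for (List<String> ids : bySize.subList(1, bySize.size())) {
            // Wrapping in a HashSet gives constant-time lookups during retainAll.
            result.retainAll(new HashSet<>(ids));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> personHits = Arrays.asList("d1", "d2", "d5");     // step 1
        List<String> dateHits = Arrays.asList("d2", "d3", "d5", "d7"); // step 2
        List<String> textHits = Arrays.asList("d5", "d2", "d9");       // step 3

        // prints [d2, d5]
        System.out.println(intersect(Arrays.asList(personHits, dateHits, textHits)));
    }
}
```

The intersection itself is cheap; the expensive part is filling the three input lists with @ID@ values in the first place.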

Each search takes less than 1 second, but retrieving @ID@ from the index
takes a lot longer - the time increases with the number of hits. This is
because when retrieving a field value from a document hit, the Lucene
engine loads all the fields from the index (the entire document). So if
one search returns 300,000 hits, I have to iterate through all of them
and retrieve the @ID@ field value - this takes a lot of time.

Regards,
Mile Rosu

