Hello,

We have an application dealing with historical books. The books have metadata consisting of event dates and person names, among others. The FullText, Person and Date indexes were split into separate indexes until we realized that for a larger number of documents (400K), combining the hits of the sequential searches took far too long to complete (15 min). The date index was built using the suggestion found at: http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing (big thanks for the hint)
Is there a recommended approach to combining results from different indexes (with different fields)?

The index structure:

MainIndex:
  @ID@ - keyword (document id)
  @FULLTEXT@ - tokenized (used for full text search)
  Ptitle - tokenized (used for full text publication title search)
  Dtitle - tokenized (used for full text document title search)
  Type - keyword (used for document type)

PersonIndex:
  @ID@ - keyword (document id, same as the MainIndex @ID@)
  Person - tokenized (full text person name search)

DateIndex:
  @ID@ - keyword (document id, same as the MainIndex @ID@)
  Date - keyword (date as YYYYMMDD)
  Type - type of date (document date, birth day, etc.)
  @YYYY@ - year of date
  @YYYYMM@ - year and month of date
  @DDD@ - decade
  @CC@ - century of date

E.g.: if I want to search for documents that contain person "John", full text "book" and a date before 06/12/2005:

Step 1: search PersonIndex for John - retrieve all @ID@ values from the hit list
Step 2: search DateIndex for documents that have dates before 06/12/2005 - retrieve all @ID@ values from the hit list
Step 3: search MainIndex for "book" - retrieve all @ID@ values
Step 4: combine all the lists
Step 5: search MainIndex for documents with the @ID@ values from the combined id list

Each search takes less than 1 second, but retrieving @ID@ from the index takes a lot longer, and the time grows with the number of hits. This is because when retrieving a field value from a document hit, the Lucene engine loads all the fields from the index (the entire document). So if one search returns 300,000 hits, I have to iterate through all of them and retrieve the @ID@ field value, which takes a lot of time.

Regards,
Mile Rosu
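
To make Step 4 concrete, here is a minimal sketch of the combination, assuming the @ID@ values of each search have already been collected into lists of strings and that "combine" means keeping only the ids present in every list. The class and method names are hypothetical, the sample ids are invented for illustration, and no Lucene calls are shown.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: not part of Lucene or of the application described above.
class IdListCombiner {

    /** Returns the ids that occur in every one of the given id lists (logical AND). */
    static Set<String> intersect(List<List<String>> idLists) {
        if (idLists.isEmpty()) {
            return new HashSet<>();
        }
        // Start from the first list and keep only ids also found in each later list.
        Set<String> result = new HashSet<>(idLists.get(0));
        for (int i = 1; i < idLists.size() && !result.isEmpty(); i++) {
            // Wrapping in a HashSet keeps each membership check constant time on average.
            result.retainAll(new HashSet<>(idLists.get(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented sample ids standing in for the hit lists of Steps 1 to 3.
        List<String> personIds = Arrays.asList("doc1", "doc2", "doc3");
        List<String> dateIds = Arrays.asList("doc2", "doc3", "doc4");
        List<String> fullTextIds = Arrays.asList("doc3", "doc2", "doc9");

        Set<String> combined = intersect(Arrays.asList(personIds, dateIds, fullTextIds));
        System.out.println(combined);
    }
}
```

The intersection itself is cheap compared with Steps 1 to 3; the cost described above comes from loading each hit's document just to read its @ID@ value.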