Hi, 

In our application, we will allow the user to define a primary key for the documents. We are using Lucene 2.9.
When we index the data coming from the client, if the metadata defines a primary key, we have to do a search/update for every row based on that primary key.

Here are our current problems:

1) If the metadata coming from the client defines a primary key (which can span one or more fields), then for the data supplied by the client we have to make sure that a later row overrides any previous row with the same primary key.
2) To do this, we first loop through the data to check whether any later row has the same PK as an earlier one, building a map in memory so the latest row overrides the previous ones (a rough sketch follows this list). This is a very expensive operation.
3) Even then, for every row that survives the filtering step we still have to search the current index to see whether any data with the same PK already exists, so we have to do a remove before we add the new data to the index.
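
To make step 2 concrete, this is roughly what the in-memory override looks like today (Row and pkValue() are placeholders for our own data classes):

import java.util.LinkedHashMap;
import java.util.Map;

// Later rows with the same PK replace earlier ones; insertion order is kept.
Map<String, Row> dedupe(Iterable<Row> rows) {
    Map<String, Row> byPk = new LinkedHashMap<String, Row>();
    for (Row row : rows) {
        byPk.put(row.pkValue(), row);   // the last row with a given PK wins
    }
    return byPk;
}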

I want to know whether anyone has had a similar PK requirement with Lucene. What is the best way to index data in this case?

First, I am wondering whether it is possible to remove step 2 above.
The problem with Lucene is that when we add a document to the index, we can NOT search for it before we commit, but we only commit once, when the whole data file is finished. That is why we have to loop through the data once to check whether any rows in the file share the same PK.
Is there a way for the index writer, before it commits anything, to merge on the PK as new documents are added? What I mean is: if a document with the same PK already exists among the previously added documents, just remove it and let the newly added document take its place. If we can do this, the whole pre-checking step can be removed.
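
Something like the sketch below is what I have in mind. If I read the 2.9 javadocs correctly, IndexWriter.updateDocument(Term, Document) deletes every document containing the term, including ones buffered in the same uncommitted session, and then adds the new document, so it might remove step 2 entirely. The "_pk" field name and the Row helpers are my own placeholders:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

// Sketch: let IndexWriter do the PK merging while we add documents.
void indexFile(Directory dir, Iterable<Row> rows) throws Exception {
    IndexWriter writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_29),
            IndexWriter.MaxFieldLength.UNLIMITED);
    try {
        for (Row row : rows) {
            Document doc = row.toDocument();         // placeholder helper
            // Concatenate the PK field values into one un-analyzed field
            // so that a single Term identifies the row.
            String pk = row.pkValue();               // e.g. "field1|field2"
            doc.add(new Field("_pk", pk, Field.Store.NO,
                    Field.Index.NOT_ANALYZED));
            // Deletes any earlier document with the same _pk term,
            // committed or still buffered, then adds the new one.
            writer.updateDocument(new Term("_pk", pk), doc);
        }
        writer.commit();                             // single commit at the end
    } finally {
        writer.close();
    }
}

Does updateDocument really behave that way against documents added earlier in the same session, before any commit?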

Second, for step 3 above, if searching the existing index is NOT avoidable, what is the fastest way to search by the PK? Of course we have already indexed all the PK fields. When we add new data, we have to search the existing index by the PK fields for every row, to see whether it exists or not; if it does, we remove it and add the new one.
We construct the query from the PK fields at run time, then search row by row. This is also very bad for indexing performance.
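
For reference, this is roughly how we build the per-row query today (the field and value arrays come from the client metadata):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// One query per row: every PK field must match (AND semantics).
BooleanQuery pkQuery(String[] pkFields, String[] pkValues) {
    BooleanQuery q = new BooleanQuery();
    for (int i = 0; i < pkFields.length; i++) {
        q.add(new TermQuery(new Term(pkFields[i], pkValues[i])),
              BooleanClause.Occur.MUST);
    }
    return q;
}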

Here is what I am thinking:
1) Can I use IndexReader.termDocs(Term)? I heard it is much faster than going through a query search. Is that right? (A sketch of what I mean follows this list.)
2) Currently we do the search row by row. Should I batch it instead? For example, combine 100 PK lookups into one search using a BooleanQuery of term clauses, so that one search gives me back all the rows among those 100 PKs that are already in the index, and then remove them from the index using the result set. That way I only need 1/100 of the search requests, which should be much faster than row by row, at least in theory. (A batching sketch also follows below.)
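
For 1), this is the kind of raw term lookup I mean, assuming the same synthetic "_pk" field as above; it skips scoring entirely and just walks the postings for one term:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// True if any non-deleted document carries this PK term.
boolean pkExists(IndexReader reader, String pk) throws Exception {
    TermDocs td = reader.termDocs(new Term("_pk", pk));
    try {
        return td.next();   // positions on the first matching doc, if any
    } finally {
        td.close();
    }
}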
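
For 2), a batched lookup might look like the following. BooleanQuery allows 1024 clauses by default, so 100 SHOULD clauses is fine; the batch size and the "_pk" field are my own choices. (Alternatively, maybe IndexWriter.deleteDocuments(Query) could delete a whole batch in one call, without collecting the hits first?)

import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// One search per batch of PKs instead of one search per row.
TopDocs findExistingPks(IndexReader reader, List<String> pkBatch)
        throws Exception {
    BooleanQuery batch = new BooleanQuery();
    for (String pk : pkBatch) {                      // e.g. 100 PKs at a time
        batch.add(new TermQuery(new Term("_pk", pk)),
                  BooleanClause.Occur.SHOULD);       // OR the PKs together
    }
    IndexSearcher searcher = new IndexSearcher(reader);
    return searcher.search(batch, pkBatch.size());   // each hit is a stale row
}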


Please let me know your feedback. If you have ever dealt with PK data support, please share your thoughts and experience.

Thanks for your kind help.
                                          