I have a fairly straightforward task: I have a collection of N documents and a 
set of "hot" words. I need to find all occurrences of these words in all the 
docs.



The original use case was that I would get all the docs at once. In this case, 
I:

1. Create a single index for all the docs

2. Loop over all hot words. For each word, find all hits in all the docs

3. Collect and rearrange the hit info so that the hits are grouped by 
indexed doc
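
To make these three steps concrete, here is a minimal Python sketch. It is 
only an illustration: it uses a toy in-memory inverted index as a stand-in 
for the real indexing library, and the names build_index and hits_per_doc 
are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Step 1: map each word to a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index


def hits_per_doc(index, hot_words):
    """Steps 2 and 3: look up every hot word, then regroup hits by doc."""
    per_doc = defaultdict(list)
    for word in hot_words:
        # .get() avoids creating empty entries for words that never occur
        for doc_id, pos in index.get(word, []):
            per_doc[doc_id].append((word, pos))
    return dict(per_doc)


# Toy data, assumed for illustration only
docs = {"doc1": "the quick brown fox", "doc2": "the lazy dog and the fox"}
hot_words = ["fox", "the"]
batch_hits = hits_per_doc(build_index(docs), hot_words)
```

Here batch_hits maps each doc id to its list of (word, position) hits.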



However, it looks like there might be a different use case: the user might want 
to add one document at a time to the collection and see the search results 
immediately. So for this case I am now doing the following:

1. Loop over docs i = 1 : N. For each doc:

1.1 If i == 1 then create index else update index

1.2 Loop over all hot words. For each word, find all hits in all the docs that 
have been indexed so far, i.e. docs 1 through i

1.3 Collect and rearrange
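
As a rough Python sketch of this loop (again with a toy in-memory index 
standing in for the real library, and with hypothetical names add_to_index 
and search_all), the redundancy shows up in the hits_examined counter: 
every iteration goes over the hits of all earlier docs again.

```python
from collections import defaultdict


def add_to_index(index, doc_id, text):
    """Step 1.1: add one document to the (possibly empty) index."""
    for pos, word in enumerate(text.lower().split()):
        index[word].append((doc_id, pos))


def search_all(index, hot_words):
    """Steps 1.2 and 1.3: search everything indexed so far, group by doc."""
    per_doc = defaultdict(list)
    for word in hot_words:
        for doc_id, pos in index.get(word, []):
            per_doc[doc_id].append((word, pos))
    return dict(per_doc)


# Toy data, assumed for illustration only
incoming = [("doc1", "the quick brown fox"),
            ("doc2", "the lazy dog and the fox")]
hot_words = ["fox", "the"]

index = defaultdict(list)
hits_examined = 0  # total hits collected over all iterations
for doc_id, text in incoming:
    add_to_index(index, doc_id, text)
    results = search_all(index, hot_words)  # covers docs 1 through i
    hits_examined += sum(len(hits) for hits in results.values())
```

With this toy data the collection contains 5 hits in total, but the loop 
collects 7, because the 2 hits in doc1 are found again when doc2 is added.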



Of course, this is not particularly efficient, especially because I am forced 
to do a lot of redundant work by searching through docs 1:i instead of just 
doc i at each iteration. This is because, if I understand it correctly, I 
can't specify "search only the part of the index that corresponds to doc X". 
Or can I?



Is there any way to make this incremental index/search more efficient? For 
instance, is it at all possible to restrict where in the index a search for 
hits is performed? Or any other optimization?



Thanks much



Ilya Zavorin
