Some thoughts around the use of reader.isDeleted and hasDeletions

Shai Erera Mon, 08 Jun 2009 13:07:32 -0700

Hi

I recently read CHANGES to learn more about the readOnly parameter
IndexReader now supports, and came across LUCENE-1329 with a comment that
isDeleted was made not synchronized if readOnly=true (e.g.
ReadOnlyIndexReader), which can affect search code, as it is usually the
bottleneck for search operations.


I searched the code and was surprised to see isDeleted and hasDeletions are
not called from any search code. Instead, the code, such as SegmentTermDocs,
uses the deletedDocs instance directly. So in fact isDeleted wasn't the
bottleneck (unless this was part of the changes in 1329). Anyway, doesn't
matter, that's good !

However, I did find out some indexing code whic calls these two, like
SegmentMerger when it merges fields and term vectors. I think that we can
improve that code by writing some specialized code for merging - if the
reader has no deletions, there's no point checking for each document if
there are deletions and/or if the document was deleted. In fact,
SegmentMerger checks for each document: (1) whether the reader has deletions
and if the document was deleted, (2) if the reader has a matching reader and
(3) if checkAbort is not null.

I have a couple of suggestions to simplify that code:
1. Create two specialized copyFieldsWithDeletions/copyFieldsNoDeletions to
get rid of the unnecessary if (hasDeletions) check for each doc.
2. In these, check if the reader has matching reader or not, and execute the
appropriate code in each case.
3. Also, check up front if checkAbort != null.
3.1 (3) duplicates the code - so perhaps instead I can create a dummy
checkAbort, which does nothing? That way we'll always call
checkAbort.work(). But this adds a method call, so I'm not sure.

(same optimizations for mergeVectors()).

In addition, I think something can be done w/ SegmentTermDocs. Either create
a specialized STD based on whether it has deletions or not, or create a
dummy BitVector which returns false for every check. That way we can
eliminate the checks in each next(), skipTo(). Dummy BitVector will leave
the file as-is, but will add a method call, so I think I lean towards the
specialized STD. This can be as simple as making STD abstract, with a static
create() method which creates the right instance.

I believe there are other places where we can make such optimizations. If
the overall direction seems right to you, I can open an issue and start to
work on a patch. Currently I've patched SegmentMerger, and besides the class
getting bigger, nothing bad happened (meaning all tests pass), though it
might be interesting to check how it affects performance.

What do you think?

Shai

Some thoughts around the use of reader.isDeleted and hasDeletions

Reply via email to