Re: adding "explicit commits" to Lucene?

Doron Cohen Mon, 15 Jan 2007 23:13:05 -0800

The problem Ning pointed out seems to stem from the two roles of
IndexReader:
(1) reading (read only) the Index for searching and for inspecting its
content;
(2) modifying the index by deleting documents;


This is further complicated by the fact that often a reader is used for
search and then returned docs are deleted by docid.

Perhaps one possibility is to define DocumentDeleter as a subclass of
IndexReader. It would always open the topmost generation, and it would
(as today) fail to delete if it is no longer at the topmost generation. It
would support search, but would be recommended for update purposes
only. Mmmm...  This is becoming too complex, I'm afraid.

So a better (?) option: (1) add deleteByTerm() (and deleteByQuery()) to
IndexWriter (as in NewIndexModifier) - these deletion methods would then
be performed on the topmost generation, same as addDocument(); (2)
IndexReader delete() methods would fail (as today) if the reader is not at
the topmost generation - so they would only work when all previous changes
were committed (which is always true if an application is using the
default, auto commit).
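
To make the two rules above concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: SketchIndex, SketchReader and every method name in it are hypothetical stand-ins, and the single generation counter is an assumption made for illustration, not a description of how Lucene tracks generations.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the shared index state; NOT a Lucene class.
class SketchIndex {
    private long generation = 0;                  // bumped on every change
    private final List<String> docs = new ArrayList<>();

    long generation() { return generation; }
    int size() { return docs.size(); }

    // Writer-side operations always act on the topmost generation.
    void addDocument(String term) {
        docs.add(term);
        generation++;
    }

    // Writer-side delete: same rule as addDocument(), never stale.
    int deleteByTerm(String term) {
        int before = docs.size();
        docs.removeIf(d -> d.equals(term));
        int removed = before - docs.size();
        if (removed > 0) {
            generation++;
        }
        return removed;
    }

    // Reader-side delete: refused unless the reader is at the topmost generation.
    void deleteFromReader(long readerGeneration, int docNum) {
        if (readerGeneration != generation) {
            throw new IllegalStateException("reader is not at the topmost generation");
        }
        docs.remove(docNum);                      // List.remove(int index)
        generation++;
    }
}

// Hypothetical reader that remembers the generation it was opened on.
class SketchReader {
    private final SketchIndex index;
    private long openedAt;

    SketchReader(SketchIndex index) {
        this.index = index;
        this.openedAt = index.generation();
    }

    boolean isCurrent() { return openedAt == index.generation(); }

    void delete(int docNum) {
        index.deleteFromReader(openedAt, docNum);
        openedAt = index.generation();            // its own delete keeps it current
    }
}
```

In this sketch a reader opened before a later addDocument() gets an IllegalStateException from delete(), while deleteByTerm() on the index side always succeeds, which is the asymmetry the proposal relies on.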

One comment about permanent IDs (PIDs) - I think that Lucene's choice to
not maintain PIDs on behalf of applications is the right way to go. For
efficiency, even if PIDs were maintained by Lucene, internal changing IDs
would exist and low level operations would use those IDs. But in addition
Lucene would need to maintain the mapping between the two - IDs and PIDs -
and notify an application adding a doc what PID was assigned to it, etc.
Seems better to leave this for applications.
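
As a rough illustration of the bookkeeping described above, the following plain-Java sketch keeps a PID for every internal doc id and renumbers the internal ids when a merge expunges deletes. PidSegmentSketch and its methods are hypothetical names invented for this example; nothing here is Lucene code or a claim about how Lucene would implement it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one segment that also tracks permanent IDs (PIDs).
class PidSegmentSketch {
    // List position = internal (changing) doc id; value = permanent id.
    private final List<Long> pidByDocId = new ArrayList<>();
    private final List<Boolean> deleted = new ArrayList<>();
    private long nextPid = 0;

    // Adding a doc must report the assigned PID back to the application.
    long add() {
        long pid = nextPid++;
        pidByDocId.add(pid);
        deleted.add(false);
        return pid;
    }

    // Map a PID to the current internal doc id, or -1 if absent or deleted.
    int docIdOf(long pid) {
        int id = pidByDocId.indexOf(pid);
        return (id >= 0 && !deleted.get(id)) ? id : -1;
    }

    // Deletes only mark the doc; internal ids stay put until a merge.
    void delete(long pid) {
        int id = docIdOf(pid);
        if (id >= 0) {
            deleted.set(id, true);
        }
    }

    // A merge expunges deleted docs, so surviving docs get new internal ids
    // and the PID mapping has to be rewritten along with them.
    void merge() {
        List<Long> kept = new ArrayList<>();
        for (int i = 0; i < pidByDocId.size(); i++) {
            if (!deleted.get(i)) {
                kept.add(pidByDocId.get(i));
            }
        }
        pidByDocId.clear();
        pidByDocId.addAll(kept);
        deleted.clear();
        for (int i = 0; i < kept.size(); i++) {
            deleted.add(false);
        }
    }
}
```

Even in this toy version, the low-level operations still work on the changing ids, and the mapping must be rebuilt on every merge, which is the extra cost the paragraph above argues is better left to applications.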

Doron

Chuck Williams <[EMAIL PROTECTED]> wrote on 15/01/2007 21:49:05:

> My interest is transactions, not making doc-id's permanent.
> Specifically, the ability to ensure that a group of adds either all go
> into the index or none go into the index, and to ensure that if none go
> into the index that the index is not changed in any way.
>
> I have UID's but they cannot ensure the latter property, i.e. they
> cannot ensure side-effect-free rollbacks.
>
> Yes, if you have no reliance on internal Lucene structures like doc-id's
> and segments, then that shouldn't matter.  But many capabilities have
> such reliance for good reasons.  E.g., ParallelReader, which is a public
> supported class in Lucene, requires doc-id synchronization.  There are
> similar good reasons for an application to take advantage of doc-ids.
>
> Lucene uses doc-id's in many of its API's and so it is not surprising
> that many applications rely on them, and I'm sure misuse them not fully
> understanding the semantics and uncertainties of doc-id changes due to
> merging segments with deletes.
>
> Applications can use doc-ids for legitimate and beneficial purposes
> while remaining semantically valid.  Making such capabilities efficient
> and robust in all cases is facilitated by application control over when
> doc-id's and segment structure change at a granularity larger than the
> single Document.
>
> If I had a vote it would be +1 on the direction Michael has proposed,
> assuming it can be done robustly and without performance penalty.
>
> Chuck
>
>
> robert engels wrote on 01/15/2007 07:34 PM:
> > I honestly think that having a unique OID as an indexed field and
> > putting a layer on top of Lucene is the best solution to all of this.
> > It makes it almost trivial, and you can implement transaction handling
> > in a variety of ways.
> >
> > Attempting to make the doc ids "permanent" is a tough challenge,
> > considering the original design called for them to be "non permanent".
> >
> > It seems doubtful that you cannot have some sort of primary key any
> > way and be this concerned about the transactional nature of Lucene.
> >
> > I vote -1 on all of this. I think it will detract from the simple and
> > efficient storage mechanism that Lucene uses.
> >
> > On Jan 15, 2007, at 11:19 PM, Chuck Williams wrote:
> >
> >> Ning Li wrote on 01/15/2007 06:29 PM:
> >>> On 1/14/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> >>>>   * The "support deleteDocuments in IndexWriter" (LUCENE-565)
> >>>>     feature could have a more efficient implementation (just like
> >>>>     Solr) when autoCommit is false, because deletes don't need to be
> >>>>     flushed until commit() is called.  Whereas, now, they must be
> >>>>     aggressively flushed on each checkpoint.
> >>>
> >>> If a reader can only open snapshots both for search and for
> >>> modification, I think another change is needed besides the ones
> >>> listed: assume the latest snapshot is segments_5 and the latest
> >>> checkpoint is segmentsx_7 with 2 new segments, then a reader opens
> >>> snapshot segments_5, performs a few deletes and writes a new
> >>> checkpoint segmentsx_8. The summary file segmentsx_8 should include
> >>> the 2 new segments which are in segmentsx_7 but not in segments_5.
> >>> Such segments to include are easily identifiable only if they are not
> >>> merged with segments in the latest snapshot... All these won't be
> >>> necessary if a reader always opens the latest checkpoint for
> >>> modification, which will also support deletion of non-committed
> >>> documents.
> >> This problem seems worse.  I don't see how a reader and a writer can
> >> independently compute and write checkpoints.  The adds in the writer
> >> don't just create new segments, they replace existing ones through
> >> merging.  And the merging changes doc-ids by expunging deletes.  It
> >> seems that all deletes must be based on the most recent checkpoint, or
> >> merging of checkpoints to create the next snapshot will be
> >> considerably more complex.
> >>
> >> Chuck
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
