That is true, but you need to use the same techniques as any db. You
need to write a tx log file. This has the semantics that you know
whether it has committed, just like a db. You check that it has
committed before writing anything to the actual index. Since Lucene
does not modify any segments, it is trivial to restart if this portion
fails. Just delete the uncommitted segments on startup, and replay the
tx log.
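
Something like the following sketch (the class and method names are
hypothetical, not from any existing library): append each operation to
the log and sync it to disk before touching the index; on startup,
replay the log.

    import java.io.*;

    // Hypothetical write-ahead tx log: append an operation record and
    // force it to disk before the corresponding index write happens.
    public class TxLog {
        private final RandomAccessFile log;

        public TxLog(File file) throws IOException {
            log = new RandomAccessFile(file, "rw");
            log.seek(log.length());            // append to existing log
        }

        // Record the operation before applying it to the index.
        public void append(String op) throws IOException {
            log.writeUTF(op);                  // e.g. "delete OID 42"
            log.getFD().sync();                // durable before index write
        }

        // On startup: hand every logged operation back to the caller,
        // which re-applies any that are not in a committed segments
        // file. The log would be truncated after a successful commit.
        public void replay(Handler handler) throws IOException {
            log.seek(0);
            while (log.getFilePointer() < log.length()) {
                handler.apply(log.readUTF());
            }
        }

        public interface Handler {
            void apply(String op) throws IOException;
        }
    }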

As for the ParallelReader, that doesn't make sense to me (though I
admit I don't understand its purpose), since the javadoc states that
all sub-indexes must be updated in the same manner. Where does the
benefit come from then? It seems you are actually performing more
operations (with 2 sub-indexes you are writing twice as many documents
- the same amount of field data, though). Is there some other
information besides the javadoc that explains the usage/benefit?

Using a federated search where different fields are in different
indexes would be very difficult, as you state, and would involve long
join lists (and the scoring logic is VERY difficult unless you create
a new "memory index" containing all the results and then run the
complete query against it).

Putting the documents in different indexes and joining/weighing the results is rather easy and works quite well.
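
For illustration, a rough sketch of that OID-hash routing (the
two-shard layout and field names here are made up, not our actual
code):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Sketch: route each document to a shard by hashing its OID, so a
    // query can later run against all shards in parallel and the
    // results can be joined by OID.
    public class ShardedWriter {
        private final IndexWriter[] shards;

        public ShardedWriter(String[] paths) throws Exception {
            shards = new IndexWriter[paths.length];
            for (int i = 0; i < paths.length; i++) {
                shards[i] = new IndexWriter(paths[i],
                                            new StandardAnalyzer(), true);
            }
        }

        public void add(String oid, Document doc) throws Exception {
            doc.add(new Field("oid", oid, Field.Store.YES,
                              Field.Index.UN_TOKENIZED));
            // stable routing: same OID always lands in the same shard
            int shard = (oid.hashCode() & 0x7fffffff) % shards.length;
            shards[shard].addDocument(doc);
        }
    }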


On Jan 16, 2007, at 12:38 AM, Chuck Williams wrote:

robert engels wrote on 01/15/2007 08:11 PM:
If that is all you need, I think it is far simpler:

If you have an OID, then all that is required is to write the
operations (delete this OID, insert this document, etc...) to a
separate disk file.

Once the file is permanently on disk, it is simple to just keep
replaying the file until it succeeds.
There is no guarantee that a given operation will ever succeed, so
this doesn't work.

This is what we do in our search server.

I am not completely familiar with parallel reader, but in reading the
JavaDoc I don't see the benefit - since you have to write the
documents to both indexes anyway??? Why is it of any benefit to break
the document into multiple parts?
I'm sure Doug had reasons to write it. My reason to use it is for fast bulk updates, updating one subindex without having to update the others.

If you have OIDs available, parallel reader can be accomplished in a
far simpler and more efficient manner - we have a completely federated
server implementation that was trivial - less than 100 lines of code.
We did it simply: create a hash from the OID, store the document in a
different index depending on the hash, then run the query across all
indexes in parallel, joining the results.
Lucene has this built in via MultiSearcher and RemoteSearchable. It is
a bit more complex due to the necessity to normalize Weights, e.g. to
ensure that the same docFreqs, reflecting the union of all indexes, are
used for the search in each.
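
For reference, a minimal sketch of that built-in path (the index paths
are placeholders): MultiSearcher distributes the query across the
sub-searchers and merges the scored results.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    // Sketch: search several shard indexes as one logical index.
    // MultiSearcher merges results and normalizes weights so docFreqs
    // reflect the union of all sub-indexes.
    public class FederatedSearch {
        public static void main(String[] args) throws Exception {
            Searchable[] shards = {
                new IndexSearcher("/index/shard0"),  // placeholder paths
                new IndexSearcher("/index/shard1")
            };
            Searcher searcher = new MultiSearcher(shards);
            Hits hits = searcher.search(new TermQuery(new Term("f", "a")));
            System.out.println(hits.length() + " hits across all shards");
            searcher.close();
        }
    }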

Federated searching addresses different requirements than
ParallelReader.  Yes, I agree that ParallelReader could be done using
UID's, but I believe it would be a considerably more expensive
representation to search.  The method used in federated search to
distribute the same query to each index is not applicable. Breaking the
query up into parts that are applied against each parallel index, with
each query part referencing only the fields in a single parallel index,
would be a challenge with complex nested queries supporting all of the
operators, and much less efficient than ParallelReader.  Modifying all
the primitive Query subclasses to use UID's instead of doc-ids would
be an alternative, but would be a lot of work and not nearly as
efficient as the existing Lucene index representation that sorts
postings by doc-id.

To illustrate this, consider the simple query, f:a AND g:b, where f and
g are in two different parallel indexes.  Performing the f and g
queries separately on the different indexes to get possibly very long
lists of results and then joining those by UID will be much slower than
BooleanQuery operating on ParallelReader with doc-id sorted postings.
The alternative of a UID-based BooleanQuery would have similar
challenges unless the postings were sorted by UID.  But hey, that's
permanent doc-ids.
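
To make the comparison concrete, here is a minimal sketch of that
query running against a ParallelReader (the index paths and field
layout are assumed): both clauses are evaluated in a single pass over
doc-id sorted postings.

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.ParallelReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    // Sketch: field f lives in one sub-index and field g in another,
    // with doc-ids kept in sync, so BooleanQuery can intersect the two
    // postings lists directly by doc-id.
    public class ParallelExample {
        public static void main(String[] args) throws Exception {
            ParallelReader reader = new ParallelReader();
            reader.add(IndexReader.open("/index/f-fields"));  // placeholders
            reader.add(IndexReader.open("/index/g-fields"));

            BooleanQuery query = new BooleanQuery();          // f:a AND g:b
            query.add(new TermQuery(new Term("f", "a")),
                      BooleanClause.Occur.MUST);
            query.add(new TermQuery(new Term("g", "b")),
                      BooleanClause.Occur.MUST);

            Searcher searcher = new IndexSearcher(reader);
            Hits hits = searcher.search(query);
            System.out.println(hits.length() + " matching documents");
            searcher.close();
        }
    }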

Chuck


On Jan 15, 2007, at 11:49 PM, Chuck Williams wrote:

My interest is transactions, not making doc-id's permanent.
Specifically, the ability to ensure that a group of adds either all go
into the index or none do, and to ensure that if none go in, the index
is not changed in any way.

I have UID's but they cannot ensure the latter property, i.e. they
cannot ensure side-effect-free rollbacks.

Yes, if you have no reliance on internal Lucene structures like
doc-id's and segments, then that shouldn't matter. But many
capabilities have such reliance for good reasons. E.g., ParallelReader,
which is a public, supported class in Lucene, requires doc-id
synchronization. There are similar good reasons for an application to
take advantage of doc-ids.

Lucene uses doc-id's in many of its API's, so it is not surprising
that many applications rely on them; I'm sure many misuse them, not
fully understanding the semantics and uncertainties of doc-id changes
due to merging segments with deletes.

Applications can use doc-ids for legitimate and beneficial purposes
while remaining semantically valid. Making such capabilities efficient and robust in all cases is facilitated by application control over when doc-id's and segment structure change at a granularity larger than the
single Document.

If I had a vote it would be +1 on the direction Michael has proposed,
assuming it can be done robustly and without performance penalty.

Chuck


robert engels wrote on 01/15/2007 07:34 PM:
I honestly think that having a unique OID as an indexed field and
putting a layer on top of Lucene is the best solution to all of this. It makes it almost trivial, and you can implement transaction handling
in a variety of ways.
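
A rough sketch of such a layer (the "oid" field name and the
delete-then-add sequence are illustrative; LUCENE-565, mentioned
below, would move the delete into IndexWriter):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // Sketch of an OID layer: every document carries a unique,
    // untokenized "oid" field, and an update is delete-by-OID followed
    // by re-add, so the application never touches internal doc-ids.
    public class OidIndex {
        private final String path;

        public OidIndex(String path) { this.path = path; }

        public void update(String oid, Document doc) throws Exception {
            IndexReader reader = IndexReader.open(path);
            reader.deleteDocuments(new Term("oid", oid)); // drop old copy
            reader.close();

            doc.add(new Field("oid", oid, Field.Store.YES,
                              Field.Index.UN_TOKENIZED));
            IndexWriter writer =
                new IndexWriter(path, new StandardAnalyzer(), false);
            writer.addDocument(doc);
            writer.close();
        }
    }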

Attempting to make the doc ids "permanent" is a tough challenge,
considering the original design called for them to be "non-permanent".

It seems doubtful that you would be this concerned about the
transactional nature of Lucene and not have some sort of primary key
anyway.

I vote -1 on all of this. I think it will detract from the simple and
efficient storage mechanism that Lucene uses.

On Jan 15, 2007, at 11:19 PM, Chuck Williams wrote:

Ning Li wrote on 01/15/2007 06:29 PM:
On 1/14/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
  * The "support deleteDocuments in IndexWriter" (LUCENE-565)
feature
could have a more efficient implementation (just like Solr) when autoCommit is false, because deletes don't need to be flushed
    until commit() is called.  Whereas, now, they must be
aggressively
    flushed on each checkpoint.

If a reader can only open snapshots both for search and for
modification, I think another change is needed besides the ones
listed: assume the latest snapshot is segments_5 and the latest
checkpoint is segmentsx_7 with 2 new segments, then a reader opens
snapshot segments_5, performs a few deletes and writes a new
checkpoint segmentsx_8. The summary file segmentsx_8 should include
the 2 new segments which are in segmentsx_7 but not in segments_5.
Such segments to include are easily identifiable only if they are not
merged with segments in the latest snapshot... All these won't be
necessary if a reader always opens the latest checkpoint for
modification, which will also support deletion of non-committed
documents.
This problem seems worse. I don't see how a reader and a writer can
independently compute and write checkpoints. The adds in the writer
don't just create new segments, they replace existing ones through
merging. And the merging changes doc-ids by expunging deletes. It
seems that all deletes must be based on the most recent checkpoint, or
merging of checkpoints to create the next snapshot will be
considerably more complex.

Chuck

