On 10/30/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
Yonik Seeley wrote:
> On 10/18/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
>> We assume that, within an index, a file with a given name is written
>> only once.
> Is this necessary, and will we need the lockless patch (that avoids
> renaming or rewriting *any* files), or is Lucene's current index
> behavior sufficient?

It's not strictly required, but it would make index synchronization a
lot simpler. Yes, I was assuming the lockless patch would be committed
to Lucene before this project gets very far.  Something more than that
would be required in order to keep old versions, but this could be as
simple as a Directory subclass that refuses to remove files for a time.

Or a snapshot (hard links) mechanism.
Lucene would also need a way to open a specific index version (rather
than just the latest), but I guess that could also be hacked into
Directory by hiding later "segments" files (assumes lockless is

> It's unfortunate the master needs to be involved on every document add.

That should not normally be the case.

Ahh... I had assumed that "id" in the following method was document id:
 IndexLocation getUpdateableIndex(String id);

I see now it's index id.

But what is index id exactly?  Looking at the example API you laid
down, it must be a single physical index (as opposed to a logical
index).  In which case, is it entirely up to the client to manage
multi-shard indicies?  For example, if we had a "photo" index broken
up into 3 shards, each shard would have a separate index id and it
would be up to the client to know this, and to query across the
different "photo0", "photo1", "photo2" indicies.  The master would
have no clue those indicies were related.  Hmmm, that doesn't work
very well for deletes though.

It seems like there should be the concept of a logical index, that is
composed of multiple shards, and each shard has multiple copies.

Or were you thinking that a cluster would only contain a single
logical index, and hence all different index ids are simply different
shards of that single logical index?  That would seem to be consistent
with ClientToMasterProtocol .getSearchableIndexes() lacking an id

I was not imagining a real-time system, where the next query after a
document is added would always include that document.  Is that a
requirement?  That's harder.

Not real-time, but it would be nice if we kept it close to what Lucene
can currently provide.
Most people seem fine with a latency of minutes.

At this point I'm mostly trying to see if this functionality would meet
the needs of Solr, Nutch and others.

It depends on the project scope and how extensible things are.
It seems like the master would be a WAR, capable of running stand-alone.
What about index servers (slaves)?  Would this project include just
the interfaces to be implemented by Solr/Nutch nodes, some common
implementation code behind the interfaces in the form of a library, or
also complete standalone WARs?

I'd need to be able to extend the ClientToSlave protocol to add
additional methods for Solr (for passing in extra parameters and
returning various extra data such as facets, highlighting, etc).

Must we include a notion of document identity and/or document version in
the mechanism? Would that facillitate updates and coherency?

It doesn't need to be in the interfaces I don't think, so it depends
on the scope of the index server implementations.


Reply via email to