Re: [PROPOSAL] index server project

Stefan Groschupf Mon, 06 Nov 2006 05:18:39 -0800

Hi,

Do people think we are already at a stage where we can set up some basic infrastructure, like a mailing list and wiki, and move the discussion to the new mailing list? Maybe set up an incubator project?


I would be happy to help with such basic tasks.

Stefan



On 31.10.2006 at 22:03, Yonik Seeley wrote:

On 10/30/06, Doug Cutting <[EMAIL PROTECTED]> wrote:

Yonik Seeley wrote:
> On 10/18/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
>> We assume that, within an index, a file with a given name is written
>> only once.
>
> Is this necessary, and will we need the lockless patch (that avoids
> renaming or rewriting *any* files), or is Lucene's current index
> behavior sufficient?

It's not strictly required, but it would make index synchronization a
lot simpler. Yes, I was assuming the lockless patch would be committed
to Lucene before this project gets very far. Something more than that
would be required in order to keep old versions, but this could be as
simple as a Directory subclass that refuses to remove files for a time.
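
A minimal sketch of the "refuses to remove files for a time" idea, as I read it. It does not use Lucene's Directory API; `DelayedDeleteFiles`, its methods, and the retention parameter are hypothetical names chosen for illustration. A delete request is only recorded, and the file is removed once a retention period has passed.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a Directory subclass; not Lucene's actual API. */
class DelayedDeleteFiles {
    private final Set<String> files = new HashSet<>();
    // file name -> time (ms) at which deletion was first requested
    private final Map<String, Long> pendingDeletes = new HashMap<>();
    private final long retentionMillis;

    DelayedDeleteFiles(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    void createFile(String name) {
        files.add(name);
    }

    /** Records the delete request but keeps the file readable for now. */
    void deleteFile(String name, long nowMillis) {
        if (files.contains(name)) {
            pendingDeletes.putIfAbsent(name, nowMillis);
        }
    }

    boolean fileExists(String name) {
        return files.contains(name);
    }

    /** Removes files whose retention period has elapsed; returns how many. */
    int purge(long nowMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = pendingDeletes.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= retentionMillis) {
                files.remove(entry.getKey());
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

The clock is passed in as a parameter so the behavior can be checked without sleeping; a real implementation would presumably also need to survive restarts.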


Or a snapshot (hard links) mechanism.
Lucene would also need a way to open a specific index version (rather
than just the latest), but I guess that could also be hacked into
Directory by hiding later "segments" files (assumes lockless is
committed).
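
To make the "hiding later segments files" idea concrete, here is a rough sketch. It assumes commit files are named `segments_<generation>` with the generation encoded in base 36; both the naming and the radix are assumptions of this sketch, and `SegmentsVersionFilter` is a hypothetical helper, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: filters a file listing so later commits are invisible. */
class SegmentsVersionFilter {
    static final String PREFIX = "segments_"; // assumed naming scheme
    static final int RADIX = 36;              // assumed generation encoding

    /** Returns the files visible when the index is pinned at maxGeneration. */
    static List<String> visibleFiles(List<String> allFiles, long maxGeneration) {
        List<String> visible = new ArrayList<>();
        for (String name : allFiles) {
            if (name.startsWith(PREFIX)) {
                try {
                    long gen = Long.parseLong(name.substring(PREFIX.length()), RADIX);
                    if (gen > maxGeneration) {
                        continue; // hide commits newer than the pinned version
                    }
                } catch (NumberFormatException e) {
                    // Suffix is not a generation number; treat as an ordinary file.
                }
            }
            visible.add(name);
        }
        return visible;
    }
}
```

A Directory wrapper could apply such a filter to its file listing, so that a reader opening "the latest" commit actually gets the pinned one. This only works if the files referenced by that older commit are still present.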

> It's unfortunate the master needs to be involved on every document add.
That should not normally be the case.

Ahh... I had assumed that "id" in the following method was document id:

 IndexLocation getUpdateableIndex(String id);

I see now it's index id.

But what is index id exactly?  Looking at the example API you laid
down, it must be a single physical index (as opposed to a logical
index).  In which case, is it entirely up to the client to manage
multi-shard indices?  For example, if we had a "photo" index broken
up into 3 shards, each shard would have a separate index id and it
would be up to the client to know this, and to query across the
different "photo0", "photo1", "photo2" indices.  The master would
have no clue those indices were related.  Hmmm, that doesn't work
very well for deletes though.

It seems like there should be the concept of a logical index, that is
composed of multiple shards, and each shard has multiple copies.
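
A rough sketch of that data model, to show what the master would need to track. `LogicalIndex`, its methods, and the use of plain strings for copy locations are all hypothetical; nothing here comes from the proposed API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model: one logical index, many shards, many copies per shard. */
class LogicalIndex {
    private final String name;
    // shard id -> locations of that shard's copies (plain strings in this sketch)
    private final Map<String, List<String>> shards = new LinkedHashMap<>();

    LogicalIndex(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }

    void addCopy(String shardId, String location) {
        shards.computeIfAbsent(shardId, k -> new ArrayList<>()).add(location);
    }

    List<String> shardIds() {
        return new ArrayList<>(shards.keySet());
    }

    /** A search needs one copy of every shard; this picks the first of each. */
    List<String> searchTargets() {
        List<String> targets = new ArrayList<>();
        for (List<String> copies : shards.values()) {
            targets.add(copies.get(0));
        }
        return targets;
    }

    /** A delete has to reach every copy of every shard eventually. */
    List<String> allCopies() {
        List<String> targets = new ArrayList<>();
        for (List<String> copies : shards.values()) {
            targets.addAll(copies);
        }
        return targets;
    }
}
```

With "photo" holding shards "photo0" and "photo1", the client would only name "photo", and the fan-out for both queries and deletes could be resolved from this structure.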

Or were you thinking that a cluster would only contain a single
logical index, and hence all different index ids are simply different
shards of that single logical index?  That would seem to be consistent
with ClientToMasterProtocol.getSearchableIndexes() lacking an id
argument.

I was not imagining a real-time system, where the next query after a
document is added would always include that document.  Is that a
requirement?  That's harder.


Not real-time, but it would be nice if we kept it close to what Lucene
can currently provide.
Most people seem fine with a latency of minutes.

At this point I'm mostly trying to see if this functionality would meet
the needs of Solr, Nutch and others.


It depends on the project scope and how extensible things are.

It seems like the master would be a WAR, capable of running stand-alone.

What about index servers (slaves)?  Would this project include just
the interfaces to be implemented by Solr/Nutch nodes, some common
implementation code behind the interfaces in the form of a library, or
also complete standalone WARs?

I'd need to be able to extend the ClientToSlave protocol to add
additional methods for Solr (for passing in extra parameters and
returning various extra data such as facets, highlighting, etc).

Must we include a notion of document identity and/or document version in
the mechanism? Would that facilitate updates and coherency?


I don't think it needs to be in the interfaces, so it depends
on the scope of the index server implementations.

-Yonik


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
search tech for web 2.1
Menlo Park, California
http://www.101tec.com
