Hi,
In terms of which project best fits my needs, my gut feeling is that
dlucene is pretty close. It supports incremental updates and doesn't
build in dependencies on systems like HDFS or Terracotta (I don't yet
understand all the implications of those systems, so I would rather
keep things simple if possible).
Upgrades...
The way we solve this with Katta is that we simply deploy a new small
index and use * in the client instead of a fixed index name.
Then once a night we merge all the small indexes together into one big
new index, since having many small indexes slows things down.
To solve the problem of duplicate documents, each document gets a
timestamp, and in the client we do a simple dedup based on a key,
always keeping the document with the latest timestamp.
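The client-side dedup described above can be sketched roughly as follows. This is a minimal illustration with hypothetical names (Hit, latestByKey), not Katta's actual classes: each hit carries an application-level key plus the document's timestamp, and for duplicate keys only the hit with the latest timestamp survives.

```java
import java.util.HashMap;
import java.util.Map;

public class Dedup {
    // A search hit with the application-level key and the document's
    // timestamp attached (hypothetical shape, for illustration only).
    public static class Hit {
        public final String key;
        public final long timestamp;
        public Hit(String key, long timestamp) {
            this.key = key;
            this.timestamp = timestamp;
        }
    }

    // Returns one Hit per key: the one with the greatest timestamp.
    public static Map<String, Hit> latestByKey(Iterable<Hit> hits) {
        Map<String, Hit> latest = new HashMap<>();
        for (Hit h : hits) {
            Hit seen = latest.get(h.key);
            if (seen == null || h.timestamp > seen.timestamp) {
                latest.put(h.key, h);
            }
        }
        return latest;
    }
}
```

Because the merged hits are scanned once and compared per key, duplicates across the big nightly index and the small incremental indexes collapse in a single pass.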
Dependencies...
Katta is independent of those technologies; it is Lucene, ZooKeeper
and Hadoop RPC (instead of RMI, HTTP or Apache Mina). We do support
loading index shards from a Hadoop file system, but you can also
load them from a mounted remote HDD, a NAS, or whatever you like.
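The idea that a shard location can be an HDFS path or just a mounted directory can be illustrated by dispatching on the URI scheme. This is a hypothetical sketch, not Katta's actual shard-loading API:

```java
import java.net.URI;

public class ShardSource {
    // Classifies a shard location: a plain path or file: URI is local
    // (local disk or mounted NAS), hdfs: is a Hadoop file system.
    public static String storageKind(String location) {
        URI uri = URI.create(location);
        String scheme = uri.getScheme();
        if (scheme == null || scheme.equals("file")) {
            return "local";
        } else if (scheme.equals("hdfs")) {
            return "hdfs";
        }
        return "unknown";
    }
}
```

A node could then pick the matching copy mechanism (Hadoop FileSystem client vs. plain file copy) for each shard it is told to serve.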
The obvious drawback being that dlucene
doesn't seem to be an active public project.
Mark needs to answer this, but dlucene is checked in to the Katta svn,
and I saw Mark checking in changes to dlucene. There was a discussion
between Mark and me about bringing dlucene and Katta together, and I
would really love to see that happen, but unfortunately we had a lot
of pressure from our customer to deliver something, so we had to focus
on other things. More developers getting involved would clearly help
here.. :-)
Thanks for the reply, Stefan. I'll certainly be taking a look through
the code for Katta, since no doubt there's a lot to learn in there.
Katta will be deployed into a production system of our customer in
less than 4 weeks, so we are working hard to iron out issues.
However, Katta has been running for 6 weeks in a 10-node test
environment under heavy load.
Stefan