Hi,
In terms of which project best fits my needs, my gut feeling is that
dlucene is pretty close. It supports incremental updates and doesn't
build in dependencies on systems like HDFS or Terracotta (I don't yet
understand all the implications of those systems, so I would rather
keep things simple if possible).
Upgrades...
The way we solve this with Katta is that we simply deploy a new small
index and use * in the client instead of a fixed index name.
Then once a night we merge all the small indexes together into one big
new index, since having many small indexes slows things down.
To solve the problem of duplicate documents, each document gets a
timestamp, and in the client we do a simple dedup based on a key,
always keeping the document with the latest timestamp.
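The client-side dedup described above can be sketched roughly as follows. This is a minimal illustration with hypothetical names (Hit, latestByKey), not Katta's actual classes: each hit carries an application-level key plus the document's timestamp, and for duplicate keys only the hit with the latest timestamp survives.

```java
import java.util.HashMap;
import java.util.Map;

public class Dedup {
    // A search hit with the application-level key and the document's
    // timestamp attached (hypothetical shape, for illustration only).
    public static class Hit {
        public final String key;
        public final long timestamp;
        public Hit(String key, long timestamp) {
            this.key = key;
            this.timestamp = timestamp;
        }
    }

    // Returns one Hit per key: the one with the greatest timestamp.
    public static Map<String, Hit> latestByKey(Iterable<Hit> hits) {
        Map<String, Hit> latest = new HashMap<>();
        for (Hit h : hits) {
            Hit seen = latest.get(h.key);
            if (seen == null || h.timestamp > seen.timestamp) {
                latest.put(h.key, h);
            }
        }
        return latest;
    }
}
```

Because the merged hits are scanned once and compared per key, duplicates across the big nightly index and the small incremental indexes collapse in a single pass.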
Dependencies...
Katta is independent of those technologies; it is Lucene, ZooKeeper
and Hadoop RPC (instead of RMI, HTTP or Apache Mina). We do support
loading index shards from a Hadoop file system, but you can also
load them from a mounted remote HDD, a NAS, or whatever you like.
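The idea that a shard location can be an HDFS path or just a mounted directory can be illustrated by dispatching on the URI scheme. This is a hypothetical sketch, not Katta's actual shard-loading API:

```java
import java.net.URI;

public class ShardSource {
    // Classifies a shard location: a plain path or file: URI is local
    // (local disk or mounted NAS), hdfs: is a Hadoop file system.
    public static String storageKind(String location) {
        URI uri = URI.create(location);
        String scheme = uri.getScheme();
        if (scheme == null || scheme.equals("file")) {
            return "local";
        } else if (scheme.equals("hdfs")) {
            return "hdfs";
        }
        return "unknown";
    }
}
```

A node could then pick the matching copy mechanism (Hadoop FileSystem client vs. plain file copy) for each shard it is told to serve.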
The obvious drawback being that dlucene
doesn't seem to be an active public project.
Mark needs to answer this, but dlucene is checked in to the Katta svn,
and I saw Mark checking in changes to dlucene. There was a discussion
between Mark and me about bringing dlucene and Katta together, and I
would really love to see that happen, but unfortunately we had a lot
of pressure from our customer to deliver something, so we had to focus
on other things. More developers getting involved would clearly help
here.. :-)
Thanks for the reply, Stefan. I'll certainly be taking a look through
the code for Katta, since no doubt there's a lot to learn in there.
Katta will be deployed into a production system of our customer in
less than 4 weeks, so we are working hard to iron out issues.
However, Katta has been running for 6 weeks in a 10-node test
environment under heavy load.
Stefan