According to the JIRA issue, it doesn't appear to be production-ready. The
author describes it as "a prototype... not ready for inclusion", the patch
was posted over a year ago, and there has been no further work or discussion
since. You might want to email the author, Mark Butler (
https://issues.apache.org/jira/secure/ViewProfile.jspa?name=butlermh),
directly.

This probably renders the performance data question moot.

In general, if you're using Hadoop and HDFS to serve content, updates have
to be performed by rewriting a whole index, so frequent updates are going to
be troublesome. Most likely, updates will need to be batched up, written out
as new index files, and then installed in place of the outdated ones. I
haven't read their design doc, though, so they may do something different;
but since HDFS doesn't allow modification of closed files, it will be hard
to be much cleverer than that.
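The batch-and-swap pattern above can be sketched in a few lines. This is an
illustration, not code from the Distributed Lucene patch: the function name
and directory layout are hypothetical, and I'm using local-filesystem
renames where a Hadoop deployment would use the equivalent HDFS rename to
install the new index without modifying closed files.

```python
import os

def publish_index(build_dir: str, serve_dir: str) -> None:
    """Install a freshly built index over the live serving path.

    Hypothetical sketch of the batch-update flow: the new index
    segments are written to a scratch directory, then renamed into
    place. Rename is cheap and effectively atomic, so searchers
    never see a half-written index.
    """
    old_dir = serve_dir + ".old"
    if os.path.exists(serve_dir):
        os.rename(serve_dir, old_dir)   # retire the outdated index
    os.rename(build_dir, serve_dir)     # install the new one
    # old_dir can be deleted once no open searcher still references it
```

The same shape applies on HDFS: build into a temporary path, rename over the
serving path, and clean up the retired copy after readers have moved on.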

You might want to go with option (1) and investigate something like
memcached to manage the interactive query load.
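For what I mean by that, here is a minimal cache-aside sketch. The class and
method names are illustrative only; in production you'd swap the plain dict
for a real memcached client, this just shows how the cache absorbs repeated
queries so the Lucene cluster only sees misses.

```python
class SearchCache:
    """Cache-aside wrapper for search results (sketch).

    A dict stands in for the memcached client to keep the example
    self-contained; a real deployment would issue get/set calls
    against memcached with a TTL instead.
    """

    def __init__(self, backend_search):
        self._search = backend_search   # the expensive index query call
        self._cache = {}                # stand-in for the memcached client

    def query(self, q: str):
        hit = self._cache.get(q)
        if hit is not None:
            return hit                  # served from cache, no index access
        result = self._search(q)        # miss: fall through to the index
        self._cache[q] = result         # populate for subsequent queries
        return result
```

When you swap in a rebuilt index, you'd flush (or let expire) the cached
entries so stale results don't outlive the old index.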

- Aaron

On Mon, Jun 1, 2009 at 9:54 AM, Tarandeep Singh <[email protected]> wrote:

> Hi All,
>
> I am trying to build a distributed system to build and serve lucene
> indexes.
> I came across the Distributed Lucene project-
> http://wiki.apache.org/hadoop/DistributedLucene
> https://issues.apache.org/jira/browse/HADOOP-3394
>
> and have a couple of questions. It will be really helpful if someone can
> provide some insights.
>
> 1) Is this code production ready?
> 2) Does anyone have performance data for this project?
> 3) It allows searches and updates/deletes to be performed at the same time.
> How well will the system perform if there are frequent updates? Will it
> handle the search and update load easily, or would it be better to rebuild
> or update the indexes on different machines and then deploy them back to
> the machines that are serving the indexes?
>
> Basically I am trying to choose between two approaches:
>
> 1) Use Hadoop to build and/or update Lucene indexes and then deploy them on
> a separate cluster that will take care of load balancing, fault tolerance,
> etc.
> There is a package in Hadoop contrib that does this, so I can use that
> code.
>
> 2) Use and/or modify the Distributed Lucene code.
>
> I am expecting daily updates to our index, so I am not sure if the Distributed
> Lucene code (which allows searches and updates on the same indexes) will be
> able to handle search and update load efficiently.
>
> Any suggestions ?
>
> Thanks,
> Tarandeep
>