Re: Using Lucene - Design Question

Peter W. Thu, 22 Feb 2007 14:40:26 -0800

Hello,

If you have experience using XML and making web service requests,
Solr is what you need. It's production-quality code and evolving
quickly. It has a remarkable amount of extra functionality.


For CORBA-type programmers, go with Terracotta. It looks to go a
step beyond sharing objects to sharing/clustering JVMs.

The RMI capabilities of RemoteSearchable within Lucene seem to
have been developed before Solr gained traction. I tried taking
some working RMI code and writing an inner class with Lucene but
it didn't feel robust.

Research on the mailing lists brings up older file copying
techniques based on synching the indexes with rsync. Probably
still in use, it looks to be an old-school solution better
addressed by Solr.

If you are mirroring your index in a database, there are some
combined Lucene/db update methods available:

1. MySQL replication - data on the master is continuously
updated and replicates behind the scenes to remote slaves.
Lucene/db indexing code on each remote slave runs as a cron job.

2. A Lucene indexing application on the remote boxes makes a network
call to the central database, getting/indexing new data and reloading
its own local ramdir.

For someone trying to get work done, use incremental updates to
one local index first. Then explore writing to multiple indexes and
reading them using MultiSearcher.
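
To make the multiple-index step concrete, here is a minimal Python sketch of the general idea: query each index, then merge the hits by score. It is not Lucene code and does not show how MultiSearcher is implemented. The function names, the toy partitions and the scores are all made up, and it assumes scores from different partitions are directly comparable.

```python
import heapq
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (score, document id)


def search_partition(index: Dict[str, float], limit: int) -> List[Hit]:
    """Return the top `limit` hits of one partition, best score first."""
    hits = [(score, doc_id) for doc_id, score in index.items()]
    return heapq.nlargest(limit, hits)


def search_all(partitions: List[Dict[str, float]], limit: int) -> List[Hit]:
    """Query every partition, then merge the per-partition hits by score."""
    merged: List[Hit] = []
    for index in partitions:
        merged.extend(search_partition(index, limit))
    return heapq.nlargest(limit, merged)


# Two toy partitions mapping document id -> score for one fixed query.
partition_a = {"a1": 0.91, "a2": 0.40, "a3": 0.15}
partition_b = {"b1": 0.75, "b2": 0.60}
top_hits = search_all([partition_a, partition_b], limit=3)
print(top_hits)
```

Each partition only has to return its own top `limit` hits, so the merge step stays small no matter how large the individual indexes are.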

Afterward, use HTTP-based updates/requests with Solr to scale out.

Hope that helps.

Peter W.


On Feb 20, 2007, at 5:29 PM, orion wrote:

If you'd like to try using Terracotta, we (Terracotta) would be glad to help
you out.  If you want more info, you can email me directly (orion at
terracotta.org) or you can use our web forums (http://forums.terracotta.org)
or our user mailing list (http://lists.terracotta.org/)

Cheers,
Orion



shai deljo wrote:
I considered getting Lucene in Action but figured I'll wait for the
DVD to come out ;).
Seriously though, they write about RemoteSearchable and use RMI. Is
this the recommended solution? Does it scale well?
Thanks

On 2/20/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
Well, there is also a Remote cousin there. That will let you distribute
your indices over N servers (sounds like you'll need multiple).  You
should really take a stroll through Lucene's javadoc, it's incredibly
nice now in winter time. Or ... clears throat.... you could get a book
;)

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message ----
From: shai deljo <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, February 20, 2007 2:05:25 PM
Subject: Re: Using Lucene - Design Question

Hi,
Thanks for the reply.
* Regarding hardware I'll use something similar to: Core 2 Duo -
2.66GHz, 2x300 GB disk drives, 4 GB RAM running on one of the Linux
distributions.
* Regarding response time I'm looking to be ~300 milliseconds for at
least 80% of queries and ~500 milliseconds for 95% of queries.
* Will MultiSearcher (and its parallel cosine :) ) allow me to search
indices across multiple servers, or is the assumption that all
indices are on 1 server?
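
One way to check measured query timings against the two targets above is sketched below in Python. The helper names and the sample timings are invented for illustration. The thresholds are the ~300 ms / 80% and ~500 ms / 95% figures from the list, treated here as hard limits.

```python
from typing import List


def fraction_within(latencies_ms: List[float], limit_ms: float) -> float:
    """Fraction of measured queries that finished within limit_ms."""
    if not latencies_ms:
        raise ValueError("no measurements")
    return sum(1 for t in latencies_ms if t <= limit_ms) / len(latencies_ms)


def meets_targets(latencies_ms: List[float]) -> bool:
    """True if >= 80% of queries took <= 300 ms and >= 95% took <= 500 ms."""
    return (fraction_within(latencies_ms, 300.0) >= 0.80
            and fraction_within(latencies_ms, 500.0) >= 0.95)


# Made-up sample: 20 query timings in milliseconds.
sample = [120.0] * 16 + [350.0] * 3 + [900.0]
print(meets_targets(sample))
```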
Thanks


On 2/20/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
Hi Shi,

Nobody will be able to give you the precise answer, obviously.  The
best way is to try.
You didn't say what response time is desirable nor what kind of
hardware you will be using.
I wouldn't bother with the Berkeley DB-backed Lucene index for now,
just use the regular one (maybe use non-compound format).
If you need to partition your index, MultiSearcher will help you search
all your indices, and its Parallel cousin will let you parallelize those
searches.
It sounds like rsync will work, but you'll have to make sure that the
segments file gets rsynced last.
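
The ordering constraint can be sketched as follows, in Python rather than rsync so that it is self-contained. The point is only that everything else is copied before any file whose name starts with "segments"; presumably this is so a searcher on the target never sees a segments file that refers to data files that have not arrived yet. The function names and the demo file names are illustrative assumptions, not a description of a real index directory. With rsync itself this would likely be two passes, the first one excluding the segments file.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def copy_order(file_names: List[str]) -> List[str]:
    """Order index files so that names starting with 'segments' come last."""
    data_files = sorted(n for n in file_names if not n.startswith("segments"))
    segment_files = sorted(n for n in file_names if n.startswith("segments"))
    return data_files + segment_files


def sync_index(source: Path, target: Path) -> List[str]:
    """Copy every file in `source` to `target`, segments file(s) last."""
    target.mkdir(parents=True, exist_ok=True)
    order = copy_order([p.name for p in source.iterdir() if p.is_file()])
    for name in order:
        shutil.copy2(source / name, target / name)
    return order


# Demo with made-up file names in temporary directories.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "master"
    src.mkdir()
    for name in ["segments", "_1.cfs", "_2.cfs", "deletable"]:
        (src / name).write_text("placeholder")
    copied = sync_index(src, Path(tmp) / "replica")
    print(copied)
```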
Otis

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message ----
From: shai deljo <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, February 20, 2007 5:51:13 AM
Subject: Using Lucene - Design Question

Hi,
I have no experience with Lucene and I'm trying to collect some
information in order to determine what solution is best for me.
I need to index ~50M documents (starting with 10M), the size of each
document is ~2k-~5k and I'll index a couple of fields per document. I
expect ~20 queries per second and each query is ~4 terms. Update rate
- not sure what the best and/or possible strategy is based on performance,
i.e. incremental indexing vs. pushing a full index, but as far as the
product is concerned most data can be updated daily, the head (let's
say 20%) needs hourly (or at least on the order of hours) update.
I also need to be able to override the scoring/ranking and inject my
own logic and of course my main concern is response time, especially
since I have additional computation on the hits before returning the
results.
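
For a rough sense of scale, the figures above can be turned into a raw text volume. This Python sketch assumes "2k-5k" means kilobytes per document and uses decimal units. It says nothing about the size of the resulting index, which depends on which fields are indexed and stored.

```python
def raw_corpus_gb(doc_count: int, doc_size_kb: float) -> float:
    """Raw text volume in gigabytes, using 1 GB = 1,000,000 KB."""
    return doc_count * doc_size_kb / 1_000_000


for docs in (10_000_000, 50_000_000):
    low = raw_corpus_gb(docs, 2)
    high = raw_corpus_gb(docs, 5)
    print(f"{docs:,} documents: {low:.0f} to {high:.0f} GB of raw text")
```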

BTW, for the additional ranking/computation I will need to retrieve
values that are mapped by a term-field key, i.e. I can't know the key
until I have the result and the query in my hands. I figured I would
use Oracle Berkeley DB Java Edition in order to keep the calls as much
as possible in memory -> any advice on this as well?

For these requirements, do I need to worry about partitioning the
index? If I do partition it, is there a solution to merge the results
back or do I need to do it on my own (does Solr do it for me and if it
does, can I override the scoring there)?
As far as serving multiple users, will a simple rsync of the index
between multiple nodes running the same index (I am not that sensitive
to data integrity) work or do I need to look at something like
Terracotta?

In short, I am looking for the simplest solution.

Thanks in advance.
Shi
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--
View this message in context: http://www.nabble.com/Using-Lucene---Design-Question-tf3259160.html#a9073976
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

