Simon,
I'm picking this thread up from the web archive, where there was some talk of replication of indexes, so this message may not be threaded correctly. I've just completed a custom FSDirectory implementation that is designed to work in a cluster with replication.

The anatomy of this cluster is a shared database (MySQL or Oracle) and stateless nodes with local disk storage. The index load is not that high (compared with big Nutch installations), but not tiny either: maybe 1TB of raw content, with an index of 10GB (a guess).

I would have used rsync, but ideally I wanted it to work with no sysadmin setup (pure Java install). I looked at, and really liked, NDFS, but decided it was too much admin overhead to set up. The deployers like to do a Maven build deploy and a tomcat/catalina.sh start to get up and running (easy life!).

Indexing is performed using a queue (persisted in the DB), with a distributed lock manager allowing one of the nodes in the cluster to take responsibility for indexing and to notify all other nodes when done (they then reload the index). This happens every few minutes in production.
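The lock-manager idea might be sketched roughly like this; the class, field, and method names below are my invention for illustration, not the actual SearchIndexBuilderWorkerImpl API. In the real system the same free-or-expired test would sit in the WHERE clause of an UPDATE on a lock row, so only one node's update succeeds.

```java
import java.time.Instant;

// Hypothetical sketch of a DB-backed lock decision: one lock row, and a
// node may take the lock if it is free or the previous holder's lease has
// expired (so a crashed node cannot hold the lock forever).
public class ClusterLock {
    private String owner;          // node id currently holding the lock, or null
    private Instant leaseExpires;  // when the current lease lapses

    // Returns true if nodeId now holds the lock for leaseMillis.
    public synchronized boolean tryAcquire(String nodeId, long leaseMillis, Instant now) {
        boolean free = owner == null || now.isAfter(leaseExpires);
        if (free || owner.equals(nodeId)) {
            owner = nodeId;
            leaseExpires = now.plusMillis(leaseMillis);
            return true;
        }
        return false;
    }

    public synchronized void release(String nodeId) {
        if (nodeId.equals(owner)) {
            owner = null;
        }
    }
}
```

The lease expiry is the important part: without it, a node that dies mid-index would block the whole cluster from indexing.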

FSDirectory is efficient and fast, and I wanted that in the cluster. I looked at JDBCDirectory (from the Compass framework) but found that even with a non-compound index, the DB overhead was just too great (on average 1/10 the performance on MySQL compared to local disk; Oracle might be better), the problem mainly being seeks into BLOBs. I guess the Berkeley DB Directory is going to be similar in some ways, except the seeks may be more efficient.

Eventually I borrowed some concepts from Nutch. The index writer writes a new segment with FSDirectory, then merges it into the current segment; that segment is compressed and checksummed (MD5) and sent to the database. Current segments are rotated when they get over 2MB. When a node receives an index reload event, it syncs its local segments with the DB and loads them with a MultiReader using FSDirectory.
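The compress-and-checksum step could look something like this; this is a sketch using only standard Java, with illustrative names rather than the actual JDBCClusterIndexStore code:

```java
import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch of preparing a segment for the DB: gzip the segment
// bytes and record an MD5 checksum, so nodes can later verify that their
// local copies match what the database holds.
public class SegmentPacker {

    // Compress the raw segment bytes before shipping them as a BLOB.
    public static byte[] compress(byte[] segment) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(segment);
        }
        return bos.toByteArray();
    }

    // MD5 checksum as lowercase hex, stored alongside the segment row.
    public static String md5Hex(byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

Shipping the checksum with the blob is what makes the later "validate local copies" step cheap: a node only has to hash its local file and compare strings.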

The sweet-spot features are:

- Performance is almost the same as FSDirectory, except that the end of the IndexWriter operation and the start of the IndexReader operation have slightly more overhead.

- When nodes are added to the cluster, they can validate their local segment copies and bring them up to date against the cluster.

- There is a real-time backup of the index.

- The segments are validated prior to being sent to the DB.

- You could easily use a SAN/NAS in place of the DB to ship the segments.
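The "bring them up to date" step amounts to comparing the checksums the DB holds with what is on local disk and fetching only what is missing or stale. A sketch of that decision (names are illustrative, not the actual ClusterFilesystem API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the node sync step: given the segment checksums
// stored in the DB and the checksums of the local copies, return the list
// of segments the node needs to download.
public class SegmentSync {

    // remote: segment name -> MD5 stored in the DB
    // local:  segment name -> MD5 of the local copy on disk
    public static List<String> segmentsToFetch(Map<String, String> remote,
                                               Map<String, String> local) {
        List<String> toFetch = new ArrayList<>();
        for (Map.Entry<String, String> e : remote.entrySet()) {
            String localSum = local.get(e.getKey());
            if (localSum == null || !localSum.equals(e.getValue())) {
                toFetch.add(e.getKey()); // missing locally, or stale/corrupt
            }
        }
        return toFetch;
    }
}
```

A freshly added node has an empty local map, so this same code path handles both "new node joins" and "existing node recovers from a bad copy".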


I haven't done real heavy production tests, but I have had it running, indexing the contents of my hard disk flat out, for over 48 hours with 200 2MB segments in the DB.

There is probably some housekeeping (e.g. merging) that should be done, and, not being a Lucene expert, I am bound to have missed something.

If anyone spots anything, please let me know :)


Ian


If you're interested, you can find the code at
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/

The Distributed Lock manager is at
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/component/service/impl/SearchIndexBuilderWorkerImpl.java

The Indexer is at
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/component/dao/impl/SearchIndexBuilderWorkerDaoImpl.java

and the JDBC Index shipper is at
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/ClusterFilesystem.java
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/impl/ClusterFSIndexStorage.java
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/impl/JDBCClusterIndexStore.java
