Simon,
I'm picking this thread up from the web archive, where there was some
talk of replication of indexes, so this message may not be threaded
correctly. I've just completed a custom FSDirectory implementation that
is designed to work in a cluster with replication.
The anatomy of this cluster is a shared database (MySQL or Oracle) and
stateless nodes with local disk storage. The index load is not that high
(compared with big Nutch installations), but not tiny either: maybe
1TB of raw content, with an index of around 10GB (a guess).
I would have used rsync, but ideally I wanted it to work with no sysadmin
setup (pure Java install). I looked at, and really liked, NDFS, but
decided it was too much admin overhead to set up. The deployers like to
do a maven build/deploy then tomcat/catalina.sh start to get up and
running (easy life!)
Indexing is performed using a queue (persisted in the DB), with a
distributed lock manager allowing one node in the cluster to take
responsibility for indexing and notify all the other nodes when done
(they then reload the index). This happens every few minutes in production.
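To make the "one node takes responsibility" protocol concrete, here is a minimal runnable sketch. The class and method names are hypothetical, and the real lock lives in a database row; an in-memory AtomicReference stands in for that row here so the sketch is self-contained.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the indexing-lock protocol. In the real cluster
// the lock state is a DB row shared by all nodes; an AtomicReference
// stands in for that row so this example is runnable on its own.
public class IndexLockDemo {
    private final AtomicReference<String> owner = new AtomicReference<>(null);

    // A node tries to claim the indexing lock; only one claim can succeed,
    // so exactly one node in the cluster performs the indexing run.
    public boolean tryAcquire(String nodeId) {
        return owner.compareAndSet(null, nodeId);
    }

    // The winning node releases the lock when indexing is done; the
    // cluster is then notified so the other nodes reload the index.
    public boolean release(String nodeId) {
        return owner.compareAndSet(nodeId, null);
    }
}
```

With a real DB the compare-and-set would be a conditional UPDATE whose affected-row count tells the node whether it won the lock.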
FSDirectory is efficient and fast, and I wanted that in the cluster. I
looked at JDBCDirectory (from the Compass framework) but found that even
with a non-compound index the DB overhead was just too great (on
average 1/10 of local-disk performance on MySQL; Oracle might be
better), the problem mainly being seeks into BLOBs. I guess the
Berkeley DB Directory is going to be similar in some ways, except the
seeks may be more efficient.
Eventually I borrowed some concepts from Nutch. The index writer writes
a new segment with FSDirectory, then merges it into the current segment;
that segment is compressed and checksummed (MD5) and sent to the
database. Current segments are rotated when they grow over 2MB. When a
node receives an index reload event, it syncs its local segments with
the DB and loads them with a MultiReader using FSDirectory.
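The compress-and-checksum step before shipping can be sketched with plain JDK classes. The class name is hypothetical; the real code lives in JDBCClusterIndexStore, but the idea is the same: GZIP the segment bytes and record an MD5 so nodes can later validate their copies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch of the "compress + checksum before shipping" step.
public class SegmentShipper {

    // GZIP-compress a segment's bytes before sending the blob to the DB.
    public static byte[] compress(byte[] raw) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(raw);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reverse of compress(), used when a node pulls a segment back down.
    public static byte[] decompress(byte[] packed) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) {
                bos.write(buf, 0, n);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // MD5 checksum, stored alongside the blob so nodes can validate copies.
    public static String md5Hex(byte[] raw) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(raw);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A shipped segment would then be the pair (compressed bytes, MD5 of the original bytes).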
The sweet-spot features are:
- Performance is almost the same as FSDirectory, except that the end of
the IndexWriter operation and the start of the IndexReader operation
have slightly more overhead.
- When nodes are added to the cluster, they can validate their local
segment copies and bring them up to date against the cluster.
- There is a real-time backup of the index.
- The segments are validated prior to being sent to the DB.
- You could easily use a SAN/NAS in place of the DB to ship the segments.
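The validate-and-sync step above boils down to comparing checksums. A minimal sketch, with hypothetical names and a Map standing in for the checksum metadata the DB holds:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the segment sync step: compare local segment
// checksums against the checksums recorded in the DB and report which
// segments must be (re)fetched. Names are illustrative, not the real API.
public class SegmentSync {

    // dbChecksums: segment name -> MD5 recorded when the segment was shipped.
    // localChecksums: segment name -> MD5 of the local copy (absent if missing).
    public static List<String> staleSegments(Map<String, String> dbChecksums,
                                             Map<String, String> localChecksums) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, String> e : dbChecksums.entrySet()) {
            String local = localChecksums.get(e.getKey());
            if (local == null || !local.equals(e.getValue())) {
                stale.add(e.getKey()); // missing or corrupt: fetch from DB
            }
        }
        return stale;
    }
}
```

A joining node would run this over all segments, pull down whatever is stale, and then open the set with a MultiReader.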
-
I haven't done real heavy production tests, but I have had it running,
indexing the contents of my hard disk flat out, for over 48 hours with
200 2MB segments in the DB.
There is probably some housekeeping (e.g. merging) that should be done,
and, not being a Lucene expert, I am bound to have missed something.
If anyone spots anything, please let me know :)
Ian
If you're interested, you can find the code at
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/
The Distributed Lock manager is at
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/component/service/impl/SearchIndexBuilderWorkerImpl.java
The Indexer is at
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/component/dao/impl/SearchIndexBuilderWorkerDaoImpl.java
and the JDBC Index shipper is at
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/ClusterFilesystem.java
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/impl/ClusterFSIndexStorage.java
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/impl/JDBCClusterIndexStore.java