Hi List, As a bit of an experiment I'm redoing some of our indexing and searching code to try to make it easier to manage and distributed. The system has to modify its indexes frequently, sometimes in huge batches, and the documents in the indexes are frequently modified (deleted, modified and readded). Just for scale we're wanting the system to be capable of searching a terabyte or so of data.
Currently we have a bunch of index machines indexing to a local file system, every hour or so they merge to a group of indexes stored on NFS or similar common filesystem, and the search nodes retrieve the new indexes and search on those. The merge can take about as long as it took to originally index the files, since it has to re-index the "contents" field since that field isn't stored. After reading this thread: http://www.gossamer-threads.com/lists/lucene/java-user/13803#13803 There were several good suggestions but I'm curious, is there a generally accepted best practice of distributing lucene? The cronjob/link solution which is quite clean, doesn't work well in a windows environment. While it's my favorite, no dice... Rats. So I decided to experiment with a couple different ideas, and I have some questions. 1) Indexing and Searching Directly from HDFS Indexing to HDFS is possible with a patch if we don't use CFS. While not ideal performance-wise, it's reliable and takes care of data redundancy, component failure and means that I can have cheap small drives instead of a large expensive NAS. It's also quite simple to implement (see Nutch's indexer.FsDirectory for the Directory implmentation) So I would have several indexes (ie 16) and the same number of indexers, and a searcher for each index (possibly in the same process) that searches each one directly from HDFS. One problem I'm having is an occasional filenotfound exception. (Probably locking related) org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open filename /index/_3.f0 at org.apache.hadoop.dfs.NameNode.open(NameNode.java:178) at sun.reflect.GeneratedMethodAccessor41.invoke (Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke( DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:585) at org.apache.hadoop.ipc.RPC$Server.call (RPC.java:332) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:468) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:245) It comes out of the Searcher when I try to do a search while things are being indexed. I'd be interested to know what exactly is happening when this exception is thrown, maybe I can design around it. (Do synchronization at the appropriate times or similar) 2) Index Locally, Search in HDFS I haven't implemented this but I was thinking something along the lines of merging every little while and having the searchers refresh after that's finished. I still have a problem with the merge taking a fairly long time and if a node fails we lose the documents stored locally in that index. 3) Index HDFS, Search Locally The system indexes to HDFS and the searchers ask the indexers to pause while it retrieves the indexes from the store. It's then searched locally and the Indexers continue trucking along. This, in my head, seems to work alright, at least until the indexes get very large and copying them is prohibitive. (Is there a java rsync?) I'll have to investigate how much a performance hit indexing to the network actually is. If anyone has any numbers I would be interested in seeing them. 4) Map/Reduce I don't know a lot about this and haven't been able to find much on applying map/reduce to lucene indexing. Well, except for the Nutch source code, which is rather difficult to sort through for an overview. So if anyone has a snippet or a good overview I could look over I would be grateful. Even if you can just point at a critical part in Nutch that would also be quite helpful. 5) Anything else I would appreciate any insight anyone has on distributing indexes, either on list or off. Many Thanks, Chris PS. Sorry if this got double posted. Didn't seem to get through first time.