Hi List,

As a bit of an experiment I'm redoing some of our indexing and searching
code to try to make it easier to manage and distribute. The system has to
modify its indexes frequently, sometimes in huge batches, and the documents
in the indexes are frequently updated (deleted and re-added). For a sense
of scale, we want the system to be capable of searching a terabyte or so
of data.

Currently we have a bunch of index machines indexing to a local file
system. Every hour or so they merge into a group of indexes stored on NFS
or a similar shared filesystem, and the search nodes retrieve the new
indexes and search on those. The merge can take about as long as the
original indexing did, because the "contents" field isn't stored and so
has to be re-indexed during the merge.

After reading this thread:
http://www.gossamer-threads.com/lists/lucene/java-user/13803#13803
there were several good suggestions, but I'm curious: is there a generally
accepted best practice for distributing Lucene? The cron-job/link
solution, which is quite clean, doesn't work well in a Windows
environment. While it's my favorite, no dice... Rats.

So I decided to experiment with a couple of different ideas, and I have
some questions.

1) Indexing and Searching Directly from HDFS

Indexing to HDFS is possible with a patch, as long as we don't use the
compound file format (CFS). While not ideal performance-wise, it's
reliable, takes care of data redundancy and component failure, and means I
can use cheap small drives instead of a large, expensive NAS. It's also
quite simple to implement (see Nutch's indexer.FsDirectory for the
Directory implementation).
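To be concrete, opening a searcher directly against an HDFS-resident index
looks roughly like the sketch below. It assumes the Nutch 0.8-era
FsDirectory constructor (FileSystem, Path, create flag, Configuration), so
check your version; the /index path is just the one from the stack trace
further down:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.search.IndexSearcher;
import org.apache.nutch.indexer.FsDirectory;

public class HdfsSearch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // the DFS named in the config

        // false = open an existing index rather than create a new one
        FsDirectory dir = new FsDirectory(fs, new Path("/index"), false, conf);
        IndexSearcher searcher = new IndexSearcher(dir);
        // ... run queries here ...
        searcher.close();
    }
}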

So I would have several indexes (e.g. 16) and the same number of
indexers, plus a searcher for each index (possibly in the same process)
that searches it directly from HDFS. One problem I'm having is an
occasional FileNotFoundException, probably locking-related:

org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open
filename /index/_3.f0
       at org.apache.hadoop.dfs.NameNode.open(NameNode.java:178)
       at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:585)
       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:332)
       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:468)
       at org.apache.hadoop.ipc.Client$Connection.run(Client.java:245)

It comes out of the Searcher when I try to search while documents are
being indexed. I'd be interested to know exactly what is happening when
this exception is thrown; maybe I can design around it (do synchronization
at the appropriate times, or similar).
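To make the "synchronization at the appropriate times" idea concrete, this
is the kind of guard I have in mind. It's only a sketch of my own, it only
works if the indexer and searcher share a process, and the class and
method names are made up:

import java.util.concurrent.locks.ReentrantReadWriteLock;

public class GuardedIndex {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // searches can run concurrently with each other
    public void withSearch(Runnable search) {
        lock.readLock().lock();
        try { search.run(); } finally { lock.readLock().unlock(); }
    }

    // index writes/merges exclude searches, so a reader never sees a
    // half-merged set of segment files
    public void withIndexing(Runnable indexing) {
        lock.writeLock().lock();
        try { indexing.run(); } finally { lock.writeLock().unlock(); }
    }
}

Across processes the same idea would need a shared lock (a lock file in
HDFS, say) instead of a JVM lock.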

2) Index Locally, Search in HDFS

I haven't implemented this, but I was thinking of something along the
lines of merging every little while and having the searchers refresh once
the merge is finished. I still have the problem of the merge taking a
fairly long time, and if a node fails we lose the documents that exist
only in that node's local index.
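The merge step itself would be something like this sketch (the paths are
made up, and it assumes the no-CFS patch from option 1 so that the
IndexWriter can write to Nutch's FsDirectory):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.nutch.indexer.FsDirectory;

public class PeriodicMerge {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // freshly built index on this node's local disk
        Directory local = FSDirectory.getDirectory("/data/local-index", false);
        // long-lived merged index on HDFS
        Directory shared = new FsDirectory(fs, new Path("/index/merged"), false, conf);

        // false = append to the existing shared index
        IndexWriter writer = new IndexWriter(shared, new StandardAnalyzer(), false);
        writer.addIndexes(new Directory[] { local }); // also optimizes the target
        writer.close();
    }
}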

3) Index in HDFS, Search Locally

The system indexes to HDFS, and a searcher asks the indexers to pause
while it retrieves the indexes from the store. The indexes are then
searched locally while the indexers continue trucking along. This, in my
head, seems to work all right, at least until the indexes get very large
and copying them becomes prohibitive. (Is there a Java rsync?) I'll have
to investigate how much of a performance hit indexing over the network
actually is. If anyone has any numbers I would be interested in seeing
them.
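The retrieval step is just a bulk copy, something like the sketch below
(the paths are hypothetical, and unlike rsync it copies the whole index
rather than just the changed files):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FetchIndex {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // pull one shard's index out of HDFS onto the local disk
        fs.copyToLocalFile(new Path("/index/shard-0"),
                           new Path("/data/local/shard-0"));
    }
}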

4) Map/Reduce

I don't know a lot about this, and I haven't been able to find much on
applying map/reduce to Lucene indexing, except for the Nutch source code,
which is rather difficult to sort through for an overview. So if anyone
has a snippet or a good overview I could look over, I would be grateful.
Even a pointer at the critical part in Nutch would be quite helpful.
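For what it's worth, the shape I imagine is: the map phase routes each
document to a shard, and each reduce call builds the Lucene index for one
shard (via FsDirectory as above, say). Here's a fragment of the map side,
purely a guess on my part; the class is made up and the mapred API details
vary between Hadoop versions:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ShardMapper extends MapReduceBase implements Mapper {
    private static final int SHARDS = 16;

    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter)
            throws IOException {
        // route each document to a shard by hashing its id; the reducer
        // for a shard would then add its documents to that shard's index
        int shard = (key.hashCode() & Integer.MAX_VALUE) % SHARDS;
        output.collect(new IntWritable(shard), value);
    }
}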

5) Anything else

I would appreciate any insight anyone has on distributing indexes, either on
list or off.

Many Thanks,
Chris

PS. Sorry if this got double-posted; it didn't seem to get through the
first time.
