Dennis,
Have you tried using o.a.lucene.store.RAMDirectory instead of tmpfs?
Intuitively, I would expect RAMDirectory to be faster. Is it? Do you
have any benchmarks comparing the two?
Dennis Kubes wrote:
Trey Spiva wrote:
According to the Hadoop tutorial
(http://wiki.apache.org/nutch/NutchHadoopTutorial) on the wiki,
"you don't want to search using DFS, you want to search using local
filesystems. Once the index has been created on the DFS you can
use the hadoop copyToLocal command to move it to the local file
system as such" ... "Understand that at this point we are not using
the DFS or MapReduce to do the searching, all of it is on a local
machine".
So my understanding is that Hadoop is only good for batch index
building, and is not suited to incremental index building or
search. Is this true?
That is correct. Use the DFS for batch processing and MapReduce jobs,
and local servers (disks) for serving indexes. Even better, put the
local indexes (not segments, just indexes) in RAM.
The reason I am asking is that when I read the ACM article by
Mike Cafarella and Doug Cutting, it sounded to me like the concern
was to make the index structures fit in primary memory, not the
entire crawled database. Did I misunderstand the ACM article?
No, what they are saying is that the more pages per index per machine
on hard disk, the slower the search. Keeping the main indexes, but not
the segments which hold raw page content, in RAM can speed up search
significantly.
One way to do this if you are running on Linux is to create a tmpfs
(a RAM-backed filesystem), mount it at the index location, and copy
the indexes into it. Then your index acts normally to the application
but is essentially served from RAM. This is how we serve the Nutch
Lucene indexes on our web search engine (www.visvo.com), which is
~100M pages. Below is how you can achieve this, assuming your indexes
are in /path/to/indexes:
mv /path/to/indexes /path/to/indexes.dist
mkdir /path/to/indexes
cd /path/to
mount -t tmpfs -o size=2684354560 none /path/to/indexes
rsync --progress -aptv indexes.dist/* indexes/
chown -R user:group indexes
This would of course be limited by the amount of RAM you have on the
machine. But with this approach most searches are sub-second.
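As a rough illustration of that limit, the size in the mount command
above, 2684354560 bytes, is 2.5 GiB, so the copied indexes have to fit
in that. A minimal sketch of a pre-flight check follows. The function
name fits_in_tmpfs is hypothetical and not part of Nutch or Hadoop; it
only compares the on-disk size reported by du with the tmpfs size you
plan to mount, and it does not check how much RAM is actually free.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not a Nutch tool):
# succeeds only if the index directory fits within the planned tmpfs size.
# Usage: fits_in_tmpfs /path/to/indexes.dist 2684354560
fits_in_tmpfs() {
    index_dir=$1      # directory holding the on-disk indexes
    tmpfs_bytes=$2    # value you intend to pass to mount -o size=...

    # du -sk reports the directory's disk usage in 1024-byte units.
    index_kb=$(du -sk "$index_dir" | awk '{print $1}')
    tmpfs_kb=$((tmpfs_bytes / 1024))

    [ "$index_kb" -le "$tmpfs_kb" ]
}
```

It could be run before the mount step, for example:
fits_in_tmpfs /path/to/indexes.dist 2684354560 || echo "indexes too large"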
Dennis Kubes