Re: Hadoop distributed search.

Dennis Kubes Fri, 07 Dec 2007 13:35:35 -0800

I haven't done those benchmarks, but if I get some time I will.

The way we run the processes at Visvo, and I will open source the framework soon, is we have Python scripts which run a continuous job stream of indexing and moving shards of x million pages to search servers. Each search server has the linkdb, segments, and indexes for only that shard. We use a master crawldb and linkdb but have processes which will create shard linkdbs for only the content in the shard segments.

Each shard is pushed out to its own search server. The indexes are about 2G in size and the segments about 20G. So the tmpfs was a quick hack that turned out to work very well. We could push the shard out and then run a simple script to mount the indexes in memory. We could also unmount them back to disk if needed. All of this is transparent to the SearchServer, which thinks it is looking at a local file system.
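
A minimal sketch of what the "unmount back to disk" step could look like. This is not the actual Visvo script. It assumes the layout used by the mount commands quoted further down in this thread: the on-disk copy was kept as indexes.dist next to the tmpfs mount point indexes. The function name index_to_disk is made up for illustration. It needs root, and the search server should be stopped first, since umount fails while files on the mount are still open.

```shell
#!/bin/sh
# Sketch only: move a tmpfs-backed index directory back to plain disk.
# Assumed layout (hypothetical, taken from the mount commands in this thread):
#   $base/indexes       tmpfs mount point holding the RAM copy
#   $base/indexes.dist  original on-disk copy

index_to_disk() {
    base=$1                                   # e.g. /path/to
    umount "$base/indexes" || return 1        # drops the RAM copy
    rmdir "$base/indexes" || return 1         # remove the now-empty mount point
    mv "$base/indexes.dist" "$base/indexes"   # put the on-disk copy back in place
}

# Usage (as root, with the search server stopped):
#   index_to_disk /path/to
```

Because the RAM copy is simply discarded, this only works if nothing wrote to the index while it was mounted in memory.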

I do think it would be useful for the SearchServer to have an option for a RAMDirectory. I don't know if it currently does. This goes into discussions of creating a master/slave type framework for monitoring and maintaining shards though.


Dennis Kubes

Enis Soztutar wrote:

Dennis,
Have you tried using o.a.lucene.store.RAMDirectory instead of tmpfs? Intuitively I believe RAMDirectory should be faster, shouldn't it? Do you have any benchmarks for the two?
Dennis Kubes wrote:
Trey Spiva wrote:
According to a Hadoop tutorial (http://wiki.apache.org/nutch/NutchHadoopTutorial) on the wiki,
"you don't want to search using DFS, you want to search using local filesystems. Once the index has been created on the DFS you can use the hadoop copyToLocal command to move it to the local file system as such" ... "Understand that at this point we are not using the DFS or MapReduce to do the searching, all of it is on a local machine".
So my understanding is that Hadoop is only good for batch index building, and is not proper for incremental index building and search. Is this true?
That is correct. DFS for batch processing and MapReduce jobs. Local servers (disks) for serving indexes. Even better, put local indexes (not segments, just indexes) in RAM.
The reason I am asking is that when I read the ACM article by Mike Cafarella and Doug Cutting, to me it sounded like the concern was to make the index structures fit in primary memory, not the entire crawled database. Did I misunderstand the ACM article?
No, what they are saying is the more pages per index per machine on hard disk, the slower the search. Keeping the main indexes, but not the segments which hold raw page content, in RAM can speed up search significantly.
One way to do this if you are running on Linux is to create a tmpfs (which is RAM) and mount it where the index directory lives. Then your index acts normally to the application but is essentially served from RAM. This is how we serve the Nutch Lucene indexes on our web search engine (www.visvo.com), which is ~100M pages. Below is how you can achieve this, assuming your indexes are in /path/to/indexes:
mv /path/to/indexes /path/to/indexes.dist
mkdir /path/to/indexes
cd /path/to
mount -t tmpfs -o size=2684354560 none /path/to/indexes
rsync --progress -aptv indexes.dist/* indexes/
chown -R user:group indexes
This would of course be limited by the amount of RAM you have on the machine. But with this approach most searches are sub-second.
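
Since the limit is RAM, it can help to compare the index size with free memory before mounting. The following is a rough sketch, not part of the original setup: the helper names index_kb, mem_available_kb and fits_kb are invented here, and it assumes a Linux kernel recent enough to report MemAvailable in /proc/meminfo.

```shell
#!/bin/sh
# Sketch only: check whether an index directory would fit in available RAM.
# All function names are hypothetical helpers for illustration.

# Size of a directory tree in KiB.
index_kb() {
    du -sk "$1" | awk '{print $1}'
}

# Kernel's estimate of memory available without swapping, in KiB.
# Assumes /proc/meminfo has a MemAvailable line.
mem_available_kb() {
    awk '/^MemAvailable:/ {print $2}' /proc/meminfo
}

# Succeeds if the needed size is strictly below what is available.
fits_kb() {
    [ "$1" -lt "$2" ]
}

# Usage:
#   if fits_kb "$(index_kb /path/to/indexes.dist)" "$(mem_available_kb)"; then
#       echo "index should fit in RAM"
#   fi
```

The size= option passed to mount should be at least the index size; tmpfs also accepts suffixes such as size=2560m in place of a raw byte count.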
Dennis Kubes
