Trey Spiva wrote:
Thanks for your help.
On Dec 4, 2007, at 10:08 AM, Jasper Kamperman wrote:
According to a hadoop tutorial
(http://wiki.apache.org/nutch/NutchHadoopTutorial) on wiki,
"you don't want to search using DFS, you want to search using local
filesystems. Once the index has been created on the DFS you can
use the hadoop copyToLocal command to move it to the local file
system as such" ... "Understand that at this point we are not using
the DFS or MapReduce to do the searching, all of it is on a local
machine".
So my understanding is that Hadoop is only good for batch index
building, and is not suited for incremental index building or for
search. Is this true?
That is correct. DFS is for batch processing and MapReduce jobs; local
servers (disks) are for serving indexes. Better still, put the local
indexes (not the segments, just the indexes) in RAM.
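For example, something along these lines pulls a finished crawl off
DFS onto local disk (the paths here are just placeholders):
bin/hadoop dfs -copyToLocal crawl /local/search/crawl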
In the NutchHadoopTutorial it says "the directory which it points to
should contain not just the index directory but also the linkdb,
segments, etc. All of these different databases are used by the
search. This is why we copied over the crawled directory and not just
the index directory."
Yes, that is correct. Only indexes in memory, other databases on local
disk.
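In other words, if the crawl is copied to a local directory (call it
crawl/ here, the name is just an example), the split is roughly:
crawl/indexes   -> served from RAM
crawl/linkdb    -> stays on local disk
crawl/segments  -> stays on local disk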
We have found that a 4GB machine can handle roughly 2M pages in the
index with no swapping. Load on the machine also drops to practically
nothing, even at 20+ queries per second, because there is virtually
zero IO.
Dennis Kubes
If I understand your comment correctly, you are saying not to copy
the linkdb and segments data, just the index directory. Is that
correct? I think this is the source of my confusion, because it
sounds like all of the crawl data needs to be copied to each search
machine.
In the trick below, /path/to is the directory that holds all of the
crawl data: /path/to/linkdb, /path/to/segments, etc. The trick moves
/path/to/indexes to another directory and then mounts a RAM filesystem
on /path/to/indexes. To Nutch everything still looks like one big
crawl dir, but whenever it accesses an index it is actually reading it
from RAM.
The reason I am asking is that when I read the ACM article by Mike
Cafarella and Doug Cutting, it sounded to me like the concern was
making the index structures fit in primary memory, not the entire
crawled database. Did I misunderstand the ACM article?
No, what they are saying is that the more pages per index per machine
on hard disk, the slower the search. Keeping the main indexes in RAM,
but not the segments, which hold the raw page content, can speed up
search significantly.
One way to do this, if you are running on Linux, is to create a tmpfs
(which lives in RAM) and mount it over the index directory. The index
then looks normal to the application but is essentially served from
RAM. This is how we serve the Nutch Lucene indexes on our web search
engine (www.visvo.com), which is ~100M pages. Below is how you can
achieve this, assuming your indexes are in /path/to/indexes:
mv /path/to/indexes /path/to/indexes.dist
mkdir /path/to/indexes
cd /path/to
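# size is in bytes; 2684354560 = 2.5GB, make it large enough to hold the indexes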
mount -t tmpfs -o size=2684354560 none /path/to/indexes
rsync --progress -aptv indexes.dist/* indexes/
chown -R user:group indexes
This would of course be limited by the amount of RAM you have on the
machine. But with this approach most searches are sub-second.
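As a quick sanity check after the copy (same example paths as above),
df will show the tmpfs mount and how much of it the indexes are using:
df -h /path/to/indexes
Keep in mind that tmpfs contents are lost on reboot, so the mount and
rsync steps have to be repeated after a restart.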
Thanks for the information.
Dennis Kubes