We noticed a significant improvement between the regular filesystem and
keeping the indexes in memory. I haven't tried force-reading all of the
files, so I don't know what that performance difference would be. I can
say that tmpfs is transparent to the application and I/O drops to
practically zero when it is used.
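For anyone who wants to try that comparison, a minimal sketch of force-reading
the index into the OS page cache (using the same example /path/to/indexes
location as below) would be something like:

# Read every index file once so the kernel caches the pages.
# /path/to/indexes is a placeholder for wherever your index lives.
find /path/to/indexes -type f -exec cat {} + > /dev/null

Those cached pages can still be evicted under memory pressure, which is one
place the two approaches might differ in practice.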
Dennis Kubes
Otis Gospodnetic wrote:
Dennis,
Does the tmpfs really help more than the normal FS caching would help?
For example, if you were to force the FS to read the whole index (files), it
would read them into RAM and, hopefully, cache them. Wouldn't that achieve the
same effect as tmpfs? I've done the former with very large indices and it had
a very clear and positive effect, but I never directly compared it to tmpfs.
Intuitively speaking, using tmpfs and the regular FS caching should have the
same effect, no?
If the machine has enough RAM to keep the whole index in RAM via tmpfs, then
there should also be enough memory for the FS to keep the index in its memory
buffers.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Dennis Kubes <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, December 4, 2007 12:37:19 PM
Subject: Re: Hadoop distributed search.
Trey Spiva wrote:
According to the Hadoop tutorial on the wiki
(http://wiki.apache.org/nutch/NutchHadoopTutorial):
"you don't want to search using DFS, you want to search using local
filesystems. Once the index has been created on the DFS you can use
the hadoop copyToLocal command to move it to the local file system as
such" ... "Understand that at this point we are not using the DFS or
MapReduce to do the searching, all of it is on a local machine".
So my understanding is that Hadoop is only good for batch index
building, and is not suitable for incremental index building and
search.
Is this true?
That is correct. DFS is for batch processing and MapReduce jobs. Local
servers (disks) are for serving indexes. Even better, put the local
indexes (not segments, just indexes) in RAM.
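To make the handoff concrete, a minimal sketch of pulling a finished index
off DFS onto the search machine's local disk (the DFS and local paths here
are just placeholders) might look like:

# Copy the completed index out of DFS onto the local disk of the search server.
# /user/nutch/crawl/indexes and /local/search/indexes are example paths.
bin/hadoop dfs -copyToLocal /user/nutch/crawl/indexes /local/search/indexes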
The reason I am asking is that when I read the ACM article by
Mike Cafarella and Doug Cutting, it sounded to me like the concern
was to make the index structures fit in primary memory, not the
entire crawled database. Did I misunderstand the ACM article?
No, what they are saying is that the more pages per index per machine
on hard disk, the slower the search. Keeping the main indexes, but not
the segments which hold the raw page content, in RAM can speed up
search significantly.
One way to do this if you are running on Linux is to create a tmpfs
filesystem (which lives in RAM) and mount it where the indexes are
stored. Then your index acts normally to the application but is
essentially served from RAM. This is how we serve the Nutch Lucene
indexes on our web search engine (www.visvo.com), which is ~100M pages.
Below is how you can achieve this, assuming your indexes are in
/path/to/indexes:
# Move the on-disk copy aside and recreate the mount point.
mv /path/to/indexes /path/to/indexes.dist
mkdir /path/to/indexes
cd /path/to
# Mount a 2.5 GB (2684354560-byte) tmpfs over the empty directory.
mount -t tmpfs -o size=2684354560 none /path/to/indexes
# Copy the indexes into the RAM-backed directory and fix ownership.
rsync --progress -aptv indexes.dist/* indexes/
chown -R user:group indexes
This would of course be limited by the amount of RAM you have on the
machine. But with this approach most searches are sub-second.
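If you want to confirm the index really is being served from RAM, a couple of
standard commands (using the same example path) are enough:

# Verify the tmpfs mount and its size.
df -h /path/to/indexes
# Check how much RAM remains after the copy.
free -m

Keep in mind that tmpfs contents disappear on reboot, so the on-disk copy in
/path/to/indexes.dist is what you repopulate from.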
Dennis Kubes