According to a Hadoop tutorial on the wiki
(http://wiki.apache.org/nutch/NutchHadoopTutorial),
"you don't want to search using DFS, you want to search using
local filesystems. Once the index has been created on the DFS
you can use the hadoop copyToLocal command to move it to the
local file system as such" ... "Understand that at this point we
are not using the DFS or MapReduce to do the searching, all of it
is on a local machine".
So my understanding is that Hadoop is only good for batch index
building, and is not suitable for incremental index building or for
searching. Is this true?
That is correct. Use DFS for batch processing and MapReduce jobs,
and local servers (disks) for serving indexes. Even better, put the
local indexes (not the segments, just the indexes) in RAM.
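For example, once a batch index build has finished on DFS, something
along these lines pulls it down to a local disk for serving (the
"crawl" and /local/search paths are just placeholders for
illustration):

# copy the finished crawl output (index plus supporting dbs) out of
# DFS onto the search machine's local disk
bin/hadoop dfs -copyToLocal crawl /local/search/crawl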
In the NutchHadoopTutorial it says "the directory which it points
to should contain not just the index directory but also the linkdb,
segments, etc. All of these different databases are used by the
search. This is why we copied over the crawled directory and not
just the index directory."
If I understand your comment correctly, you are saying not to copy
the linkdb and segments data, just the index directory. Is that
correct? I think this is the source of my confusion, because it
sounds like the entire crawl data needs to be copied to each search
machine.
In the trick below, /path/to is the directory that has all the crawl
data: /path/to/linkdb, /path/to/segments, etc. The trick moves
/path/to/indexes to another directory, then mounts a RAM filesystem
on /path/to/indexes. So to Nutch everything just looks like one big
crawl dir, but whenever it accesses an index it is actually getting
it from RAM.
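As a quick illustration (using the same /path/to as above), you can
confirm that Nutch still sees a normal crawl dir while the indexes
are really on the RAM mount:

ls /path/to                    # still shows linkdb, segments, indexes, etc.
mount | grep /path/to/indexes  # confirms indexes is served from the tmpfs mount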
The reason I am asking is that when I read the ACM article by Mike
Cafarella and Doug Cutting, it sounded to me like the concern was to
make the index structures fit in primary memory, not the entire
crawled database. Did I misunderstand the ACM article?
No, what they are saying is that the more pages per index per machine
on hard disk, the slower the search. Keeping the main indexes, but
not the segments which hold the raw page content, in RAM can speed up
search significantly.
One way to do this if you are running on Linux is to create a tmpfs
(which is RAM-backed) and mount it over the index directory. Then
your index behaves normally to the application but is essentially
served from RAM. This is how we serve the Nutch Lucene indexes on
our web search engine (www.visvo.com), which is ~100M pages. Below
is how you can achieve this, assuming your indexes are in
/path/to/indexes:
mv /path/to/indexes /path/to/indexes.dist        # keep the on-disk copy
mkdir /path/to/indexes                           # recreate the mount point
cd /path/to
mount -t tmpfs -o size=2684354560 none /path/to/indexes   # 2.5 GB RAM filesystem
rsync --progress -aptv indexes.dist/* indexes/   # copy the indexes into RAM
chown -R user:group indexes
This would of course be limited by the amount of RAM you have on
the machine. But with this approach most searches are sub-second.
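A quick sanity check before choosing the tmpfs size (paths as above)
is to compare the on-disk index size with the memory available on the
box:

du -sh /path/to/indexes.dist   # total size of the indexes to be held in RAM
free -m                        # memory available for the tmpfs mount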
Thanks for the information.
Dennis Kubes