Re: Creating Lucene index in Hadoop

2009-03-17 Thread Ning Li
 Lucene on a local disk benefits significantly from the local filesystem's
 RAM cache (aka the kernel's buffer cache).  HDFS has no such local RAM cache
 outside of the stream's buffer.  The cache would need to be no larger than
 the kernel's buffer cache to get an equivalent hit ratio.

If the two cache sizes are the same, then yes. It's just that the local
FS cache size is adjusted (more?) dynamically by the kernel.


Cheers,
Ning


Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ian Soboroff

I understand why you would index in the reduce phase, because the anchor
text gets shuffled to be next to the document.  However, when you index
in the map phase, don't you just have to reindex later?

The main point to the OP is that HDFS is a bad FS for writing Lucene
indexes because of how Lucene works.  The simple approach is to write
your index outside of HDFS in the reduce phase, and then merge the
indexes from each reducer manually.

Ian
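
A minimal sketch of that manual merge step, assuming Lucene 2.4-era APIs
(FSDirectory.getDirectory, IndexWriter.addIndexesNoOptimize); the class
name and paths are illustrative, not from any actual tool:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeShards {
  // Usage: MergeShards <merged-index-dir> <reducer-index-dir> ...
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory(new File(args[0])),
        new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

    Directory[] shards = new Directory[args.length - 1];
    for (int i = 1; i < args.length; i++) {
      // one index directory per reducer, e.g. /local/index/part-00000
      shards[i - 1] = FSDirectory.getDirectory(new File(args[i]));
    }
    writer.addIndexesNoOptimize(shards);  // merge without a full optimize
    writer.optimize();                    // optional: collapse to one segment
    writer.close();
  }
}

Each reducer's output directory is opened as a Directory and folded into
one local index, which can then be copied wherever it will be served from.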




Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ian Soboroff

Does anyone have stats on how multiple readers on an optimized Lucene
index in HDFS compare with a ParallelMultiReader (or whatever it's
called) over RPC on a local filesystem?

I'm missing why you would ever want the Lucene index in HDFS for
reading.

Ian

Ning Li ning.li...@gmail.com writes:

 I should have pointed out that the Nutch index build and contrib/index
 target different applications. The latter is for applications that
 simply want to build a Lucene index from a set of documents - e.g. with
 no link analysis.

 As for writing Lucene indexes, both work the same way - write the final
 results to the local file system and then copy them to HDFS. In
 contrib/index, the intermediate results are kept in memory and not
 written to HDFS.

 Hope that clarifies things.

 Cheers,
 Ning
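
A hedged illustration of that "write locally, then copy to HDFS" step
using Hadoop's FileSystem API; this is not the actual contrib/index code,
and the class and paths are invented for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PublishShard {
  // Upload a finished local index shard into HDFS,
  // e.g. publish("/tmp/shard-00000", "/indexes/shard-00000").
  public static void publish(String localDir, String hdfsDir) throws Exception {
    FileSystem hdfs = FileSystem.get(new Configuration());
    hdfs.copyFromLocalFile(new Path(localDir), new Path(hdfsDir));
  }
}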


 On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff ian.sobor...@nist.gov wrote:

 I understand why you would index in the reduce phase, because the anchor
 text gets shuffled to be next to the document.  However, when you index
 in the map phase, don't you just have to reindex later?

 The main point to the OP is that HDFS is a bad FS for writing Lucene
 indexes because of how Lucene works.  The simple approach is to write
 your index outside of HDFS in the reduce phase, and then merge the
 indexes from each reducer manually.

 Ian

 Ning Li ning.li...@gmail.com writes:

 Or you can check out the index contrib. The difference of the two is that:
   - In Nutch's indexing map/reduce job, indexes are built in the
 reduce phase. Afterwards, they are merged into smaller number of
 shards if necessary. The last time I checked, the merge process does
 not use map/reduce.
   - In contrib/index, small indexes are built in the map phase. They
 are merged into the desired number of shards in the reduce phase. In
 addition, they can be merged into existing shards.

 Cheers,
 Ning


 On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 imcap...@126.com wrote:
 you can see the nutch code.

 2009/3/13 Mark Kerzner markkerz...@gmail.com

 Hi,

 How do I allow multiple nodes to write to the same index file in HDFS?

 Thank you,
 Mark







Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ning Li
 I'm missing why you would ever want the Lucene index in HDFS for
 reading.

The Lucene indexes are written to HDFS, but that does not mean you
search the indexes stored in HDFS directly. HDFS is not designed for
random access. Usually the indexes are copied to the nodes where search
will be served. With
http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
become feasible to search on HDFS directly.

Cheers,
Ning
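
The copy-out step can be as simple as the following sketch with Hadoop's
FileSystem API (the paths and class name are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FetchShard {
  // Pull a finished shard from HDFS onto a search node's local disk,
  // e.g. FetchShard /indexes/shard-00000 /local/index/shard-00000
  public static void main(String[] args) throws Exception {
    FileSystem hdfs = FileSystem.get(new Configuration());
    hdfs.copyToLocalFile(new Path(args[0]), new Path(args[1]));
    // The local copy can then be opened with FSDirectory and searched.
  }
}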




Re: Creating Lucene index in Hadoop

2009-03-16 Thread Doug Cutting

Ning Li wrote:

With
http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
become feasible to search on HDFS directly.


I don't think HADOOP-4801 is required.  It would help, certainly, but 
it's so fraught with security and other issues that I doubt it will be 
committed anytime soon.


What would probably help HDFS random access performance for Lucene 
significantly would be:
 1. A cache of connections to datanodes, so that each seek() does not 
require an open().  If we move HDFS data transfer to be RPC-based (see, 
e.g., http://issues.apache.org/jira/browse/HADOOP-4386), then this will 
come for free, since RPC already caches connections.  We hope to do this 
for Hadoop 1.0, so that we use a single transport for all Hadoop's core 
operations, to simplify security.
 2. A local cache of read-only HDFS data, equivalent to the kernel's buffer 
cache.  This might be implemented as a Lucene Directory that keeps an 
LRU cache of buffers from a wrapped filesystem, perhaps a subclass of 
RAMDirectory (a sketch of this idea follows below).


With these, performance would still be slower than a local drive, but 
perhaps not so dramatically.


Doug
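
A hedged sketch of Doug's item 2 against Lucene 2.4-era IndexInput: a
read-only wrapper that serves reads from an LRU cache of fixed-size
blocks fetched from an underlying, e.g. HDFS-backed, input. The class
name, block size, and eviction policy are assumptions for illustration,
not code from any contrib:

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.lucene.store.IndexInput;

public class BlockCachingInput extends IndexInput {
  private static final int BLOCK = 8192;        // cache granularity in bytes
  private final IndexInput in;                  // underlying HDFS-backed input
  private final LinkedHashMap<Long, byte[]> cache;
  private long pos;                             // current file position

  public BlockCachingInput(IndexInput in, final int maxBlocks) {
    this.in = in;
    // an access-ordered LinkedHashMap doubles as a simple LRU cache
    this.cache = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
      protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
        return size() > maxBlocks;
      }
    };
  }

  public byte readByte() throws IOException {
    byte[] block = block(pos / BLOCK);
    return block[(int) (pos++ % BLOCK)];
  }

  public void readBytes(byte[] b, int offset, int len) throws IOException {
    for (int i = 0; i < len; i++) {             // simple, not tuned for speed
      b[offset + i] = readByte();
    }
  }

  // Fetch block n from the cache, reading the wrapped input on a miss.
  private byte[] block(long n) throws IOException {
    byte[] b = cache.get(n);
    if (b == null) {
      long start = n * BLOCK;
      b = new byte[(int) Math.min(BLOCK, in.length() - start)];
      in.seek(start);
      in.readBytes(b, 0, b.length);
      cache.put(n, b);
    }
    return b;
  }

  public long getFilePointer() { return pos; }
  public void seek(long p) { pos = p; }         // no I/O until the next read
  public long length() { return in.length(); }
  public void close() throws IOException { in.close(); }
}

A Directory wrapper would hand these out from openInput(); how large the
cache must be to get a good hit ratio is the question Ning raises below.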


Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ning Li
1 is good. But for 2:
  - Won't it have a security concern as well? Or is this not a general
local cache?
  - You are referring to caching in RAM, not caching in the local FS,
right? In general, a Lucene index could be quite large. We may
have to cache a lot of data to reach a reasonable hit ratio...

Cheers,
Ning





Re: Creating Lucene index in Hadoop

2009-03-13 Thread Ning Li
Or you can check out the index contrib. The difference between the two is that:
  - In Nutch's indexing map/reduce job, indexes are built in the
reduce phase. Afterwards, they are merged into a smaller number of
shards if necessary. The last time I checked, the merge process does
not use map/reduce.
  - In contrib/index, small indexes are built in the map phase. They
are merged into the desired number of shards in the reduce phase. In
addition, they can be merged into existing shards (a sketch of the
map-side step follows below).

Cheers,
Ning
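
A hedged sketch of the map-side half (the real contrib/index code
differs; the record format and method names here are assumptions): each
map task feeds documents to an IndexWriter over a RAMDirectory, and the
small resulting index is handed off for merging in the reduce phase.

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class MapSideIndexer {
  private final RAMDirectory dir = new RAMDirectory();
  private final IndexWriter writer;

  public MapSideIndexer() throws IOException {
    writer = new IndexWriter(dir, new StandardAnalyzer(), true,
                             IndexWriter.MaxFieldLength.UNLIMITED);
  }

  // Called once per input record, e.g. from a Mapper's map() method.
  public void add(String id, String text) throws IOException {
    Document doc = new Document();
    doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("body", text, Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);
  }

  // At task end the small in-memory index is closed and handed off; in
  // contrib/index it would then be merged with others into a shard.
  public RAMDirectory finish() throws IOException {
    writer.close();
    return dir;
  }
}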


On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 imcap...@126.com wrote:
 You can look at the Nutch code.

 2009/3/13 Mark Kerzner markkerz...@gmail.com

 Hi,

 How do I allow multiple nodes to write to the same index file in HDFS?

 Thank you,
 Mark