Re: Creating Lucene index in Hadoop
Lucene on a local disk benefits significantly from the local filesystem's RAM cache (i.e., the kernel's buffer cache). HDFS has no such local RAM cache outside of the stream's buffer.

> The cache would need to be no larger than the kernel's buffer cache to get an equivalent hit ratio.

If the two cache sizes are the same, then yes. The difference is that the local FS cache size is adjusted (perhaps more) dynamically.

Cheers, Ning
Re: Creating Lucene index in Hadoop
I understand why you would index in the reduce phase: the anchor text gets shuffled to be next to the document. However, when you index in the map phase, don't you just have to reindex later?

The main point for the OP is that HDFS is a bad filesystem for writing Lucene indexes, because of how Lucene writes its files. The simple approach is to write your index outside of HDFS in the reduce phase, and then merge the indexes from each reducer manually.

Ian

Ning Li ning.li...@gmail.com writes: Or you can check out the index contrib. [...]
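The "merge the indexes from each reducer manually" step can be sketched as follows. This is a hedged illustration, not the actual Lucene or contrib/index code: each per-reducer index is modeled as a sorted term-to-postings map, and the merge unions the terms and concatenates the posting lists, which is roughly what IndexWriter.addIndexes() does over real per-reducer Lucene directories on a local disk. The class and method names here are made up.

```java
import java.util.*;

// Hedged sketch, not Lucene's actual merge code: each reducer writes its own
// small index (modeled here as a sorted term -> posting-list map). A final,
// non-map/reduce step merges the per-reducer shards into one index, the way
// you would merge per-reducer Lucene directories with IndexWriter.addIndexes().
public class ShardMerge {
    // Merge per-reducer shards: union the terms, concatenate the postings.
    static SortedMap<String, List<Integer>> merge(List<SortedMap<String, List<Integer>>> shards) {
        SortedMap<String, List<Integer>> merged = new TreeMap<>();
        for (SortedMap<String, List<Integer>> shard : shards) {
            for (Map.Entry<String, List<Integer>> e : shard.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                      .addAll(e.getValue());
            }
        }
        // Keep each posting list sorted by document id, as a real merge would.
        for (List<Integer> postings : merged.values()) {
            Collections.sort(postings);
        }
        return merged;
    }
}
```

Because document ids from different reducers interleave, the final sort per term stands in for the doc-id remapping a real segment merge performs.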
Re: Creating Lucene index in Hadoop
Does anyone have stats on how multiple readers on an optimized Lucene index in HDFS compare with a ParallelMultiReader (or whatever it's called) over RPC on a local filesystem? I'm missing why you would ever want the Lucene index in HDFS for reading.

Ian

Ning Li ning.li...@gmail.com writes: I should have pointed out that the Nutch index build and contrib/index target different applications. The latter is for applications that simply want to build a Lucene index from a set of documents - e.g. with no link analysis. As to writing Lucene indexes, both work the same way: write the final results to the local file system and then copy them to HDFS. In contrib/index, the intermediate results are kept in memory and not written to HDFS. Hope that clarifies things. Cheers, Ning [...]
Re: Creating Lucene index in Hadoop
> I'm missing why you would ever want the Lucene index in HDFS for reading.

The Lucene indexes are written to HDFS, but that does not mean search is conducted directly on the indexes stored in HDFS. HDFS is not designed for random access. Usually the indexes are copied to the nodes where search will be served. With http://issues.apache.org/jira/browse/HADOOP-4801, however, it may become feasible to search on HDFS directly.

Cheers, Ning

On Mon, Mar 16, 2009 at 4:52 PM, Ian Soboroff ian.sobor...@nist.gov wrote: [...]
Re: Creating Lucene index in Hadoop
Ning Li wrote:
> With http://issues.apache.org/jira/browse/HADOOP-4801, however, it may become feasible to search on HDFS directly.

I don't think HADOOP-4801 is required. It would help, certainly, but it's so fraught with security and other issues that I doubt it will be committed anytime soon. What would probably help HDFS random access performance for Lucene significantly would be:

1. A cache of connections to datanodes, so that each seek() does not require an open(). If we move HDFS data transfer to be RPC-based (see, e.g., http://issues.apache.org/jira/browse/HADOOP-4386), then this will come for free, since RPC already caches connections. We hope to do this for Hadoop 1.0, so that we use a single transport for all of Hadoop's core operations, to simplify security.

2. A local cache of read-only HDFS data, equivalent to the kernel's buffer cache. This might be implemented as a Lucene Directory that keeps an LRU cache of buffers from a wrapped filesystem, perhaps as a subclass of RAMDirectory.

With these, performance would still be slower than a local drive, but perhaps not so dramatically.

Doug
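Doug's second suggestion can be sketched in a few lines. This is a hypothetical illustration only: a real implementation would be a Lucene Directory wrapping an HDFS FileSystem, whereas here the backing store is reduced to a callback and all class and method names are invented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hedged sketch of an LRU cache of read-only blocks in front of a slow store
// (e.g. HDFS). Not the actual Lucene or HDFS API: the backing store is just a
// callback, and the names are illustrative.
public class BlockCache {
    private final int capacity;                 // max number of cached blocks
    private final Function<Long, byte[]> fetch; // loads a block from the backing store
    private long hits, misses;
    private final LinkedHashMap<Long, byte[]> lru;

    public BlockCache(int capacity, Function<Long, byte[]> fetch) {
        this.capacity = capacity;
        this.fetch = fetch;
        // An access-ordered LinkedHashMap evicts the least-recently-used block.
        this.lru = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > BlockCache.this.capacity;
            }
        };
    }

    public byte[] read(long blockId) {
        byte[] block = lru.get(blockId);
        if (block != null) { hits++; return block; }
        misses++;
        block = fetch.apply(blockId);  // expensive: goes to the datanode
        lru.put(blockId, block);
        return block;
    }

    public double hitRatio() {
        return hits + misses == 0 ? 0 : (double) hits / (hits + misses);
    }
}
```

As Ning notes in the follow-up, the interesting question is how large the cache must be before the hit ratio approaches what the kernel's buffer cache gives a local index.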
Re: Creating Lucene index in Hadoop
Point 1 is good. But for 2:

- Won't it have a security concern as well? Or is this not a general local cache?
- You are referring to caching in RAM, not caching in the local FS, right? In general, a Lucene index can be quite large. We may have to cache a lot of data to reach a reasonable hit ratio...

Cheers, Ning

On Mon, Mar 16, 2009 at 5:36 PM, Doug Cutting cutt...@apache.org wrote: [...]
Re: Creating Lucene index in Hadoop
Or you can check out the index contrib. The difference between the two is:

- In Nutch's indexing map/reduce job, indexes are built in the reduce phase. Afterwards, they are merged into a smaller number of shards if necessary. The last time I checked, the merge process does not use map/reduce.
- In contrib/index, small indexes are built in the map phase. They are merged into the desired number of shards in the reduce phase. In addition, they can be merged into existing shards.

Cheers, Ning

On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 imcap...@126.com wrote: you can see the nutch code.

2009/3/13 Mark Kerzner markkerz...@gmail.com: Hi, how do I allow multiple nodes to write to the same index file in HDFS? Thank you, Mark
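The contrib/index flow described above depends on routing each map-built mini-index deterministically to one of the desired shards, so that a single reducer sees everything destined for that shard and can merge it. A minimal sketch of that routing, in the spirit of a Hadoop Partitioner (the names are illustrative, not the actual contrib/index API):

```java
// Hedged sketch of shard routing for the contrib/index flow: the map phase
// assigns each document (or mini-index) to one of numShards shards, and each
// reducer merges everything it receives into that shard. Names are made up.
public class ShardRouter {
    // Deterministic shard assignment, analogous to Partitioner.getPartition():
    // the same document id always lands on the same shard/reducer.
    static int shardFor(String docId, int numShards) {
        // Mask the sign bit so negative hashCodes still yield a valid index.
        return (docId.hashCode() & Integer.MAX_VALUE) % numShards;
    }
}
```

Determinism is what makes "merged into existing shards" possible: re-running the job sends updates for a document to the shard that already holds it.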