Some follow-up on the performance issues:

> > PERFORMANCE ISSUES
> >
> > Our preliminary performance experiments show that the performance
> > of building an index is quite reasonable. However, the performance of
> > random reads in HDFS is so poor that the search performance is
> > dramatically worse than that on local file systems.
> >
>
> What do you mean by 'dramatic' in the above? This is a sweet feature. That
> it's slow on first implementation is OK. Are you thinking it's so slow, it's
> not functional?

On a local FS, real disk IO is expensive, and Lucene relies on the FS cache to provide high search performance. Because of this, the following comparisons are based on warm test results. The comparison is between the local FS and a one-node HDFS.

HDFS provides high sequential read performance but poor random read performance. When the data is warm, this is mainly because of socket overhead. On HDFS 0.17.1, the search performance is more than an order of magnitude slower than on a local FS. Even when the socket connection is reused, the search performance is still about an order of magnitude slower.

Since this is caused by the socket overhead in HDFS, you see similar results with random reads on a map file. I used HBase's MapFilePerformanceEvaluation. Random reads on a map file in HDFS are a little under 7 times slower than on a local FS. This is a bit better than the search performance, probably because a random read on a map file consists of several almost-sequential reads on the data file in HDFS.

Given the above, would the search performance be acceptable?

PS: I saw on http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation that the random read performance on a map file improved quite a bit from 0.17.1 to 0.18.0. Any insight?
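For anyone who wants to try the warm local-FS side of such a comparison, the following is a minimal sketch in Python of how random and sequential block reads could be timed against a file that is already in the FS cache. It is not the test that produced the numbers above, it does not touch HDFS or Lucene, and every name in it (`BLOCK`, `make_test_file`, `warm`, `time_reads`, `compare`) is hypothetical. The 4 KB block size and the file size are arbitrary assumptions.

```python
import os
import random
import tempfile
import time

BLOCK = 4096  # bytes per read; an arbitrary choice for this sketch


def make_test_file(num_blocks):
    """Create a temporary file of num_blocks * BLOCK bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for _ in range(num_blocks):
            f.write(os.urandom(BLOCK))
    return path


def warm(path):
    """Read the whole file once so later reads are likely served from the FS cache."""
    with open(path, "rb") as f:
        while f.read(1 << 20):
            pass


def time_reads(path, offsets):
    """Seek to each offset, read one block, and return (seconds, bytes_read)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python's own read buffer hiding the access pattern
    with open(path, "rb", buffering=0) as f:
        for off in offsets:
            f.seek(off)
            total += len(f.read(BLOCK))
    return time.perf_counter() - start, total


def compare(num_blocks=2048, seed=0):
    """Time the same set of blocks read in file order and in shuffled order."""
    path = make_test_file(num_blocks)
    try:
        warm(path)
        sequential = [i * BLOCK for i in range(num_blocks)]
        shuffled = sequential[:]
        random.Random(seed).shuffle(shuffled)
        seq_seconds, seq_bytes = time_reads(path, sequential)
        rnd_seconds, rnd_bytes = time_reads(path, shuffled)
    finally:
        os.remove(path)
    return {
        "sequential_seconds": seq_seconds,
        "sequential_bytes": seq_bytes,
        "random_seconds": rnd_seconds,
        "random_bytes": rnd_bytes,
    }


if __name__ == "__main__":
    for key, value in compare().items():
        print(key, value)
```

On a warm local file the two timings are usually close, since neither pattern goes to disk. To see the effect discussed above, the same access pattern would have to be issued through an HDFS client, where each positioned read also pays the socket cost.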
