The below looks excellent, Ning.
Comments interspersed.
Ning Li wrote:
...
UPDATING A COLUMN
Upon receiving a column update request, a region not only adds the
column to the cache part of the store, but also analyzes the column
and adds it to the cache part of the index. Same as the store files,
the Lucene index files are also written to HDFS.
Following the HBase design, to avoid resource contention, a region
server globally schedules the cache flush and the compaction of both
the store files and the index files of all the regions on the server.
How does this work with regard to TTL and cell versions?
Do you have to do a rewrite of the lucene index at compaction time?
Or just call optimize? (I suppose it's the former if you need to clean
up 'References' as per below where you talk of splits)
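To make the question concrete, here is a toy sketch (plain Python, not
HBase or Lucene code) of what I would expect a compaction-time rewrite to
have to do: drop cells past their TTL and keep only the newest N versions
per row/column, so the rewritten index references surviving cells only.
The names Cell, compact, ttl_seconds and max_versions are made up for
illustration, and timestamps in seconds are an assumption.

```python
from collections import defaultdict
from typing import NamedTuple


class Cell(NamedTuple):
    row: str
    column: str
    timestamp: int  # seconds since epoch (assumption for this sketch)
    value: str


def compact(cells, now, ttl_seconds, max_versions):
    """Return the cells that survive a compaction, newest first per (row, column)."""
    by_key = defaultdict(list)
    for cell in cells:
        # Drop cells whose age exceeds the TTL.
        if now - cell.timestamp <= ttl_seconds:
            by_key[(cell.row, cell.column)].append(cell)

    survivors = []
    for key in sorted(by_key):
        versions = sorted(by_key[key], key=lambda c: c.timestamp, reverse=True)
        # Keep only the newest max_versions versions of each cell.
        survivors.extend(versions[:max_versions])
    return survivors


# Hypothetical cells: four versions of one cell plus one long-expired cell.
cells = [
    Cell("r1", "c1", 100, "old"),
    Cell("r1", "c1", 900, "mid"),
    Cell("r1", "c1", 950, "new"),
    Cell("r1", "c1", 990, "newest"),
    Cell("r2", "c1", 10, "expired"),
]
kept = compact(cells, now=1000, ttl_seconds=500, max_versions=2)
print([c.value for c in kept])  # ['newest', 'new']
```

If the index has to agree with the output of something like this, the
index entries for the dropped cells have to go away too, which is why I
suspect a rewrite rather than a plain optimize.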
PERFORMANCE ISSUES
Our preliminary performance experiments show that the performance
of building an index is quite reasonable. However, the performance of
random reads in HDFS is so poor that the search performance is
dramatically worse than that on local file systems.
We are exploring different ways to solve this problem. One possibility
is to store a copy on local file system. On the other hand, most likely
HDFS already stores a local copy...
What do you mean by 'dramatic' in the above? This is a sweet feature.
That it's slow in a first implementation is OK. Are you thinking it's so
slow that it's not functional?
Regarding your 'on the other hand' above, that's a good point. Have you
verified that if a regionserver is running on a datanode, the lucene
index is written locally? Would be interesting to know.
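One way to check: for each block of an index file, list the datanodes
holding a replica and see whether the regionserver's own host is among
them. A toy sketch in Python, assuming the block-to-hosts list has already
been pulled out of HDFS by whatever means; the host names and the
local_fraction helper are made up for illustration.

```python
def local_fraction(block_hosts, local_host):
    """Fraction of blocks that have at least one replica on local_host."""
    if not block_hosts:
        return 0.0
    local = sum(1 for hosts in block_hosts if local_host in hosts)
    return local / len(block_hosts)


# Hypothetical replica placement for the four blocks of one index file.
block_hosts = [
    ["node1", "node4", "node7"],
    ["node1", "node2", "node9"],
    ["node3", "node5", "node8"],
    ["node1", "node6", "node8"],
]
print(local_fraction(block_hosts, "node1"))  # 0.75
```

A value near 1.0 for the regionserver's host would suggest the index
files are already being served from local disk.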
St.Ack