> In hbase, on split, daughters hold a reference to either the top or bottom
> half of their parent region. References are undone by compactions; as part
> of compaction, the part of the parent referenced by the daughter gets
> written out to store files under the daughter. Daughters try to undo
> references as promptly as possible because regions with references are not
> splittable (references to references, and so on, would soon become
> unmanageable).
>
> In your description, you mentioned that daughter regions reference their
> parents' index. When I said, 'a rewrite of the lucene index', I was asking,
> as per hbase regions, if you followed the model and wrote a new lucene index
> comprised of daughter-only content at compaction time. Or do you just
> 'optimize' and let the references build up so the daughter of a daughter
> points all the way up to the parent?
As in HBase, a split is not allowed while there are references to parent files, whether they are store files or index files.

> So, why do you think it is so slow going via HDFS FileSystem when the data is
> local? Is it the block-orientated access or is there just a high tax going
> via the HDFS FS interface?

Because of how DFSClient.DFSInputStream is implemented, a socket connection is opened and closed for almost every random read. We'll experiment with reusing socket connections in DFSInputStream.

Cheers,
Ning
