I just thought of one potential issue with this. The same file can be shared by multiple tablets on different tservers. If more than 3 tablets share a file, then with a replication factor of 3 not all of them can have a local replica, so it could cause problems if all of them request one. So if HDFS had this operation, Accumulo would have to be careful about which files it requested local blocks for.
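
To make that concrete, here is a rough sketch of the kind of guard a tserver might want before requesting a local replica. All names here are made up, and the HDFS "force local replica" call itself is still hypothetical.

import java.util.Map;

public class LocalityRequestCheck {

  private static final int REPLICATION = 3; // assumed HDFS replication factor

  // tabletsPerFile: how many tablets (across all tservers) reference each file.
  // Only ask for a local block when the file is shared by no more tablets than
  // there are replicas; otherwise some tservers could never be satisfied.
  public static boolean safeToRequestLocal(Map<String, Integer> tabletsPerFile, String file) {
    return tabletsPerFile.getOrDefault(file, 0) <= REPLICATION;
  }
}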
On Tue, Jun 30, 2015 at 11:00 AM, Keith Turner <[email protected]> wrote:

> There was a discussion on IRC about balancing and locality yesterday. I
> was thinking about the locality problem, and started thinking about the
> possibility of having an HDFS operation that would force a file to have
> local replicas. I think this approach has the following pros over forcing a
> compaction.
>
> * Only one replica is copied across the network.
> * Avoids decompressing, deserializing, serializing, and compressing data.
>
> The tricky part about this approach is that Accumulo needs to decide when
> to ask HDFS to make a file local. This decision could be based on a
> function of the file size and the number of recent accesses.
>
> We could avoid decompressing, deserializing, etc. today by just copying
> (not compacting) frequently accessed files. However, this would write 3
> replicas where an HDFS operation would only write one.
>
> Note: for the assertion that only one replica would need to be copied, I
> was thinking of the following 3 initial conditions. I am assuming we want
> to avoid putting all three replicas on the same rack.
>
> * Zero replicas on rack: can copy a replica to the node and drop a
> replica on another rack.
> * One replica on rack: can copy a replica to the node and drop any other
> replica.
> * Two replicas on rack: can copy a replica to the node and drop another
> replica on the same rack.
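
For what it's worth, here is a rough sketch of the three cases above, i.e. which existing replica could be dropped after copying one to the target node. All names are made up; none of this is an existing HDFS API.

import java.util.List;

public class ForceLocalSketch {

  // A block replica tagged with its host and rack (illustrative only).
  static class Replica {
    final String host;
    final String rack;

    Replica(String host, String rack) {
      this.host = host;
      this.rack = rack;
    }
  }

  // Pick which existing replica to drop so that, after copying one replica
  // to the target node, the block still spans more than one rack.
  static Replica pickReplicaToDrop(List<Replica> replicas, String targetRack) {
    long onRack = 0;
    for (Replica r : replicas) {
      if (r.rack.equals(targetRack)) {
        onRack++;
      }
    }

    if (onRack == 2) {
      // Two replicas on the target rack: drop one of them so the
      // off-rack replica survives.
      for (Replica r : replicas) {
        if (r.rack.equals(targetRack)) {
          return r;
        }
      }
    }

    // Zero or one replica on the target rack: dropping any existing
    // replica still leaves the block spanning at least two racks after
    // the copy, so just take the first one.
    return replicas.get(0);
  }
}

In all three cases only the single copied replica crosses the network, and the block still spans two racks afterward.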
