There was a discussion on IRC about balancing and locality yesterday. I was thinking about the locality problem, and started thinking about the possibility of having an HDFS operation that would force a file to have local replicas. I think this approach has the following pros over forcing a compaction.
* Only one replica is copied across the network.
* Avoids decompressing, deserializing, serializing, and compressing data.

The tricky part about this approach is that Accumulo needs to decide when to ask HDFS to make a file local. This decision could be based on a function of the file size and the number of recent accesses (a rough sketch of such a heuristic is included below). We could avoid decompressing, deserializing, etc. today by just copying (not compacting) frequently accessed files. However, this would write three replicas where an HDFS operation would only write one.

Note, for the assertion that only one replica would need to be copied, I was thinking of the following three initial conditions (see the second sketch below). I am assuming we want to avoid having all three replicas on the same rack.

* Zero replicas on rack: can copy a replica to the node and drop a replica on another rack.
* One replica on rack: can copy a replica to the node and drop any other replica.
* Two replicas on rack: can copy a replica to the node and drop another replica on the same rack.
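
As a rough illustration of what such a decision function might look like, here is a minimal sketch in Java. The FileStats fields, the thresholds, and the shouldRequestLocalReplica method are all assumptions for illustration; none of this is an existing Accumulo or HDFS API.

```java
// Hypothetical heuristic for deciding when to ask HDFS to place a local
// replica of a file. Names and thresholds are assumptions for illustration.
public class LocalityHeuristic {

  // Assumed per-file statistics a tablet server could track.
  public static class FileStats {
    long sizeInBytes;        // file size on disk
    int recentAccesses;      // reads within some sliding time window
    double localBlockRatio;  // fraction of blocks already on the local datanode
  }

  private static final long MAX_SIZE_BYTES = 4L * 1024 * 1024 * 1024; // skip very large files
  private static final int MIN_RECENT_ACCESSES = 10;                  // file must be "hot" enough
  private static final double ALREADY_LOCAL_CUTOFF = 0.5;             // mostly local already

  /** Returns true if requesting a local replica looks worthwhile. */
  public static boolean shouldRequestLocalReplica(FileStats stats) {
    if (stats.localBlockRatio >= ALREADY_LOCAL_CUTOFF) {
      return false; // most blocks are already local, little to gain from a copy
    }
    if (stats.sizeInBytes > MAX_SIZE_BYTES) {
      return false; // copying a huge file may cost more than the remote reads it saves
    }
    return stats.recentAccesses >= MIN_RECENT_ACCESSES;
  }
}
```

The exact thresholds would need tuning; the point is only that the inputs (size, recent accesses, current locality) are cheap for a tablet server to track.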

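The second sketch below walks through the three initial conditions, assuming a replication factor of 3, that the target node holds no replica yet, and that we never want all three replicas on one rack. The Replica type and pickReplicaToDrop method are hypothetical, not part of HDFS.

```java
import java.util.List;

// Hypothetical sketch of which existing replica to drop for each of the three
// initial conditions above, so the replication factor stays at 3 after the copy.
public class ReplicaMoveSketch {

  // Minimal stand-in for a replica location (datanode host + rack id).
  public record Replica(String host, String rack) {}

  /**
   * Given the three existing replicas and the rack of the node that will
   * receive the new local copy, pick one existing replica to drop.
   */
  public static Replica pickReplicaToDrop(List<Replica> replicas, String targetRack) {
    long onTargetRack = replicas.stream().filter(r -> r.rack().equals(targetRack)).count();

    if (onTargetRack == 0) {
      // Zero replicas on the target rack: all replicas are on other racks, drop any one.
      return replicas.get(0);
    } else if (onTargetRack == 1) {
      // One replica on the target rack: dropping any existing replica is fine,
      // since at most two replicas end up on the target rack either way.
      return replicas.get(0);
    } else {
      // Two replicas on the target rack: drop one of them, otherwise the copy
      // would leave three replicas on the same rack.
      return replicas.stream().filter(r -> r.rack().equals(targetRack)).findFirst().orElseThrow();
    }
  }
}
```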