Interesting - HDFS-6133 would directly help the HBase data locality use case.

On Fri, Dec 19, 2014 at 2:20 PM, Yongjun Zhang <yzh...@cloudera.com> wrote:
> Hi,
>
> FYI,
>
> A relevant jira, HDFS-6133, tries to tell the Balancer not to move around
> the blocks stored at the favored nodes that the application selected. I
> reviewed the patch, and the latest one looks good to me. Hope some
> committers can pick it up and push it forward.
>
> Thanks.
>
> --Yongjun
>
>
> On Fri, Dec 19, 2014 at 1:52 PM, Ananth Gundabattula <
> agundabatt...@gmail.com> wrote:
>
> > Hello Zhe,
> >
> > Thanks a lot for the inputs. Storage policies are really what I was
> > looking for for one of the problems.
> >
> > @Nick: I agree that it would be a nice feature to have. Thanks for the
> > info.
> >
> > Regards,
> > Ananth
> >
> > On Fri, Dec 19, 2014 at 10:49 AM, Nick Dimiduk <ndimi...@gmail.com> wrote:
> >
> > > HBase would enjoy similar functionality. In our case, we'd like all
> > > replicas of all files under a given HDFS path to land on the same set
> > > of machines. That way, in the event of a failover, regions can be
> > > assigned to one of these other machines that has local access to all
> > > blocks for all region files.
> > >
> > > On Thu, Dec 18, 2014 at 3:36 PM, Zhe Zhang <zhe.zhang.resea...@gmail.com>
> > > wrote:
> > >
> > > > > The second aspect is that our queries are time-based, and this
> > > > > time window follows a familiar pattern of old data not being
> > > > > queried much. Hence we would like to preserve the most recent data
> > > > > in the HDFS cache (Impala is helping us manage this aspect via
> > > > > their command set), but we would like the next most recent chunk
> > > > > of data to land on an SSD that is present on every datanode. The
> > > > > remaining set of blocks, which are "very old but in large
> > > > > quantities", would land on spinning disks. The decision to choose
> > > > > a given volume is based on the file name, as we can control the
> > > > > filename that is used to generate the file.
> > > >
> > > > Have you tried the 'setStoragePolicy' command? It's part of the HDFS
> > > > "Heterogeneous Storage Tiers" work and seems to address your
> > > > scenario.
> > > >
> > > > > 1. Is there a way to control that all file blocks belonging to a
> > > > > particular hdfs directory & file go to the same physical datanode
> > > > > (and their corresponding replicas as well?)
> > > >
> > > > This seems inherently hard: the file/dir could have more data than a
> > > > single DataNode can host. Implementation-wise, it requires some sort
> > > > of a map in BlockPlacementPolicy from inode or file path to DataNode
> > > > address.
> > > >
> > > > My 2 cents..
> > > >
> > > > --
> > > > Zhe Zhang
> > > > Software Engineer, Cloudera
> > > > https://sites.google.com/site/zhezhangresearch/
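For reference, the storage-policy workflow Zhe mentions looks roughly like the sketch below. The directory paths are placeholders for this example; the policy names (ONE_SSD, COLD) are the stock HDFS policies, and the `hdfs storagepolicies` subcommand assumes Hadoop 2.7 or later (earlier 2.6 releases exposed the same operations via `hdfs dfsadmin`).

```shell
# Recent, hot data: place one replica on SSD, the rest on spinning disk.
# (/data/recent and /data/old are placeholder paths for this sketch.)
hdfs storagepolicies -setStoragePolicy -path /data/recent -policy ONE_SSD

# Old, rarely-queried data: all replicas on ARCHIVE storage.
hdfs storagepolicies -setStoragePolicy -path /data/old -policy COLD

# Check which policy is in effect on a path.
hdfs storagepolicies -getStoragePolicy -path /data/recent

# Changing a policy does not move already-written blocks; run the mover
# to migrate existing replicas so they match the new policy.
hdfs mover -p /data/old
```

Note that the policy applies per directory (or file), which fits the "choose the tier by filename" scheme described above: writers only need to route new files into the directory whose policy matches the data's age.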