> The second aspect is that our queries are time based and this time window
> follows a familiar pattern of old data not being queried much. Hence we
> would like to preserve the most recent data in the HDFS cache ( impala is
> helping us manage this aspect via their command set ) but we would like the
> next recent amount of data chunks to land on an SSD that is present on
> every datanode. The remaining set of blocks which are "very old but in
> large quantities" would land on spinning disks. The decision to choose a
> given volume is based on the file name as we can control the filename that
> is being used to generate the file.
Have you tried the 'setStoragePolicy' command? It's part of the HDFS
"Heterogeneous Storage Tiers" work and seems to address your scenario.

> 1. Is there a way to control that all file blocks belonging to a particular
> hdfs directory & file go to the same physical datanode ( and their
> corresponding replicas as well ? )

This seems inherently hard: the file/dir could have more data than a single
DataNode can host. Implementation-wise, it requires some sort of a map in
BlockPlacementPolicy from inode or file path to DataNode address.

My 2 cents..

--
Zhe Zhang
Software Engineer, Cloudera
https://sites.google.com/site/zhezhangresearch/
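
P.S. In case a concrete sketch helps: below is a small, untested example of
setting storage policies from the Java client API for the tiering you
describe. The class name and the /data/events/... paths are made up for
illustration. It assumes a 2.6+ release that ships the ONE_SSD/ALL_SSD
policies, and that the SSD volumes are tagged [SSD] in dfs.datanode.data.dir
while the spinning disks keep the default DISK type. Also note a policy only
affects blocks written afterwards; existing blocks have to be relocated with
the 'hdfs mover' tool.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class StorageTieringSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumes fs.defaultFS in the loaded *-site.xml points at your cluster.
    FileSystem fs = FileSystem.get(conf);
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // Hypothetical layout, partitioned by date in the path (you mentioned
    // you control the file names):
    Path warm = new Path("/data/events/recent");  // next-most-recent chunks
    Path cold = new Path("/data/events/old");     // very old, large volume

    // ALL_SSD places every replica on SSD; ONE_SSD keeps one replica on SSD
    // and the rest on DISK.
    dfs.setStoragePolicy(warm, "ALL_SSD");

    // The default HOT policy already keeps all replicas on DISK (spinning
    // disks), so this is only to make the intent explicit. Alternatively,
    // tag the spinning disks [ARCHIVE] and use the COLD policy here.
    dfs.setStoragePolicy(cold, "HOT");

    // The most recent data can stay HOT and be pinned in the HDFS cache
    // separately (cacheadmin / Impala, as you mentioned).
  }
}

The same can be done from the shell with the setStoragePolicy command
(exposed via dfsadmin in 2.6 and via the 'hdfs storagepolicies' subcommand in
later releases).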