> The second aspect is that our queries are time-based, and the time
> window follows a familiar pattern: old data is not queried much. We
> would therefore like to keep the most recent data in the HDFS cache
> (Impala helps us manage this via its command set), while the
> next-most-recent chunks of data land on an SSD that is present on
> every DataNode. The remaining blocks, which are very old but large in
> volume, would land on spinning disks. The choice of volume is based
> on the file name, since we control the filename used to generate the
> file.
>

Have you tried the 'setStoragePolicy' command? It's part of the HDFS
"Heterogeneous Storage Tiers" work and seems to address your scenario.

> 1. Is there a way to ensure that all file blocks belonging to a
> particular HDFS directory and file go to the same physical DataNode
> (and their corresponding replicas as well)?

This seems inherently hard: the file/dir could have more data than a
single DataNode can host. Implementation-wise, it requires some sort
of map in BlockPlacementPolicy from inode or file path to DataNode
address.
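If you wanted to experiment anyway, a very rough sketch of that idea
(against roughly the 2.6-era chooseTarget signature, which changes
between releases; the class name and the path map are hypothetical,
and this only biases the first replica, it does not pin all replicas):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    import org.apache.hadoop.hdfs.protocol.BlockStoragePolicy;
    import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault;
    import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo;
    import org.apache.hadoop.net.Node;

    public class PinnedPlacementPolicy extends BlockPlacementPolicyDefault {

      // Hypothetical path-prefix -> DataNode map; in a real attempt it
      // would be populated from configuration in initialize().
      private final Map<String, Node> pathToNode =
          new HashMap<String, Node>();

      @Override
      public DatanodeStorageInfo[] chooseTarget(String srcPath,
          int numOfReplicas, Node writer,
          List<DatanodeStorageInfo> chosen, boolean returnChosenNodes,
          Set<Node> excludedNodes, long blocksize,
          BlockStoragePolicy storagePolicy) {
        Node pinned = lookup(srcPath);
        if (pinned != null) {
          // Pass the pinned node as the "writer" so the default policy
          // favors it for the first replica; the other replicas still
          // spread out, which is exactly the part that is hard to pin.
          writer = pinned;
        }
        return super.chooseTarget(srcPath, numOfReplicas, writer, chosen,
            returnChosenNodes, excludedNodes, blocksize, storagePolicy);
      }

      private Node lookup(String srcPath) {
        for (Map.Entry<String, Node> e : pathToNode.entrySet()) {
          if (srcPath.startsWith(e.getKey())) {
            return e.getValue();
          }
        }
        return null;
      }
    }

You would point dfs.block.replicator.classname at the class, but even
then it only biases the first replica toward the pinned node; keeping
every replica of every block of a large file on fixed nodes works
against the NameNode's normal placement and balancing and, as noted
above, can simply exceed a single node's capacity.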

My 2 cents..

-- 
Zhe Zhang
Software Engineer, Cloudera
https://sites.google.com/site/zhezhangresearch/
