-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36122/
-----------------------------------------------------------

(Updated July 2, 2015, 5:07 a.m.)


Review request for Ambari, Jonathan Hurley, Nate Cole, Sumit Mohanty, Srimanth 
Gunturi, Sid Wagle, and Yusaku Sako.


Bugs: AMBARI-12252
    https://issues.apache.org/jira/browse/AMBARI-12252


Repository: ambari


Description (updated)
-------

Ambari maintains a file, /etc/hadoop/conf/dfs_data_dir_mount.hist, 
that maps each HDFS data dir to its last known mount point.

This is used to detect when a data dir becomes unmounted, in order to prevent 
HDFS from writing to the root partition.

Consider the example of a DataNode configured with these volumes: 

/dev/sda -> / 
/dev/sdb -> /grid/0
/dev/sdc -> /grid/1
/dev/sdd -> /grid/2

Typically, each /grid/#/ directory contains a data folder.
Today, if the volume holding a data directory becomes unmounted, the directory 
no longer exists and Ambari does not create it automatically. Ambari simply 
logs a warning and updates its cache with the new mount point, which is /; 
that is the underlying bug.
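
To illustrate why the recorded mount point becomes /, here is a minimal 
sketch of resolving a data dir to its mount point by walking up the path. It 
is not the code in this patch: get_mount_point and fake_ismount are 
hypothetical names, and the fake mount table stands in for the volumes listed 
above so the example runs on any POSIX host.

```python
import os


def get_mount_point(path, ismount=os.path.ismount):
    """Walk up from path until a mount point is reached and return it."""
    path = os.path.abspath(path)
    while not ismount(path):
        parent = os.path.dirname(path)
        if parent == path:
            # Reached the filesystem root without finding a mount point.
            break
        path = parent
    return path


def fake_ismount(mounted):
    """Hypothetical stand-in for os.path.ismount backed by a fixed set."""
    return lambda p: p in mounted


# All volumes mounted, as in the example above.
healthy = fake_ismount({"/", "/grid/0", "/grid/1", "/grid/2"})
# /dev/sdc has become unmounted, so /grid/1 is no longer a mount point.
degraded = fake_ismount({"/", "/grid/0", "/grid/2"})

print(get_mount_point("/grid/1/data", healthy))   # /grid/1
print(get_mount_point("/grid/1/data", degraded))  # /
```

With the real os.path.ismount as the default, the same function reports the 
mount point as seen by the running host.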

If hdfs-site contains dfs.datanode.failed.volumes.tolerated with a value > 0, 
then the DataNode will tolerate the failure; otherwise, the DataNode will die.

Because Ambari will already have "/" in its cache file, the fact that the data 
dir used to be mounted on a non-root drive is lost, so the next time DataNode 
is restarted, Ambari will create the data dir, which now lands on the root 
partition. This is really bad because HDFS will then fill up the root drive.
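
The behavior this description argues for can be sketched as a small decision 
function. This is an assumption-laden illustration, not the logic in the diff: 
reconcile is a hypothetical name, and it assumes the caller already knows the 
last recorded mount point (None if the data dir has never been seen) and the 
current one.

```python
def reconcile(data_dir, last_mount, current_mount):
    """Decide what to do with one data dir.

    Returns (create_dir, mount_to_record):
      create_dir       -- whether it is safe to create the directory
      mount_to_record  -- the mount point to keep in the history file
    """
    if last_mount is None:
        # First time this data dir is seen: accept whatever mount it is on.
        return True, current_mount
    if last_mount != "/" and current_mount == "/":
        # The dir used to live on a dedicated volume and now resolves to the
        # root partition. Do not create it, and keep the old history entry so
        # the fact that it was on a non-root mount is not lost.
        return False, last_mount
    # Unchanged, or moved to another non-root mount: create and record it.
    return True, current_mount


print(reconcile("/grid/1/data", "/grid/1", "/"))        # (False, '/grid/1')
print(reconcile("/grid/1/data", "/grid/1", "/grid/1"))  # (True, '/grid/1')
print(reconcile("/grid/1/data", None, "/grid/1"))       # (True, '/grid/1')
```

Keeping the old entry in the first case means that once the admin remounts 
the partition, the next check sees a matching mount and proceeds normally.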

The admin can still remount the partition, but then needs to restart DataNode 
so Ambari can update its cache.

The ideal way to fix this in Ambari 2.2 is as follows:
- Track which data dirs the admin wants mounted on a non-root partition. If the 
admin wishes all data dirs to be on non-root mounts, but the initial install is 
incorrect, then this should be reported as a problem. 
- Keep the history of the mount points in the database. Today, if the cache 
file is deleted or the host reimaged, then this information is lost.
- Introduce a new state between FAILED and COMPLETED, such as 
COMPLETED_WITH_ERRORS, that will allow tasks to look different in the UI, so 
the user can clearly detect when a critical but non-fatal error happened.
- Integrate with the Alert Framework.


Diffs
-----

  ambari-agent/src/test/python/resource_management/TestDatanodeHelper.py PRE-CREATION 
  ambari-agent/src/test/python/resource_management/TestFileSystem.py 91fd71d 
  ambari-agent/src/test/python/unitTests.py b6f8411 
  ambari-common/src/main/python/resource_management/libraries/functions/dfs_datanode_helper.py ad5a984 

Diff: https://reviews.apache.org/r/36122/diff/


Testing
-------

Ran python unit tests in ambari-agent
----------------------------------------------------------------------
Ran 403 tests in 16.819s


Verified that this worked on a host with multiple mounted volumes.


Thanks,

Alejandro Fernandez
