Hey, all,

I'm working on some improvements to init-hcfs.groovy in order to make
init-hcfs.json more maintainable and more configurable (i.e., only create
directories for the apps that are installed, if using the newly merged
roles feature), but before I get too far I'd like to run my ideas by the
community.

Here is how I'm thinking it will work:

1. In the site.yaml, you'll specify some new properties that look
   something like [1].
   - Alternatively, rather than having to specify the directories via
     site.yaml, maybe we can define a new resource type representing an
     HDFS directory and declare all required HDFS directories for each
     app in that app's own manifest file. (I think this is the spirit
     behind https://issues.apache.org/jira/browse/BIGTOP-1772, though I'm
     not sure.) But in order to do this we'd need some Puppet magic that
     can aggregate all of these resources and call init-hdfs only once,
     passing in all of them. Is this even possible? I know resource
     collectors can be used to order all of these "hdfs_dir" resources
     before the "init_hdfs" exec resource, but I don't know how you'd
     make the "init_hdfs" exec resource operate on all of the collected
     "hdfs_dir" resources, if I'm making any sense.
2. The hadoop::init_hdfs Puppet class will write out a file called
   /var/lib/hadoop-hdfs/init-hcfs.yaml that looks something like [2] (very
   similar to [1], of course).
   - Note that I think it might be best to use YAML instead of JSON,
     since YAML files are much easier to write out using a template. On
     the other hand, Groovy doesn't have built-in support for YAML like
     it has for JSON, so it might be more difficult to modify
     init-hcfs.groovy to consume this YAML file, or at least I'd need to
     add SnakeYAML as a dependency. Any thoughts on this?
   - If I do add a dependency on SnakeYAML, it would probably also be
     good to implement https://issues.apache.org/jira/browse/BIGTOP-1871
     along with this change so that init-hcfs.groovy can have its own
     install location rather than being part of the hadoop-hdfs package.
3. When calling init-hdfs.sh, the hadoop::init_hdfs Puppet class will
   pass it this newly generated /var/lib/hadoop-hdfs/init-hcfs.yaml file,
   which init-hdfs.sh will in turn pass through to init-hcfs.groovy.
4. Finally, init-hcfs.groovy will be changed to read from this YAML file
   in the format below instead of the hardcoded JSON file we have been
   using, assuming the community is OK with this file format change.
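
To make the aggregation idea from step 1 a bit more concrete, here is a rough sketch of the behaviour being asked for, written in Python only for brevity. It is not Puppet and not existing Bigtop code: the function name `aggregate_dirs`, the `declared_by_app` structure, and the assignment of directories to the "yarn" and "mapreduce" apps are all made up for illustration.

```python
# Hypothetical sketch: every name below is illustrative, not part of Bigtop.

def aggregate_dirs(declared_by_app, installed_apps):
    """Merge the HDFS directories declared by each installed app into one dict.

    Directories declared by apps that are not installed are skipped, and two
    apps declaring the same path with different attributes is an error.
    """
    merged = {}
    for app in installed_apps:
        for path, attrs in declared_by_app.get(app, {}).items():
            if path in merged and merged[path] != attrs:
                raise ValueError(f"conflicting declarations for {path}")
            merged[path] = dict(attrs)
    return merged


# Assumed per-app declarations (which app owns which path is a guess).
declared_by_app = {
    "yarn": {
        "/tmp/hadoop-yarn": {"perms": "777", "owner": "mapred", "group": "mapred"},
        "/var/log/hadoop/yarn/apps": {"perms": "1777", "owner": "yarn", "group": "mapred"},
    },
    "mapreduce": {
        "/user/history": {"perms": "755", "owner": "mapred", "group": "mapred"},
    },
}

# Only "yarn" is installed here, so /user/history is left out.
dirs = aggregate_dirs(declared_by_app, ["yarn"])
print(sorted(dirs))
```

The point is only that the merged result is computed once and handed to a single init-hdfs invocation, rather than running init-hdfs once per directory.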
Thanks,
Jonathan Kelly

[1]
hadoop::init_hdfs::hdfs_root_user: hdfs
hadoop::init_hdfs::dirs:
  /tmp:
    perms: 1777
  /user:
    perms: 755
    owner: ${hadoop::init_hdfs::hdfs_root_user}
  /user/root:
    perms: 777
    owner: root
  /var/log:
    perms: 1775
    owner: yarn
    group: mapred
  /tmp/hadoop-yarn:
    perms: 777
    owner: mapred
    group: mapred
  /var/log/hadoop/yarn/apps:
    perms: 1777
    owner: yarn
    group: mapred
  /user/history:
    perms: 755
    owner: mapred
    group: mapred
hadoop::init_hdfs::users:
  tom:
    perms: 755
  alice:
    perms: 755
  bigtop:
    perms: 755
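
As a rough illustration of step 2, the sketch below shows one way the properties in [1] could be turned into the text of the generated file described in [2]. It is written in Python purely as a stand-in for the Puppet template; `render_init_hcfs_yaml` is a hypothetical name, and the choices to sort directory paths, quote every scalar, and indent by two spaces are assumptions.

```python
# Hypothetical stand-in for the Puppet template; not existing Bigtop code.

def render_init_hcfs_yaml(root_user, dirs, users):
    """Render the root user, directories, and users as YAML text."""
    lines = ["---", f"root_user: '{root_user}'", "dirs:"]
    for path in sorted(dirs):  # assumed: paths are written in sorted order
        lines.append(f"  '{path}':")
        for key in ("perms", "owner", "group"):
            if key in dirs[path]:  # attributes that were not set are omitted
                lines.append(f"    '{key}': '{dirs[path][key]}'")
    lines.append("users:")
    for name, attrs in users.items():  # users keep their declared order
        lines.append(f"  '{name}':")
        lines.append(f"    'perms': '{attrs['perms']}'")
    return "\n".join(lines) + "\n"


root_user = "hdfs"
dirs = {
    "/user": {"perms": 755, "owner": root_user},  # owner already interpolated
    "/tmp": {"perms": 1777},
}
users = {"tom": {"perms": 755}}

text = render_init_hcfs_yaml(root_user, dirs, users)
print(text)
```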

[2]
---
root_user: 'hdfs'
dirs:
  '/tmp':
    'perms': '1777'
  '/tmp/hadoop-yarn':
    'perms': '777'
    'owner': 'mapred'
    'group': 'mapred'
  '/user':
    'perms': '755'
    'owner': 'hdfs'
  '/user/history':
    'perms': '755'
    'owner': 'mapred'
    'group': 'mapred'
  '/user/root':
    'perms': '777'
    'owner': 'root'
  '/var/log':
    'perms': '1775'
    'owner': 'yarn'
    'group': 'mapred'
  '/var/log/hadoop/yarn/apps':
    'perms': '1777'
    'owner': 'yarn'
    'group': 'mapred'
users:
  'tom':
    'perms': '755'
  'alice':
    'perms': '755'
  'bigtop':
    'perms': '755'
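
For step 4, a rough sketch of how a consumer might interpret a file shaped like [2] once it has been parsed into nested maps (in init-hcfs.groovy that parsing would presumably be done with SnakeYAML). The sketch is in Python only to keep it short and is not the actual init-hcfs.groovy logic. `DirOp` and `plan_operations` are hypothetical names, and three behaviours are assumptions rather than anything decided above: a directory without an owner falls back to root_user, each entry under users becomes /user/<name> owned by that user, and perms strings are read as octal.

```python
from typing import NamedTuple, Optional


class DirOp(NamedTuple):
    """One directory to create, with the mode and ownership to apply."""
    path: str
    mode: int
    owner: str
    group: Optional[str]


def plan_operations(config):
    """Turn the parsed config (nested dicts shaped like [2]) into DirOps."""
    root_user = config["root_user"]
    ops = []
    dirs = config.get("dirs", {})
    for path in sorted(dirs):  # sorted, so parents come before children
        attrs = dirs[path]
        ops.append(DirOp(
            path,
            int(str(attrs["perms"]), 8),     # assumed: perms are octal
            attrs.get("owner", root_user),   # assumed default owner
            attrs.get("group"),
        ))
    for name, attrs in config.get("users", {}).items():
        # Assumed: each user gets a home directory under /user.
        ops.append(DirOp(f"/user/{name}", int(str(attrs["perms"]), 8), name, None))
    return ops


# A small subset of [2], as it might look after parsing.
config = {
    "root_user": "hdfs",
    "dirs": {
        "/user/history": {"perms": "755", "owner": "mapred", "group": "mapred"},
        "/tmp": {"perms": "1777"},
    },
    "users": {"tom": {"perms": "755"}},
}

ops = plan_operations(config)
for op in ops:
    print(op.path, oct(op.mode), op.owner, op.group)
```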