[
https://issues.apache.org/jira/browse/BIGTOP-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Konstantin Boudnik updated BIGTOP-1177:
---------------------------------------
Component/s: Deployment
> Puppet Recipes: Can we modularize them to foster HCFS initiatives?
> ------------------------------------------------------------------
>
> Key: BIGTOP-1177
> URL: https://issues.apache.org/jira/browse/BIGTOP-1177
> Project: Bigtop
> Issue Type: Improvement
> Components: Deployment
> Affects Versions: 0.7.0
> Reporter: jay vyas
> Fix For: backlog
>
>
> In the spirit of interoperability, can we work toward modularizing the Bigtop
> Puppet recipes so that "hadoop_cluster_node" is not defined as an HDFS-specific class?
>
> I'm not a Puppet expert, but from testing on
> https://issues.apache.org/jira/browse/BIGTOP-1171 I'm starting to notice that
> the HDFS dependency can make deployment a little complex (e.g. the init-hdfs
> logic).
> For those of us not necessarily dependent on HDFS, it is a cumbersome
> service to maintain.
> Here are two reasons why decoupling "hadoop_cluster_node" from HDFS is
> beneficial:
> - For HDFS users: in some use cases we might want to use Bigtop to provision
> many nodes, only some of which are datanodes. For example: say our cluster is
> crawling the web in mappers, doing some machine learning, and distilling large
> pages into small relational database tuples that summarize the "entities" in
> each page. In that case we don't necessarily benefit much from data locality,
> because we might be CPU- rather than network/IO-bound, so we might want to
> provision a cluster of 50 machines: 40 multicore, CPU-heavy ones and just 10
> datanodes to support the DFS (see the Puppet sketch after this list). I know
> this is an extreme case, but it's a good example.
> - For non-HDFS users: one important aspect of emerging Hadoop workflows is
> HCFS (https://wiki.apache.org/hadoop/HCFS/), the idea that filesystems like
> S3, OrangeFS, GlusterFileSystem, etc. are all just as capable as HDFS,
> although not necessarily optimal, of supporting YARN and Hadoop operations.
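> To make the 40/10 split above concrete, here is a minimal sketch of
> role-based node classification. The class names and node patterns are
> hypothetical, not the current Bigtop layout:
> {code}
> # Hypothetical roles: only "datanode" hosts pull in the HDFS daemon;
> # CPU-heavy workers get YARN bits plus client configuration and nothing else.
> node /^worker-\d+$/ {
>   include hadoop::common            # client config only, no HDFS daemons
>   include hadoop::yarn::nodemanager
> }
> node /^datanode-\d+$/ {
>   include hadoop::common
>   include hadoop::hdfs::datanode
>   include hadoop::yarn::nodemanager
> }
> {code}
> With an HCFS backend the datanode role could disappear entirely, and
> fs.defaultFS would simply point at the other filesystem.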
> This JIRA might have to be done in phases, and might need some refinement
> since I'm not a Puppet expert, but here is what seems logical (a rough
> Puppet sketch follows the list):
> 1) hadoop_cluster_node shouldn't necessarily know about *jobtrackers,
> tasktrackers*, or any other non-essential YARN components.
> 2) Since YARN does need a DFS of some sort to run on, hadoop_cluster_node
> will need *definitions for that DFS*. These configuration properties could be
> put into the Puppet configuration and discovered that way:
> - fs.defaultFS (and its deprecated alias fs.default.name)
> - fs.<scheme>.impl
> - fs.AbstractFileSystem.<scheme>.impl (e.g. org.apache.hadoop.fs.local.LocalFs)
> - hbase.rootdir
> - mapreduce.jobtracker.staging.root.dir
> - yarn.app.mapreduce.am.staging-dir
>
> 3) While we're at it: should the hadoop_cluster_node class even know about
> *specific ecosystem components* (ZooKeeper, etc.)? Some tools, such as
> ZooKeeper, don't even need Hadoop to run, so there is a lot of modularization
> to be done there (see the sketch below).
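> As a rough sketch of (1)-(3), assuming hypothetical class names and a
> parameterized filesystem URI (none of this is the existing Bigtop module
> layout; only the package names match Bigtop's repos):
> {code}
> # The base class is filesystem-agnostic: the DFS is a parameter,
> # not a hardcoded HDFS assumption.
> class hadoop::common (
>   $fs_default_fs = 'hdfs://namenode:8020'  # or glusterfs://..., s3://..., etc.
> ) {
>   # core-site.xml is rendered from the parameter; the template path
>   # is illustrative.
>   file { '/etc/hadoop/conf/core-site.xml':
>     content => template('hadoop/core-site.xml.erb'),
>   }
> }
>
> # HDFS daemons live in their own class, pulled in only where needed.
> class hadoop::hdfs::datanode {
>   include hadoop::common
>   package { 'hadoop-hdfs-datanode': ensure => installed }
>   service { 'hadoop-hdfs-datanode':
>     ensure  => running,
>     require => Package['hadoop-hdfs-datanode'],
>   }
> }
>
> # YARN depends on *a* filesystem via hadoop::common, not on HDFS itself.
> class hadoop::yarn::nodemanager {
>   include hadoop::common
>   package { 'hadoop-yarn-nodemanager': ensure => installed }
>   service { 'hadoop-yarn-nodemanager':
>     ensure  => running,
>     require => Package['hadoop-yarn-nodemanager'],
>   }
> }
>
> # ZooKeeper needs no Hadoop at all, so it gets a fully standalone class.
> class zookeeper::server {
>   package { 'zookeeper-server': ensure => installed }
>   service { 'zookeeper-server':
>     ensure  => running,
>     require => Package['zookeeper-server'],
>   }
> }
> {code}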
> Maybe this can be done in phases, but again, a Puppet expert will have to
> weigh in on what's feasible and practical, and maybe on how to phase these
> changes in an agile way. Any feedback is welcome - I realize this is a
> significant undertaking... but it's important to democratize the Hadoop
> stack, and Bigtop is the perfect place to do it!
--
This message was sent by Atlassian JIRA
(v6.2#6252)