[
https://issues.apache.org/jira/browse/BIGTOP-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
jay vyas updated BIGTOP-1177:
-----------------------------
Description:
In the spirit of interoperability, can we work on modularizing the Bigtop puppet
recipes so that "hadoop_cluster_node" is not defined as an HDFS-specific class?
I'm not a puppet expert, but from testing on
https://issues.apache.org/jira/browse/BIGTOP-1171 I'm starting to see that the
HDFS dependency can make deployment a little complex (e.g. the init-hdfs logic).
For those of us not necessarily dependent on HDFS, this is a cumbersome service
to maintain.
Here are two reasons why decoupling "hadoop_cluster_node" from HDFS is
beneficial:
- For HDFS users: In some use cases we might want to use Bigtop to provision
many nodes, only some of which are datanodes. For example: let's say our
cluster is crawling the web in mappers, doing some machine learning, and
distilling each large page into a small relational database tuple that
summarizes the "entities" in the page. In this case we don't necessarily
benefit much from data locality, because we might be CPU-bound rather than
network/IO-bound. So we might want to provision a cluster of 50 machines:
40 multicore, CPU-heavy ones and just 10 datanodes to support the DFS. I know
this is an extreme case, but it's a good example.
- For non-HDFS users: One important aspect of emerging Hadoop workflows is HCFS
(https://wiki.apache.org/hadoop/HCFS/) -- the idea that filesystems like S3,
OrangeFS, GlusterFileSystem, etc. are all just as capable as HDFS, although not
necessarily as optimal, of supporting YARN and Hadoop operations.
This JIRA might have to be done in phases, and might need some refinement since
I'm not a puppet expert. But here is what seems logical:
- hadoop_cluster_node shouldn't necessarily know about jobtrackers,
tasktrackers, or any other non-essential MapReduce/YARN components.
- hadoop_cluster_node shouldn't be hardcoded to deal with HDFS. Filesystem
configuration properties (fs.defaultFS, fs.default.name, etc.) could be put
into the puppet configurations and discovered that way (see the first sketch
after this list), for example:
- fs.defaultFS
- fs.default.name
- fs.AbstractFileSystem.....impl, org.apache.hadoop.fs.local....
- fs.defaultFS
- hbase.rootdir
- fs......impl
- fs.default.name
- fs.defaultFS
- fs.AbstractFileSystem.....impl
- mapreduce.jobtracker.staging.root.dir
- yarn.app.mapreduce.am.staging-dir
- While we're at it: should the hadoop_cluster_node class even know about
specific ecosystem components (zookeeper, etc.)? Some tools, such as zookeeper,
don't even need Hadoop to run, so there is a lot of modularization to be done
there (see the second sketch below).
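As a rough illustration of the filesystem point above, here is a minimal sketch
of what a filesystem-agnostic common class could look like. The class,
parameter, and lookup names are hypothetical and do not match the current
Bigtop manifests; the property names are real Hadoop keys, and the local
filesystem is only a stand-in for an arbitrary HCFS:
{code}
# Hypothetical sketch only -- names do not match the real Bigtop manifests.
# The point: the filesystem is site data, not something baked into the class.
class hadoop::common (
  # extlookup() is assumed as the site-config mechanism; any lookup would do.
  $default_fs       = extlookup('hadoop_default_fs', 'file:///'),
  $abstract_fs_key  = extlookup('hadoop_abstract_fs_key',
                                'fs.AbstractFileSystem.file.impl'),
  $abstract_fs_impl = extlookup('hadoop_abstract_fs_impl',
                                'org.apache.hadoop.fs.local.LocalFs'),
) {
  # core-site.xml (hypothetical template) is rendered from these values
  # instead of hard-coding fs.defaultFS to an hdfs://namenode URI.
  file { '/etc/hadoop/conf/core-site.xml':
    content => template('hadoop/core-site.xml.erb'),
  }
}
{code}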
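To make the modularization point concrete, a second (equally hypothetical)
sketch: hadoop_cluster_node could be split into thin roles that only include
the component classes a node actually needs, which also covers the
40-compute/10-datanode scenario above and lets zookeeper stand alone:
{code}
# Hypothetical sketch -- not the current Bigtop class layout.
class hadoop_compute_node {
  include hadoop::common        # shared config, no filesystem assumptions
  include hadoop::nodemanager   # compute only; no datanode on these boxes
}

class hadoop_storage_node {
  include hadoop::common
  include hadoop::datanode      # only the storage nodes carry HDFS bits
}

class zookeeper_node {
  include zookeeper::server     # zookeeper does not need hadoop at all
}
{code}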
> Puppet Recipes: Can we modularize them to foster HCFS initiatives?
> ------------------------------------------------------------------
>
> Key: BIGTOP-1177
> URL: https://issues.apache.org/jira/browse/BIGTOP-1177
> Project: Bigtop
> Issue Type: Improvement
> Reporter: jay vyas
>
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)