[
https://issues.apache.org/jira/browse/BIGTOP-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Konstantin Boudnik updated BIGTOP-1177:
---------------------------------------
Description:
In the spirit of interoperability, can we work on modularizing the Bigtop puppet
recipes so that "hadoop_cluster_node" is not defined as an HDFS-specific class?
I'm not a puppet expert, but from testing on BIGTOP-1171 I'm starting to notice
that the HDFS dependency can make deployment a little complex (e.g. the init-hdfs
logic).
For those of us not necessarily dependent on HDFS, it is a cumbersome service
to maintain.
Here are two reasons why decoupling "hadoop_cluster_node" from HDFS is
beneficial:
- For HDFS users: in some use cases we might want to use Bigtop to provision
many nodes, only some of which are data nodes. For example: let's say our
cluster is crawling the web in mappers, doing some machine learning, and
distilling each large page into a small relational database tuple that
summarizes the "entities" on the page. In this case we don't necessarily
benefit much from data locality, because we might be CPU-bound rather than
network/IO-bound. So we might want to provision a cluster of 50 machines: 40
multicore, CPU-heavy ones and just 10 datanodes to support the DFS (see the
sketch after this list). I know this is an extreme case, but it's a good example.
- For NON-HDFS users: one important aspect of emerging Hadoop workflows is HCFS
(https://wiki.apache.org/hadoop/HCFS/) -- the idea that filesystems like S3,
OrangeFS, GlusterFileSystem, etc. are all just as capable as HDFS, although not
necessarily as optimal, of supporting YARN and Hadoop operations.
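As a rough illustration of the mixed-cluster case above, here is a minimal Puppet sketch (the node patterns and class names are hypothetical, not the existing Bigtop recipes) where only the storage nodes include a datanode class:
{code}
# Sketch only: the 10 storage nodes carry the DFS daemons;
# the 40 compute-heavy nodes include no HDFS classes at all.
node /compute-\d+\.example\.com/ {
  include hadoop::common        # shared config, no DFS assumptions
  include hadoop::nodemanager   # YARN worker only
}

node /data-\d+\.example\.com/ {
  include hadoop::common
  include hadoop::datanode      # DFS storage lives only here
}
{code}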
This JIRA might have to be done in phases, and might need some refinement since
I'm not a puppet expert. But here is what seems logical:
1) hadoop_cluster_node shouldn't necessarily know about *jobtrackers,
tasktrackers*, or any other non-essential YARN components.
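For example, something along these lines (a sketch only; the class names and factoring are made up, not the current Bigtop manifests):
{code}
# Sketch: keep the base class free of daemon choices and make them
# opt-in roles instead of baking them into hadoop_cluster_node.
class hadoop_cluster_node {
  include hadoop::common_config   # java, users, config templates, etc.
  # no jobtracker/tasktracker/resourcemanager here
}

class hadoop_worker_node {
  include hadoop_cluster_node
  include hadoop::nodemanager     # added only where YARN workers are wanted
}
{code}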
2) Since YARN does need a DFS of some sort to run on, hadoop_cluster_node will
need *definitions for that DFS*. Configuration properties such as the following
could be put into the puppet configuration and discovered that way (a sketch
follows the property list):
- fs.defaultFS
- fs.default.name
- fs......impl
- fs.AbstractFileSystem.....impl (e.g. org.apache.hadoop.fs.local....)
- hbase.rootdir
- mapreduce.jobtracker.staging.root.dir
- yarn.app.mapreduce.am.staging-dir
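As an illustration of what "discovered from the puppet configuration" could look like, here is a hedged sketch; the class, parameter, and template names are made up for the example:
{code}
# Sketch: the default filesystem URI becomes a class parameter (or a
# hiera lookup) instead of being hard-wired to hdfs://.
class hadoop::common_config (
  $default_fs = 'hdfs://namenode:8020',   # could be file:///, glusterfs:///, s3://, ...
) {
  file { '/etc/hadoop/conf/core-site.xml':
    # the ERB template would read the value (as @default_fs) into fs.defaultFS
    content => template('hadoop/core-site.xml.erb'),
  }
}

# An HCFS deployment would then just override the parameter:
#   class { 'hadoop::common_config': default_fs => 'glusterfs:///' }
{code}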
3) While we're at it: should the hadoop_cluster_node class even know about
*specific ecosystem components* (zookeeper, etc.)? Some tools, such as
zookeeper, don't even need Hadoop to run, so there is a lot of modularization
to be done there.
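For instance, a standalone zookeeper class could be declared with no Hadoop classes in sight (again just a sketch with made-up class and package names):
{code}
# Sketch: zookeeper stands alone; nothing here requires or notifies HDFS.
class zookeeper_server {
  package { 'zookeeper-server': ensure => installed }

  service { 'zookeeper-server':
    ensure  => running,
    enable  => true,
    require => Package['zookeeper-server'],
  }
}
{code}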
Maybe this can be done in phases, but again, a puppet expert will have to
weigh in on what is feasible and practical, and maybe on how to phase these
changes in an agile way. Any feedback is welcome - I realize this is a
significant undertaking... but it's important to democratize the Hadoop stack,
and Bigtop is the perfect place to do it!
> Puppet Recipes: Can we modularize them to foster HCFS initiatives?
> ------------------------------------------------------------------
>
> Key: BIGTOP-1177
> URL: https://issues.apache.org/jira/browse/BIGTOP-1177
> Project: Bigtop
> Issue Type: Improvement
> Components: Deployment
> Affects Versions: 0.7.0
> Reporter: jay vyas
> Fix For: backlog
>
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)