[ 
https://issues.apache.org/jira/browse/BIGTOP-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jay vyas updated BIGTOP-1177:
-----------------------------

    Description: 
In the spirit of interoperability, can we work on modularizing the Bigtop 
Puppet recipes so that "hadoop_cluster_node" is not defined as an HDFS-specific 
class?

I'm not a Puppet expert, but from testing on 
https://issues.apache.org/jira/browse/BIGTOP-1171 I'm starting to notice that 
the HDFS dependency can make deployment a little complex (e.g. the init-hdfs 
logic).

For those of us not necessarily dependent on HDFS, it is a cumbersome service 
to maintain.

Here are two reasons why decoupling "hadoop_cluster_node" from HDFS is 
beneficial:  

- For HDFS users: in some use cases we might want to use Bigtop to provision 
many nodes, only some of which are DataNodes. For example, say our cluster is 
crawling the web in mappers, doing some machine learning, and distilling large 
pages into small relational tuples that summarize the "entities" on each page. 
In that case we don't benefit much from data locality, because we are CPU-bound 
rather than network/IO-bound, so we might want to provision a cluster of 50 
machines: 40 CPU-heavy multicore nodes and just 10 DataNodes to back the DFS. 
That is an extreme case, but it makes the point.

- For non-HDFS users: one important aspect of emerging Hadoop workflows is HCFS 
(https://wiki.apache.org/hadoop/HCFS/), the idea that filesystems such as S3, 
OrangeFS, and GlusterFileSystem are just as capable as HDFS of supporting YARN 
and Hadoop operations, even if not always optimal; a small sketch of what that 
could look like follows.
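
To make that concrete, here is a minimal, purely illustrative Puppet sketch 
(the class name and the default_fs parameter are assumptions, not the current 
Bigtop layout): if the default filesystem URI were just a parameter, the same 
recipe could target HDFS or any other HCFS backend without touching the class 
internals.

{code}
# Hypothetical class/parameter names; only the URI distinguishes the backends.
class { 'hadoop::common':
  default_fs => 'glusterfs:///',  # or 'hdfs://namenode:8020', 's3n://bucket/', ...
}
{code}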

This JIRA might have to be done in phases, and it will need some refinement 
since I'm not a Puppet expert, but here is what seems logical:

1) hadoop_cluster_node shouldn't necessarily know about *JobTrackers, 
TaskTrackers*, or any other non-essential MapReduce/YARN components.
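
A minimal sketch of that split, using illustrative class names rather than the 
current Bigtop manifests: the base class carries only shared configuration, and 
compute daemons become opt-in per role.

{code}
# Illustrative only: role classes pull in daemons explicitly instead of
# hadoop_cluster_node implying them.
class hadoop_cluster_node {
  include hadoop::common          # shared config; no HDFS, MR, or YARN daemons
}

class hadoop_compute_node {
  include hadoop_cluster_node
  include hadoop::nodemanager     # YARN compute, added explicitly
  include hadoop::tasktracker     # MR1, only where a site actually wants it
}
{code}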

2) Since YARN does need a DFS of some sort to run on, hadoop_cluster_node will 
need *definitions for that DFS*. Configuration properties such as the following 
could be put into the Puppet configuration and discovered that way (a sketch of 
how they might be surfaced as parameters follows the list):
   - fs.defaultFS (and the legacy fs.default.name)
   - fs.AbstractFileSystem.....impl (e.g. org.apache.hadoop.fs.local....)
   - fs......impl
   - hbase.rootdir
   - mapreduce.jobtracker.staging.root.dir
   - yarn.app.mapreduce.am.staging-dir
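
Here is a hedged sketch of what "discovering them from the Puppet 
configuration" could look like; the class name, parameter names, and template 
path are assumptions, not existing Bigtop code:

{code}
# Hypothetical parameterized class: the DFS is described entirely by data,
# so swapping implementations means changing parameters, not manifests.
class hadoop::dfs_client (
  $default_fs          = 'hdfs://localhost:8020',  # fs.defaultFS / fs.default.name
  $abstract_fs_impl    = undef,                    # fs.AbstractFileSystem.<scheme>.impl
  $mapred_staging_root = undef,                    # mapreduce.jobtracker.staging.root.dir
  $yarn_am_staging_dir = undef                     # yarn.app.mapreduce.am.staging-dir
) {
  # Render the values into core-site.xml; the ERB template is assumed to live
  # in the same (hypothetical) module.
  file { '/etc/hadoop/conf/core-site.xml':
    content => template('hadoop/core-site.xml.erb'),
  }
}
{code}

Whether the values come from extlookup, Hiera, or plain class parameters is 
exactly the kind of detail a Puppet expert should weigh in on.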
 
3) While we're at it: should the hadoop_cluster_node class even know about 
*specific ecosystem components* (ZooKeeper, etc.)? Some tools, such as 
ZooKeeper, don't even need Hadoop to run, so there is a lot of modularization 
to be done there.
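
For example, a coordination-only node could look like the sketch below (the 
class name is assumed from the Bigtop module layout and may not match it 
exactly):

{code}
# A node that runs only ZooKeeper, with no Hadoop/HDFS classes at all.
node 'zk1.example.com' {
  include hadoop_zookeeper::server
}
{code}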

Maybe this can be done in phases, but again, a Puppet expert will have to weigh 
in on what's feasible and practical, and on how to phase these changes in an 
agile way. Any feedback is welcome; I realize this is a significant 
undertaking, but it's important to democratize the Hadoop stack, and Bigtop is 
the perfect place to do it!


> Puppet Recipes: Can we modularize them to foster HCFS initiatives?
> ------------------------------------------------------------------
>
>                 Key: BIGTOP-1177
>                 URL: https://issues.apache.org/jira/browse/BIGTOP-1177
>             Project: Bigtop
>          Issue Type: Improvement
>            Reporter: jay vyas
>



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
