[ 
https://issues.apache.org/jira/browse/BIGTOP-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jay vyas updated BIGTOP-1177:
-----------------------------

    Description: 
In the spirit of interoperability, can we work on modularizing the Bigtop puppet 
recipes so that "hadoop_cluster_node" is not defined as an HDFS-specific class?

I'm not a puppet expert but, from testing on 
https://issues.apache.org/jira/browse/BIGTOP-1171, I'm starting to notice that 
the HDFS dependency can make deployment a little complex (e.g. the init-hdfs 
logic, etc.).

For those of us not necessarily dependent on HDFS, this is a cumbersome service 
to maintain.

Here are two reasons why decoupling "hadoop_cluster_node" from HDFS is 
beneficial:  

- For HDFS users: In some use cases we might want to use Bigtop to provision 
many nodes, only some of which are datanodes. For example: let's say our 
cluster is crawling the web in mappers, doing some machine learning, and 
distilling large pages into small relational database tuples that summarize 
the "entities" in each page. In this case we don't necessarily benefit much 
from data locality, because we might be CPU-bound rather than network/IO-bound. 
So we might want to provision a cluster of 50 machines: 40 CPU-heavy multicore 
ones and just 10 datanodes to support the DFS. I know this is an extreme case, 
but it's a good example.

- For non-HDFS users: One important aspect of emerging Hadoop workflows is 
HCFS (https://wiki.apache.org/hadoop/HCFS/) -- the idea that filesystems like 
S3, OrangeFS, GlusterFileSystem, etc. are all just as capable as HDFS, 
although not necessarily optimal, of supporting YARN and Hadoop operations.

This JIRA might have to be done in phases, and might need some refinement since 
I'm not a puppet expert. But here is what seems logical:

- hadoop_cluster_node shouldn't necessarily know about jobtrackers, 
tasktrackers, or any other non-essential YARN components.

- hadoop_cluster_node shouldn't be hardcoded to deal with HDFS. Filesystem 
configuration properties (fs.defaultFS, fs.default.name, etc.) could be put 
into the puppet configurations and discovered that way, for example:
   - fs.defaultFS
   - fs.default.name
   - fs.AbstractFileSystem.....impl (e.g. org.apache.hadoop.fs.local....)
   - fs......impl
   - hbase.rootdir
   - mapreduce.jobtracker.staging.root.dir
   - yarn.app.mapreduce.am.staging-dir
 
- While we're at it: should the hadoop_cluster_node class even know about 
specific ecosystem components (zookeeper, etc.)? Some tools, such as 
zookeeper, don't even need Hadoop to run, so there is a lot of modularization 
to be done there.
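To make the filesystem bullet concrete, here is a minimal sketch of what 
"discovered from configuration" could look like. This is illustrative Python 
pseudologic, not actual Bigtop puppet code; the function name and the staging 
path default are assumptions:

```python
# Hypothetical helper: derive filesystem-related Hadoop properties from one
# configurable default-filesystem URI, instead of hardcoding hdfs://.
# Names and defaults here are illustrative, not actual Bigtop puppet code.

def filesystem_properties(default_fs, staging_root="/tmp/hadoop-yarn/staging"):
    """Properties a recipe could template into core-site.xml /
    mapred-site.xml for any HCFS, not just HDFS."""
    return {
        "fs.defaultFS": default_fs,     # current property name
        "fs.default.name": default_fs,  # deprecated alias, kept for compat
        "mapreduce.jobtracker.staging.root.dir": staging_root,
        "yarn.app.mapreduce.am.staging-dir": staging_root,
    }

# The same recipe then works for HDFS or any other HCFS:
hdfs_props = filesystem_properties("hdfs://namenode:8020")
hcfs_props = filesystem_properties("glusterfs:///mnt/glusterfs")
```

A puppet equivalent would pull the default filesystem URI from its 
configuration (e.g. extlookup/hiera) rather than baking hdfs:// into the 
class body.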
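On the modularization point, the idea could be sketched as role-based class 
composition: each node gets only the classes its roles require, so zookeeper 
nodes need no HDFS (or Hadoop) classes at all. The role and class names below 
are hypothetical, not actual Bigtop recipe names:

```python
# Hypothetical role -> puppet-class mapping; names are illustrative only.
ROLE_CLASSES = {
    "compute":   ["hadoop::common", "hadoop::nodemanager"],
    "datanode":  ["hadoop::common", "hadoop::datanode"],
    "zookeeper": ["zookeeper::server"],  # standalone: no hadoop classes
}

def classes_for(roles):
    """Order-preserving union of the classes a node's roles require."""
    seen, out = set(), []
    for role in roles:
        for cls in ROLE_CLASSES[role]:
            if cls not in seen:
                seen.add(cls)
                out.append(cls)
    return out

# The 50-machine example above: 40 compute-only nodes plus 10 nodes that
# also carry the datanode role.
compute_node = classes_for(["compute"])
storage_node = classes_for(["compute", "datanode"])
```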








> Puppet Recipes: Can we modularize them to foster HCFS initiatives?
> ------------------------------------------------------------------
>
>                 Key: BIGTOP-1177
>                 URL: https://issues.apache.org/jira/browse/BIGTOP-1177
>             Project: Bigtop
>          Issue Type: Improvement
>            Reporter: jay vyas
>



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
