Thanks Steve for the feedback and thoughts. 

Looks like people don't want to move the related modules around, as it may not
add much real value. That's fine. I may offer better thoughts later, once I've
learned this area more deeply.

Regards,
Kai

-----Original Message-----
From: Steve Loughran [mailto:ste...@hortonworks.com] 
Sent: Wednesday, June 15, 2016 6:16 PM
To: Zheng, Kai <kai.zh...@intel.com>
Cc: common-dev@hadoop.apache.org
Subject: Re: A top container module like hadoop-cloud for cloud integration 
modules


> On 13 Jun 2016, at 14:02, Zheng, Kai <kai.zh...@intel.com> wrote:
> 
> Hi,
> 
> Noticed it's an obvious trend that Hadoop is supporting more and more cloud 
> platforms, so I suggest we have a top container module to hold such integration 
> modules, like the ones for AWS, OpenStack, Azure and the upcoming one for 
> Aliyun. The rationale is simple, besides the trend:


I'm kind of +0 right now.

> 
> 1.       Existing modules are mixed into hadoop-tools, which has become a little 
> big at 18 modules now. Cloud-specific ones could be grouped together and 
> separated out, which would make more sense;

the reason for having separate hadoop-aws, hadoop-openstack modules was always 
to permit the modules to use APIs exclusive to cloud infrastructures, structure 
the downstream dependencies, *and* allow people like the EMR team to swap in 
their own closed-source version. I don't think anyone does that though.

It also lets us completely isolate testing: each module's tests only run if you 
have the credentials.
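
The pattern is simple enough to sketch. Something like this (a sketch only;
the file and class names here are made up, not the actual hadoop-aws test
code):

  import static org.junit.Assume.assumeTrue;

  import java.io.InputStream;
  import java.util.Properties;
  import org.junit.Before;
  import org.junit.Test;

  // JUnit 4 test that is silently skipped, rather than failed, when no
  // credentials are supplied.
  public class TestCloudStoreRoundTrip {
    private final Properties creds = new Properties();

    @Before
    public void loadCredentials() throws Exception {
      // auth-keys.properties is deliberately kept out of SCM; without it,
      // every test in this class is marked as skipped.
      InputStream in = getClass().getResourceAsStream("/auth-keys.properties");
      assumeTrue("no cloud credentials found, skipping", in != null);
      creds.load(in);
    }

    @Test
    public void testRoundTrip() throws Exception {
      // ...create/read/delete against the live store using creds...
    }
  }

That way a plain "mvn test" stays green for everyone without accounts.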

> 
> 2.       Future abstraction and sharing of common specs and code could be 
> made easier, or become possible at all;

Right now hadoop-common is where cross-FS work and tests go. (Hint: reviewers 
for HADOOP-12807 needed.) I think we could start there with an 
org.apache.hadoop.cloud package and only split it out if compilation ordering 
merits it, or if it adds any dependencies to hadoop-common.
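
That is also where shared test code would live: an abstract cross-store test
in hadoop-common, with each connector supplying a subclass. Roughly this
shape (hypothetical names, not the real FS contract classes):

  import static org.junit.Assert.assertTrue;

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.junit.Test;

  // Abstract test in hadoop-common; hadoop-aws, hadoop-azure etc. each
  // provide a subclass binding their own FileSystem instance.
  public abstract class AbstractCloudFSTest {

    /** Each connector (s3a, wasb, swift, ...) supplies its filesystem. */
    protected abstract FileSystem getTestFileSystem() throws Exception;

    @Test
    public void testMkdirsVisible() throws Exception {
      FileSystem fs = getTestFileSystem();
      Path dir = new Path("/test/cloud-contract");
      assertTrue("mkdirs failed", fs.mkdirs(dir));
      assertTrue("not a directory", fs.getFileStatus(dir).isDirectory());
    }
  }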

> 
> 3.       A common testing approach could be defined together; for example, 
> the mechanisms discussed by Chris, Steve and Allen in HADOOP-12756;
> 


In SPARK-7481 I've added downstream tests for S3a and Azure in Spark; this 
shows that S3a in Hadoop 2.6 gets its blocksize wrong (0) in listings, so the 
splits come out as 1 byte each and the work dies. I think downstream tests in 
Spark, Hive, etc. would really round out cloud infra testing, but we can't put 
those into Hadoop as the build DAG prevents it. (Reviews for SPARK-7481 needed 
too, BTW.) System tests of the Aliyun and perhaps GFS connectors would need to 
go in there or in Bigtop, which is the other place I've discussed having cloud 
integration tests.
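
To spell out the blocksize failure, the split-size calculation (essentially
what FileInputFormat does) collapses to the minimum split size when a store
reports blocksize 0:

  // Roughly FileInputFormat's split-size computation.
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  // With the Hadoop 2.6 S3a listing bug, blockSize == 0; the defaults
  // minSize == 1 and maxSize == Long.MAX_VALUE then give
  //   Math.max(1, Math.min(Long.MAX_VALUE, 0)) == 1
  // i.e. one split per byte of input, and the job grinds to a halt.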


> 4.       Documentation for "Hadoop on Cloud"? Not sure it's needed, as we 
> already have a section for "Hadoop compatible File Systems".

Again, we can stick this in common.

> 
> If it sounds good, the change would be a good fit for Hadoop 3.0, even though 
> it should not have a big impact, as it can avoid affecting the artifacts. It 
> may cause some inconvenience for current development efforts, though.
> 


I think it would make sense if other features went in. A good committer for 
object stores would be an example here: it depends on the MR libraries, so it 
can't go into common. Today it'd have to go into hadoop-mapreduce. This isn't 
too bad, as long as the APIs it uses are all in hadoop-common. It's only as 
things get more complex that it matters.
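
The dependency problem is visible in the class signature alone; any such
committer has to extend MR's OutputCommitter. Sketch only, with the class
name and behaviour purely illustrative:

  import java.io.IOException;

  import org.apache.hadoop.mapreduce.JobContext;
  import org.apache.hadoop.mapreduce.OutputCommitter;
  import org.apache.hadoop.mapreduce.TaskAttemptContext;

  // Extending OutputCommitter pulls in the MR client libraries, which is
  // exactly why this can't live in hadoop-common.
  public class ObjectStoreCommitter extends OutputCommitter {
    @Override public void setupJob(JobContext ctx) throws IOException { }
    @Override public void setupTask(TaskAttemptContext ctx) throws IOException { }
    @Override public boolean needsTaskCommit(TaskAttemptContext ctx)
        throws IOException { return true; }
    @Override public void commitTask(TaskAttemptContext ctx) throws IOException {
      // e.g. complete a multipart upload rather than rename a directory
    }
    @Override public void abortTask(TaskAttemptContext ctx) throws IOException { }
  }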


