[jira] Commented: (HADOOP-3601) Hive as a contrib project

eric baldeschwieler (JIRA) Mon, 21 Jul 2008 22:39:33 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615525#action_12615525
 ]


eric baldeschwieler commented on HADOOP-3601:
---------------------------------------------

Hi Folks,

What follows are some thoughts on the general practice in Hadoop of adding big 
projects like Hive to core/contrib.  I don't think this is a scalable way 
forward, and I would like to use this submission as an opportunity to discuss 
the general challenges involved in welcoming new projects into the Hadoop 
family.

We've now seen 3 Hadoop projects take different courses with different results:

1) HBASE - This went into contrib.   It sat there for a number of months in 
active development before becoming a subproject.  ADVANTAGES: Good publicity 
for project.  DISADVANTAGES: Since it was very active, it frequently broke the 
hadoop core build and became a significant fraction of hadoop-dev message 
traffic.  This was somewhat disruptive to core development.  IMO this does not 
scale.  If we had several such projects running at once in core/contrib they 
would drown out the main dev community.

2) Pig - Pig went directly into the Apache Incubator and has ambitions to 
graduate to a Hadoop sub-project.  ADVANTAGES:  Low overhead to the Hadoop 
community, lots of training for the committers on the Apache way.  
DISADVANTAGES: Less visible than HBASE, high upfront investment in project 
setup, review, committer training, approval, ...  

3) ZooKeeper - It was shared by its developers outside of Apache under the BSD 
and then Apache licenses, first as a posting on the Yahoo Research website and 
then as a SourceForge project.  ADVANTAGES: Super low cost to start, fewer 
restrictions to share code than incubation, ...  DISADVANTAGES: Less visible 
than HBASE.

----

From these experiences, I think checking in major projects that build on 
Hadoop into core/contrib is not the most productive way to host them.  If they 
are active, they can be very disruptive during their formation.  A project 
should have its own email lists, tests, branches, etc., independent of Hadoop 
mainline.

The main advantage of putting projects in core seems to be to increase their 
visibility to the Hadoop community.  I'd suggest we discuss other mechanisms.  
Long term, I hope to see something like cpan.org emerge for Hadoop.  But short 
term, we have not identified an entity to host such a site.

Absent that, I'd suggest a project like Hive take either the path ZooKeeper 
took or the one Pig took.  As a community we could take some simple steps to 
address the shortcomings of these approaches.  An obvious step would be to 
invest in a well-linked Wiki section that provides a directory of such projects.  

What do folks think of this?  Other thoughts? Suggestions?

E14

> Hive as a contrib project
> -------------------------
>
>                 Key: HADOOP-3601
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3601
>             Project: Hadoop Core
>          Issue Type: New Feature
>    Affects Versions: 0.17.0
>            Reporter: Joydeep Sen Sarma
>            Priority: Minor
>         Attachments: HiveTutorial.pdf
>
>   Original Estimate: 1080h
>  Remaining Estimate: 1080h
>
> Hive is a data warehouse built on top of flat files (stored primarily in 
> HDFS). It includes:
> - Data Organization into Tables with logical and hash partitioning
> - A Metastore to store metadata about Tables/Partitions etc
> - A SQL like query language over object data stored in Tables
> - DDL commands to define and load external data into tables
> Hive's query language is executed using Hadoop map-reduce as the execution 
> engine. Queries can use either single stage or multi-stage map-reduce. Hive 
> has a native format for tables - but can handle any data set (for example 
> json/thrift/xml) using an IO library framework.
> Hive uses Antlr for query parsing, Apache JEXL for expression evaluation and 
> may use Apache Derby as an embedded database for MetaStore. Antlr has a BSD 
> license and should be compatible with Apache license.
> We are currently thinking of contributing to the 0.17 branch as a contrib 
> project (since that is the version under which it will get tested internally) 
> - but looking for advice on the best release path.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
