Re: [VOTE] Accept Tajo into the Apache Incubator

2013-03-07 Thread Owen O'Malley
With 11 binding +1's and 5 non-binding +1's the vote passes. Jakob, can you
start the process for setting up the project?

Thanks, all.

-- Owen


On Mon, Mar 4, 2013 at 4:35 PM, Alex Karasulu akaras...@apache.org wrote:

 On Sat, Mar 2, 2013 at 3:48 AM, Alex Karasulu akaras...@apache.org
 wrote:

  +1 (binding)
 
 
 Just as an FYI, I'm also a mentor of this project.

 --
 Best Regards,
 -- Alex



Re: [VOTE] Accept Tajo into the Apache Incubator

2013-03-07 Thread Jakob Homan
Will do.  This is going to be fun.


 



Re: [VOTE] Accept Tajo into the Apache Incubator

2013-03-07 Thread edward yoon

Congratz guys!




--
Best Regards, Edward J. Yoon
@eddieyoon


-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



Re: [VOTE] Accept Tajo into the Apache Incubator

2013-03-07 Thread Alex Karasulu
Indeed congratulations everyone!






-- 
Best Regards,
-- Alex


Re: [VOTE] Accept Tajo into the Apache Incubator

2013-03-04 Thread Mattmann, Chris A (388J)
+1 (binding) from me.

Cheers,
Chris


On 2/28/13 10:11 AM, Hyunsik Choi hyun...@apache.org wrote:

Hi Folks,

I'd like to call a VOTE for acceptance of Tajo into the Apache incubator.
The vote will close on Mar 7 at 6:00 PM (PST).

[] +1 Accept Tajo into the Apache incubator
[] +0 Don't care.
[] -1 Don't accept Tajo into the incubator because...

The full proposal is pasted at the bottom of this email, and the
corresponding wiki page is http://wiki.apache.org/incubator/TajoProposal.

Only VOTEs from Incubator PMC members are binding, but all are welcome to
express their thoughts.

Thanks,
Hyunsik

PS: Since the initial discussion, the main change is that I've added 4 new
committers. I've also revised part of the Known Risks description because
the initial committers are now a diverse group.


Tajo Proposal

= Abstract =

Tajo is a distributed data warehouse system for Hadoop.


= Proposal =

Tajo is a relational and distributed data warehouse system for Hadoop. Tajo
is designed for low-latency and scalable ad-hoc queries, online aggregation,
and ETL on large data sets by leveraging advanced database techniques. It
supports SQL standards. Tajo is inspired by Dryad, MapReduce, Dremel,
Scope, and parallel databases. Tajo uses HDFS as a primary storage layer,
and it has its own query engine which allows direct control of distributed
execution and data flow. As a result, Tajo has a variety of query
evaluation strategies and more optimization opportunities. In addition,
Tajo will have native columnar execution and an optimizer for it. Tajo will
be an alternative to Hive/Pig on top of MapReduce.


= Background =

Big data analysis has gained much attention in industry. Open source
communities have proposed scalable and distributed solutions for ad-hoc
queries on big data. However, there is still room for improvement. Markets
need faster and more efficient solutions. Recently, some alternatives
(e.g., Cloudera's Impala and Amazon Redshift) have come out.


= Rationale =

There are a variety of open source distributed execution engines (e.g.,
Hive and Pig) running on top of MapReduce. They are limited by the MR
framework: they cannot directly control distributed execution and data
flow, because they simply use the MR framework. As a result, they have
limited query evaluation strategies and optimization opportunities, and it
is hard to optimize them for a particular type of data processing.


= Initial Goals =

The initial goal is to write more documents describing Tajo's internals.
This will help to recruit more committers and to build a solid community.
Then, we will set milestones for short- and long-term plans.


= Current Status =

Tajo is in the alpha stage. Users can execute the usual SQL queries (e.g.,
selection, projection, group-by, join, union and sort) except for nested
queries. Tajo provides various row/column storage formats, such as CSV,
RowFile (a row-store file we have implemented), RCFile, and Trevni, and it
also has a rudimentary ETL feature to transform one data format into
another. In addition, Tajo provides hash and range repartitioning. By
using both repartition methods, Tajo processes aggregation, join, and sort
queries over a number of cluster nodes. To evaluate the performance, we
have carried out a benchmark test using TPC-H 1TB on 32 cluster nodes.


== Meritocracy ==

We will discuss milestones and future plans in an open forum. We plan
to encourage an environment that supports a meritocracy. Contributors
will have different privileges according to their contributions.


== Community ==

Big data analysis has gained attention from open source communities,
industry, and academia. Some projects related to Hadoop already have
very large and active communities. We expect that Tajo will also establish
an active community. Since some of Tajo's features already work and it is
in the alpha stage, it will attract a large community soon.


== Core Developers ==

The core developers are a diverse group, many of whom are very
experienced in open source and the Apache Hadoop ecosystem.

 * Eli Reisman ereisman AT apache DOT org

 * Henry Saputra hsaputra AT apache DOT org

 * Hyunsik Choi hyunsik AT apache DOT org

 * Jae Hwa Jung jhjung AT gruter DOT com

 * Jihoon Son ghoonson AT gmail DOT com

 * Jin Ho Kim jhkim AT gruter DOT com

 * Roshan Sumbaly rsumbaly AT gmail DOT com

 * Sangwook Kim swkim AT inervit DOT com

 * Yi A Liu yi DOT a DOT liu AT intel DOT com


== Alignment ==

Tajo employs Apache Hadoop YARN as a resource management platform for large
clusters. It uses HDFS as a primary storage layer. It already supports
Hadoop-related data formats (RCFile, Trevni) and will support ORC files. In
addition, we plan to integrate Tajo with other products in the Hadoop
ecosystem. Tajo's modules are well organized, and these modules can also be
used by other projects.


= Known Risks =

== Orphaned Products ==

Most of codes have been developed by only two core 

Re: [VOTE] Accept Tajo into the Apache Incubator

2013-03-04 Thread Alex Karasulu
On Sat, Mar 2, 2013 at 3:48 AM, Alex Karasulu akaras...@apache.org wrote:

 +1 (binding)


Just as an FYI, I'm also a mentor of this project.

-- 
Best Regards,
-- Alex


Re: [VOTE] Accept Tajo into the Apache Incubator

2013-03-02 Thread Andrew Purtell
+1 (non binding)

Would be interested in helping out with HBase integration, and Bigtop
packaging.





-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)


Re: [VOTE] Accept Tajo into the Apache Incubator

2013-03-01 Thread Steve Loughran
On 28 February 2013 18:11, Hyunsik Choi hyun...@apache.org wrote:

 Hi Folks,

 I'd like to call a VOTE for acceptance of Tajo into the Apache incubator.
 The vote will close on Mar 7 at 6:00 PM (PST).

 [X] +1 Accept Tajo into the Apache incubator
 [] +0 Don't care.
 [] -1 Don't accept Tajo into the incubator because...


+1, binding.

It'll not only move the Hadoop stack up, but also act as more regression
tests for the layers below. And tests are always welcome.


Re: [VOTE] Accept Tajo into the Apache Incubator

2013-03-01 Thread Alex Karasulu
+1 (binding)


On Sat, Mar 2, 2013 at 2:16 AM, Roman Shaposhnik ro...@shaposhnik.org wrote:

 +1 (binding).

 I would also encourage you guys to take a look at Apache Bigtop
 as a way of integrating with the rest of the Hadoop ecosystem and
 bringing more testing into the fold.

 Looking forward to working with you!

 Thanks,
 Roman.


Re: [VOTE] Accept Tajo into the Apache Incubator

2013-02-28 Thread Jakob Homan
+1 (binding).  This is a great addition to the Incubator.



Re: [VOTE] Accept Tajo into the Apache Incubator

2013-02-28 Thread Matthias Friedrich
+1 (non-binding)

Looks really interesting, good luck!

Regards,
  Matthias


Re: [VOTE] Accept Tajo into the Apache Incubator

2013-02-28 Thread Chris Douglas
+1 (binding) -C

On Thu, Feb 28, 2013 at 10:11 AM, Hyunsik Choi hyun...@apache.org wrote:
 Hi Folks,

 I'd like to call a VOTE for acceptance of Tajo into the Apache incubator.
 The vote will close on Mar 7 at 6:00 PM (PST).

 [] +1 Accept Tajo into the Apache incubator
 [] +0 Don't care.
 [] -1 Don't accept Tajo into the incubator because...

 Full proposal is pasted at the bottom on this email, and the corresponding
 wiki is http://wiki.apache.org/incubator/TajoProposal.

 Only VOTEs from Incubator PMC members are binding, but all are welcome to
 express their thoughts.

 Thanks,
 Hyunsik

 PS: From the initial discussion, the main changes are that I've added 4 new
 committers. Also, I've revised some description of Known Risks because the
 initial committers have been diverse.

 
 Tajo Proposal

 = Abstract =

 Tajo is a distributed data warehouse system for Hadoop.


 = Proposal =

 Tajo is a relational and distributed data warehouse system for Hadoop. Tajo
 is designed for low-latency and scalable ad-hoc queries, online aggregation
 and ETL on large-data sets by leveraging advanced database techniques. It
 supports SQL standards. Tajo is inspired by Dryad, MapReduce, Dremel,
 Scope, and parallel databases. Tajo uses HDFS as a primary storage layer,
 and it has its own query engine which allows direct control of distributed
 execution and data flow. As a result, Tajo has a variety of query
 evaluation strategies and more optimization opportunities. In addition,
 Tajo will have a native columnar execution and and its optimizer. Tajo will
 be an alternative choice to Hive/Pig on the top of MapReduce.


 = Background =

 Big data analysis has gained much attention in the industrial. Open source
 communities have proposed scalable and distributed solutions for ad-hoc
 queries on big data. However, there is still room for improvement. Markets
 need more faster and efficient solutions. Recently, some alternatives
 (e.g., Cloudera's Impala and Amazon Redshift) have come out.


 = Rationale =

 There are a variety of open source distributed execution engines (e.g.,
 hive, and pig) running on the top of MapReduce. They are limited by MR
 framework. They cannot directly control distributed execution and data
 flow, and they just use MR framework. So, they have limited query
 evaluation strategies and optimization opportunities. It is hard for them
 to be optimized for a certain type of data processing.


 = Initial Goals =

 The initial goal is to write more documents to describe Tajo's internal. It
 will be helpful to recruit more committers and to build a solid community.
 Then, we will make milestones for short/long term plans.


 = Current Status =

 Tajo is in the alpha stage. Users can execute usual SQL queries (e.g.,
 selection, projection, group-by, join, union and sort) except for nested
 queries. Tajo provides various row/column storage formats, such as CSV,
 RowFile (a row-store file we have implemented), RCFile, and Trevni, and it
 also has a rudimentary ETL feature to transform one data format to another
 data format. In addition, Tajo provides hash and range repartitions. By
 using both repartition methods, Tajo processes aggregation, join, and sort
 queries over a number of cluster nodes. To evaluate the performance, we
 have carried out benchmark test using TPC-H 1TB on 32 cluster nodes.


 == Meritocracy ==

 We will discuss milestones and future plans in an open forum. We plan
 to encourage an environment that supports a meritocracy. The contributors
 will have different privileges according to their contributions.


 == Community ==

 Big data analysis has gained attention from open source communities,
 industry, and academia. Some projects related to Hadoop already have
 very large and active communities. We expect that Tajo will also establish
 an active community. Since Tajo already supports some features and is in
 the alpha stage, it will attract a large community soon.


 == Core Developers ==

 Core developers are a diverse group of developers, many of whom are very
 experienced in open source and the Apache Hadoop ecosystem.

  * Eli Reisman ereisman AT apache DOT org

  * Henry Saputra hsaputra AT apache DOT org

  * Hyunsik Choi hyunsik AT apache DOT org

  * Jae Hwa Jung jhjung AT gruter DOT com

  * Jihoon Son ghoonson AT gmail DOT com

  * Jin Ho Kim jhkim AT gruter DOT com

  * Roshan Sumbaly rsumbaly AT gmail DOT com

  * Sangwook Kim swkim AT inervit DOT com

  * Yi A Liu yi DOT a DOT liu AT intel DOT com


 == Alignment ==

 Tajo employs Apache Hadoop YARN as a resource management platform for large
 clusters. It uses HDFS as a primary storage layer. It already supports
 Hadoop-related data formats (RCFile, Trevni) and will support ORC files. In
 addition, we plan to integrate Tajo with other products in the Hadoop
 ecosystem. Tajo's modules are well organized, and these modules can also be
 used for other projects.


 = Known Risks =

 == 

Re: [VOTE] Accept Tajo into the Apache Incubator

2013-02-28 Thread Henry Saputra
+1 (non-binding)


- Henry


On Thu, Feb 28, 2013 at 10:11 AM, Hyunsik Choi hyun...@apache.org wrote:

 Hi Folks,

 I'd like to call a VOTE for acceptance of Tajo into the Apache incubator.
 The vote will close on Mar 7 at 6:00 PM (PST).

 [] +1 Accept Tajo into the Apache incubator
 [] +0 Don't care.
 [] -1 Don't accept Tajo into the incubator because...

 Full proposal is pasted at the bottom on this email, and the corresponding
 wiki is http://wiki.apache.org/incubator/TajoProposal.

 Only VOTEs from Incubator PMC members are binding, but all are welcome to
 express their thoughts.

 Thanks,
 Hyunsik

 PS: From the initial discussion, the main changes are that I've added 4 new
 committers. Also, I've revised some description of Known Risks because the
 initial committers have been diverse.

 

Re: [VOTE] Accept Tajo into the Apache Incubator

2013-02-28 Thread Jean-Baptiste Onofré

+1 (binding)

Regards
JB

On 02/28/2013 09:13 PM, Henry Saputra wrote:

+1 (non-binding)


- Henry


On Thu, Feb 28, 2013 at 10:11 AM, Hyunsik Choi hyun...@apache.org wrote:


Hi Folks,

I'd like to call a VOTE for acceptance of Tajo into the Apache incubator.
The vote will close on Mar 7 at 6:00 PM (PST).

[] +1 Accept Tajo into the Apache incubator
[] +0 Don't care.
[] -1 Don't accept Tajo into the incubator because...

Full proposal is pasted at the bottom on this email, and the corresponding
wiki is http://wiki.apache.org/incubator/TajoProposal.

Only VOTEs from Incubator PMC members are binding, but all are welcome to
express their thoughts.

Thanks,
Hyunsik

PS: From the initial discussion, the main changes are that I've added 4 new
committers. Also, I've revised some description of Known Risks because the
initial committers have been diverse.



Re: [VOTE] Accept Tajo into the Apache Incubator

2013-02-28 Thread Ted Dunning
+1 (binding)

On Thu, Feb 28, 2013 at 11:52 AM, Matthias Friedrich m...@mafr.de wrote:

 +1 (non-binding)

 Looks really interesting, good luck!

 Regards,
   Matthias

 On Friday, 2013-03-01, Hyunsik Choi wrote:
  Hi Folks,
 
  I'd like to call a VOTE for acceptance of Tajo into the Apache incubator.
  The vote will close on Mar 7 at 6:00 PM (PST).
 
  [] +1 Accept Tajo into the Apache incubator
  [] +0 Don't care.
  [] -1 Don't accept Tajo into the incubator because...
 
  Full proposal is pasted at the bottom on this email, and the corresponding
  wiki is http://wiki.apache.org/incubator/TajoProposal.
 
  Only VOTEs from Incubator PMC members are binding, but all are welcome to
  express their thoughts.
 
  Thanks,
  Hyunsik
 
  PS: From the initial discussion, the main changes are that I've added 4 new
  committers. Also, I've revised some description of Known Risks because the
  initial committers have been diverse.
 
  

Re: [VOTE] Accept Tajo into the Apache Incubator

2013-02-28 Thread Alan Cabrera
+1

Regards,
Alan

On Feb 28, 2013, at 10:11 AM, Hyunsik Choi wrote:

 Hi Folks,
 
 I'd like to call a VOTE for acceptance of Tajo into the Apache incubator.
 The vote will close on Mar 7 at 6:00 PM (PST).
 
 [] +1 Accept Tajo into the Apache incubator
 [] +0 Don't care.
 [] -1 Don't accept Tajo into the incubator because...
 
 Full proposal is pasted at the bottom on this email, and the corresponding
 wiki is http://wiki.apache.org/incubator/TajoProposal.
 
 Only VOTEs from Incubator PMC members are binding, but all are welcome to
 express their thoughts.
 
 Thanks,
 Hyunsik
 
 PS: From the initial discussion, the main changes are that I've added 4 new
 committers. Also, I've revised some description of Known Risks because the
 initial committers have been diverse.
 
 

Re: [VOTE] Accept Tajo into the Apache Incubator

2013-02-28 Thread Sharad Agarwal
+1 (non-binding)

On Thu, Feb 28, 2013 at 11:41 PM, Hyunsik Choi hyun...@apache.org wrote:

 Hi Folks,

 I'd like to call a VOTE for acceptance of Tajo into the Apache incubator.
 The vote will close on Mar 7 at 6:00 PM (PST).

 [] +1 Accept Tajo into the Apache incubator
 [] +0 Don't care.
 [] -1 Don't accept Tajo into the incubator because...

 Full proposal is pasted at the bottom on this email, and the corresponding
 wiki is http://wiki.apache.org/incubator/TajoProposal.

 Only VOTEs from Incubator PMC members are binding, but all are welcome to
 express their thoughts.

 Thanks,
 Hyunsik

 PS: From the initial discussion, the main changes are that I've added 4 new
 committers. Also, I've revised some description of Known Risks because the
 initial committers have been diverse.

 

Re: [VOTE] Accept Tajo into the Apache Incubator

2013-02-28 Thread Suresh Marru
+ 1 (binding).

Happy Incubating,
Suresh

On Feb 28, 2013, at 10:11 AM, Hyunsik Choi hyun...@apache.org wrote:

 Hi Folks,
 
 I'd like to call a VOTE for acceptance of Tajo into the Apache incubator.
 The vote will close on Mar 7 at 6:00 PM (PST).
 
 [] +1 Accept Tajo into the Apache incubator
 [] +0 Don't care.
 [] -1 Don't accept Tajo into the incubator because...
 
 Full proposal is pasted at the bottom on this email, and the corresponding
 wiki is http://wiki.apache.org/incubator/TajoProposal.
 
 Only VOTEs from Incubator PMC members are binding, but all are welcome to
 express their thoughts.
 
 Thanks,
 Hyunsik
 
 PS: From the initial discussion, the main changes are that I've added 4 new
 committers. Also, I've revised some description of Known Risks because the
 initial committers have been diverse.
 
 

Re: [VOTE] Accept Tajo into the Apache Incubator

2013-02-28 Thread Owen O'Malley
+1 (binding)


On Thu, Feb 28, 2013 at 10:34 PM, Suresh Marru sma...@apache.org wrote:

 + 1 (binding).

 Happy Incubating,
 Suresh

 On Feb 28, 2013, at 10:11 AM, Hyunsik Choi hyun...@apache.org wrote:

  Hi Folks,
 
  I'd like to call a VOTE for acceptance of Tajo into the Apache incubator.
  The vote will close on Mar 7 at 6:00 PM (PST).
 
  [] +1 Accept Tajo into the Apache incubator
  [] +0 Don't care.
  [] -1 Don't accept Tajo into the incubator because...
 
  The full proposal is pasted at the bottom of this email, and the
  corresponding wiki is http://wiki.apache.org/incubator/TajoProposal.
 
  Only VOTEs from Incubator PMC members are binding, but all are welcome to
  express their thoughts.
 
  Thanks,
  Hyunsik
 
  PS: Since the initial discussion, the main change is that I've added 4 new
  committers. I've also revised the description of Known Risks because the
  initial committers are now more diverse.
 
  
  Tajo Proposal
 
  = Abstract =
 
  Tajo is a distributed data warehouse system for Hadoop.
 
 
  = Proposal =
 
  Tajo is a relational and distributed data warehouse system for Hadoop. Tajo
  is designed for low-latency and scalable ad-hoc queries, online aggregation,
  and ETL on large data sets by leveraging advanced database techniques. It
  supports SQL standards. Tajo is inspired by Dryad, MapReduce, Dremel,
  Scope, and parallel databases. Tajo uses HDFS as a primary storage layer,
  and it has its own query engine which allows direct control of distributed
  execution and data flow. As a result, Tajo has a variety of query
  evaluation strategies and more optimization opportunities. In addition,
  Tajo will have native columnar execution and its own optimizer. Tajo will
  be an alternative choice to Hive/Pig on top of MapReduce.
 
 
  = Background =
 
  Big data analysis has gained much attention in industry. Open source
  communities have proposed scalable and distributed solutions for ad-hoc
  queries on big data. However, there is still room for improvement. Markets
  need faster and more efficient solutions. Recently, some alternatives
  (e.g., Cloudera's Impala and Amazon Redshift) have come out.
 
 
  = Rationale =
 
  There are a variety of open source distributed execution engines (e.g.,
  Hive and Pig) running on top of MapReduce. They are limited by the MR
  framework: they cannot directly control distributed execution and data
  flow. As a result, they have limited query evaluation strategies and
  optimization opportunities, and it is hard to optimize them for a
  certain type of data processing.
 
 
  = Initial Goals =
 
  The initial goal is to write more documents describing Tajo's internals.
  This will help recruit more committers and build a solid community.
  Then, we will make milestones for short/long term plans.
 
 
  = Current Status =
 
  Tajo is in the alpha stage. Users can execute usual SQL queries (e.g.,
  selection, projection, group-by, join, union and sort) except for nested
  queries. Tajo provides various row/column storage formats, such as CSV,
  RowFile (a row-store file we have implemented), RCFile, and Trevni, and it
  also has a rudimentary ETL feature to transform one data format to another
  data format. In addition, Tajo provides hash and range repartitions. By
  using both repartition methods, Tajo processes aggregation, join, and sort
  queries over a number of cluster nodes. To evaluate the performance, we
  have carried out a benchmark test using TPC-H 1TB on 32 cluster nodes.
 
 
  == Meritocracy ==
 
  We will discuss milestones and future plans in an open forum. We plan
  to encourage an environment that supports a meritocracy. The contributors
  will have different privileges according to their contributions.
 
 
  == Community ==
 
  Big data analysis has gained attention from open source communities,
  industry, and academia. Some projects related to Hadoop already have very
  large and active communities. We expect that Tajo will also establish an
  active community. Since several features already work and Tajo is in the
  alpha stage, it should attract a large community soon.
 
 
  == Core Developers ==
 
  Core developers are a diverse group of developers, many of whom are very
  experienced in open source and the Apache Hadoop ecosystem.
 
  * Eli Reisman ereisman AT apache DOT org
 
  * Henry Saputra hsaputra AT apache DOT org
 
  * Hyunsik Choi hyunsik AT apache DOT org
 
  * Jae Hwa Jung jhjung AT gruter DOT com
 
  * Jihoon Son ghoonson AT gmail DOT com
 
  * Jin Ho Kim jhkim AT gruter DOT com
 
  * Roshan Sumbaly rsumbaly AT gmail DOT com
 
  * Sangwook Kim swkim AT inervit DOT com
 
  * Yi A Liu yi DOT a DOT liu AT intel DOT com
 
 
  == Alignment ==
 
  Tajo employs Apache Hadoop YARN as a resource management platform for large
  clusters. It uses HDFS as a primary storage layer. It already supports
  Hadoop-related data formats