Re: [VOTE] Accept Tajo into the Apache Incubator
With 11 binding +1's and 5 non-binding +1's the vote passes. Jakob, can you start the process for setting up the project? Thanks, all. -- Owen On Mon, Mar 4, 2013 at 4:35 PM, Alex Karasulu akaras...@apache.org wrote: On Sat, Mar 2, 2013 at 3:48 AM, Alex Karasulu akaras...@apache.org wrote: +1 (binding) Just as an FYI, I'm also a mentor of this project. -- Best Regards, -- Alex
Re: [VOTE] Accept Tajo into the Apache Incubator
Will do. This is going to be fun. On Thu, Mar 7, 2013 at 4:17 PM, Owen O'Malley omal...@apache.org wrote: With 11 binding +1's and 5 non-binding +1's the vote passes. Jakob, can you start the process for setting up the project? Thanks, all. -- Owen On Mon, Mar 4, 2013 at 4:35 PM, Alex Karasulu akaras...@apache.org wrote: On Sat, Mar 2, 2013 at 3:48 AM, Alex Karasulu akaras...@apache.org wrote: +1 (binding) Just as an FYI, I'm also a mentor of this project. -- Best Regards, -- Alex
Re: [VOTE] Accept Tajo into the Apache Incubator
Congratz guys! On 3/8/2013 9:17 AM, Owen O'Malley wrote: With 11 binding +1's and 5 non-binding +1's the vote passes. Jakob, can you start the process for setting up the project? Thanks, all. -- Owen On Mon, Mar 4, 2013 at 4:35 PM, Alex Karasulu akaras...@apache.org wrote: On Sat, Mar 2, 2013 at 3:48 AM, Alex Karasulu akaras...@apache.org wrote: +1 (binding) Just as an FYI, I'm also a mentor of this project. -- Best Regards, -- Alex -- Best Regards, Edward J. Yoon @eddieyoon - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
Re: [VOTE] Accept Tajo into the Apache Incubator
Indeed congratulations everyone! On Fri, Mar 8, 2013 at 3:50 AM, edward yoon edward.y...@oracle.com wrote: Congratz guys! On 3/8/2013 9:17 AM, Owen O'Malley wrote: With 11 binding +1's and 5 non-binding +1's the vote passes. Jakob, can you start the process for setting up the project? Thanks, all. -- Owen On Mon, Mar 4, 2013 at 4:35 PM, Alex Karasulu akaras...@apache.org wrote: On Sat, Mar 2, 2013 at 3:48 AM, Alex Karasulu akaras...@apache.org wrote: +1 (binding) Just as an FYI, I'm also a mentor of this project. -- Best Regards, -- Alex -- Best Regards, Edward J. Yoon @eddieyoon -- Best Regards, -- Alex
Re: [VOTE] Accept Tajo into the Apache Incubator
+1 (binding) from me.

Cheers,
Chris

On 2/28/13 10:11 AM, Hyunsik Choi hyun...@apache.org wrote:

Hi Folks,

I'd like to call a VOTE for acceptance of Tajo into the Apache incubator. The vote will close on Mar 7 at 6:00 PM (PST).

[] +1 Accept Tajo into the Apache incubator
[] +0 Don't care.
[] -1 Don't accept Tajo into the incubator because...

The full proposal is pasted at the bottom of this email, and the corresponding wiki is http://wiki.apache.org/incubator/TajoProposal.

Only VOTEs from Incubator PMC members are binding, but all are welcome to express their thoughts.

Thanks,
Hyunsik

PS: Since the initial discussion, the main change is that I've added 4 new committers. Also, I've revised some of the description of Known Risks because the set of initial committers is now more diverse.

Tajo Proposal

= Abstract =
Tajo is a distributed data warehouse system for Hadoop.

= Proposal =
Tajo is a relational and distributed data warehouse system for Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL on large data sets by leveraging advanced database techniques. It supports SQL standards. Tajo is inspired by Dryad, MapReduce, Dremel, Scope, and parallel databases. Tajo uses HDFS as a primary storage layer, and it has its own query engine, which allows direct control of distributed execution and data flow. As a result, Tajo has a variety of query evaluation strategies and more optimization opportunities. In addition, Tajo will have native columnar execution and its own optimizer. Tajo will be an alternative to Hive/Pig on top of MapReduce.

= Background =
Big data analysis has gained much attention in industry. Open source communities have proposed scalable and distributed solutions for ad-hoc queries on big data. However, there is still room for improvement. Markets need faster and more efficient solutions. Recently, some alternatives (e.g., Cloudera's Impala and Amazon Redshift) have come out.

= Rationale =
There are a variety of open source distributed execution engines (e.g., Hive and Pig) running on top of MapReduce. They are limited by the MR framework. They cannot directly control distributed execution and data flow; they simply use the MR framework. So, they have limited query evaluation strategies and optimization opportunities. It is hard for them to be optimized for a certain type of data processing.

= Initial Goals =
The initial goal is to write more documents describing Tajo's internals. This will help recruit more committers and build a solid community. Then, we will set milestones for short- and long-term plans.

= Current Status =
Tajo is in the alpha stage. Users can execute common SQL queries (e.g., selection, projection, group-by, join, union, and sort) except for nested queries. Tajo provides various row/column storage formats, such as CSV, RowFile (a row-store file we have implemented), RCFile, and Trevni, and it also has a rudimentary ETL feature to transform one data format into another. In addition, Tajo provides hash and range repartitioning. By using both repartition methods, Tajo processes aggregation, join, and sort queries over a number of cluster nodes. To evaluate performance, we have carried out a benchmark test using TPC-H 1TB on 32 cluster nodes.

== Meritocracy ==
We will discuss milestones and future plans in an open forum. We plan to encourage an environment that supports a meritocracy. Contributors will have different privileges according to their contributions.

== Community ==
Big data analysis has gained attention from open source communities, industry, and academia. Some projects related to Hadoop already have very large and active communities. We expect that Tajo will also establish an active community. Since Tajo already works for some features and is in the alpha stage, it will attract a large community soon.

== Core Developers ==
Core developers are a diverse group of developers, many of whom are very experienced in open source and the Apache Hadoop ecosystem.

* Eli Reisman ereisman AT apache DOT org
* Henry Saputra hsaputra AT apache DOT org
* Hyunsik Choi hyunsik AT apache DOT org
* Jae Hwa Jung jhjung AT gruter DOT com
* Jihoon Son ghoonson AT gmail DOT com
* Jin Ho Kim jhkim AT gruter DOT com
* Roshan Sumbaly rsumbaly AT gmail DOT com
* Sangwook Kim swkim AT inervit DOT com
* Yi A Liu yi DOT a DOT liu AT intel DOT com

== Alignment ==
Tajo employs Apache Hadoop YARN as a resource management platform for large clusters. It uses HDFS as a primary storage layer. It already supports Hadoop-related data formats (RCFile, Trevni) and will support ORC files. In addition, we plan to integrate Tajo with other products of the Hadoop ecosystem. Tajo's modules are well organized, and these modules can also be used by other projects.

= Known Risks =
== Orphaned Products ==
Most of the code has been developed by only two core
Re: [VOTE] Accept Tajo into the Apache Incubator
On Sat, Mar 2, 2013 at 3:48 AM, Alex Karasulu akaras...@apache.org wrote: +1 (binding) Just as an FYI, I'm also a mentor of this project. -- Best Regards, -- Alex
Re: [VOTE] Accept Tajo into the Apache Incubator
+1 (non binding) Would be interested in helping out with HBase integration, and Bigtop packaging. On Thu, Feb 28, 2013 at 10:11 AM, Hyunsik Choi hyun...@apache.org wrote: Hi Folks, I'd like to call a VOTE for acceptance of Tajo into the Apache incubator. The vote will close on Mar 7 at 6:00 PM (PST). -- Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
Re: [VOTE] Accept Tajo into the Apache Incubator
On 28 February 2013 18:11, Hyunsik Choi hyun...@apache.org wrote: Hi Folks, I'd like to call a VOTE for acceptance of Tajo into the Apache incubator. The vote will close on Mar 7 at 6:00 PM (PST). [X] +1 Accept Tajo into the Apache incubator [] +0 Don't care. [] -1 Don't accept Tajo into the incubator because... +1, binding. It'll not only move the Hadoop stack up, but also act as more regression tests for the layers below. And tests are always welcome.
Re: [VOTE] Accept Tajo into the Apache Incubator
+1 (binding) On Sat, Mar 2, 2013 at 2:16 AM, Roman Shaposhnik ro...@shaposhnik.org wrote: +1 (binding). I would also encourage you guys to take a look at Apache Bigtop as a way of integrating with the rest of Hadoop ecosystem and bring more testing into the fold. Looking forward to working with you! Thanks, Roman. On Thu, Feb 28, 2013 at 10:11 AM, Hyunsik Choi hyun...@apache.org wrote: Hi Folks, I'd like to call a VOTE for acceptance of Tajo into the Apache incubator. The vote will close on Mar 7 at 6:00 PM (PST).
Re: [VOTE] Accept Tajo into the Apache Incubator
+1 (binding). This is a great addition to the Incubator. On Thu, Feb 28, 2013 at 10:11 AM, Hyunsik Choi hyun...@apache.org wrote: Hi Folks, I'd like to call a VOTE for acceptance of Tajo into the Apache incubator. The vote will close on Mar 7 at 6:00 PM (PST).
Re: [VOTE] Accept Tajo into the Apache Incubator
+1 (non-binding) Looks really interesting, good luck! Regards, Matthias On Friday, 2013-03-01, Hyunsik Choi wrote: Hi Folks, I'd like to call a VOTE for acceptance of Tajo into the Apache incubator. The vote will close on Mar 7 at 6:00 PM (PST).
Re: [VOTE] Accept Tajo into the Apache Incubator
+1 (binding) -C On Thu, Feb 28, 2013 at 10:11 AM, Hyunsik Choi hyun...@apache.org wrote: Hi Folks, I'd like to call a VOTE for acceptance of Tajo into the Apache incubator. The vote will close on Mar 7 at 6:00 PM (PST).
Re: [VOTE] Accept Tajo into the Apache Incubator
+1 (non-binding) - Henry On Thu, Feb 28, 2013 at 10:11 AM, Hyunsik Choi hyun...@apache.org wrote: Hi Folks, I'd like to call a VOTE for acceptance of Tajo into the Apache incubator. The vote will close on Mar 7 at 6:00 PM (PST). [] +1 Accept Tajo into the Apache incubator [] +0 Don't care. [] -1 Don't accept Tajo into the incubator because... Full proposal is pasted at the bottom on this email, and the corresponding wiki is http://wiki.apache.org/incubator/TajoProposal. Only VOTEs from Incubator PMC members are binding, but all are welcome to express their thoughts. Thanks, Hyunsik PS: From the initial discussion, the main changes are that I've added 4 new committers. Also, I've revised some description of Known Risks because the initial committers have been diverse. Tajo Proposal = Abstract = Tajo is a distributed data warehouse system for Hadoop. = Proposal = Tajo is a relational and distributed data warehouse system for Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation and ETL on large-data sets by leveraging advanced database techniques. It supports SQL standards. Tajo is inspired by Dryad, MapReduce, Dremel, Scope, and parallel databases. Tajo uses HDFS as a primary storage layer, and it has its own query engine which allows direct control of distributed execution and data flow. As a result, Tajo has a variety of query evaluation strategies and more optimization opportunities. In addition, Tajo will have a native columnar execution and and its optimizer. Tajo will be an alternative choice to Hive/Pig on the top of MapReduce. = Background = Big data analysis has gained much attention in the industrial. Open source communities have proposed scalable and distributed solutions for ad-hoc queries on big data. However, there is still room for improvement. Markets need more faster and efficient solutions. Recently, some alternatives (e.g., Cloudera's Impala and Amazon Redshift) have come out. 

= Rationale =

There are a variety of open source distributed execution engines (e.g., Hive and Pig) running on top of MapReduce. They are limited by the MR framework: they cannot directly control distributed execution and data flow, and they simply use the MR framework as it is. As a result, they have limited query evaluation strategies and optimization opportunities, and it is hard to optimize them for a particular type of data processing.

= Initial Goals =

The initial goal is to write more documents describing Tajo's internals. This will help to recruit more committers and to build a solid community. Then, we will set milestones for short- and long-term plans.

= Current Status =

Tajo is in the alpha stage. Users can execute the usual SQL queries (e.g., selection, projection, group-by, join, union, and sort), except for nested queries. Tajo provides various row/column storage formats, such as CSV, RowFile (a row-store file we have implemented), RCFile, and Trevni, and it also has a rudimentary ETL feature to transform one data format into another. In addition, Tajo provides hash and range repartitions. By using both repartition methods, Tajo processes aggregation, join, and sort queries over a number of cluster nodes. To evaluate the performance, we have carried out a benchmark test using TPC-H 1TB on 32 cluster nodes.

== Meritocracy ==

We will discuss the milestones and the future plan in an open forum. We plan to encourage an environment that supports a meritocracy. Contributors will have different privileges according to their contributions.

== Community ==

Big data analysis has gained attention from open source communities, industry, and academia. Some projects related to Hadoop already have very large and active communities. We expect that Tajo will also establish an active community. Since some of Tajo's features already work and it is in the alpha stage, we expect it to attract a large community soon.
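
For readers unfamiliar with the hash and range repartitions mentioned under Current Status, the following is a minimal illustrative sketch in Python. It is not Tajo code: the function names, the use of CRC32 as the hash, and the hand-picked range boundaries are all assumptions made for the example. It only shows the general idea that hash repartitioning co-locates equal keys (useful for group-by and equi-join), while range repartitioning keeps partitions ordered (useful for sort).

```python
from bisect import bisect_right
from zlib import crc32


def hash_partition(rows, key, num_partitions):
    """Hypothetical helper: rows with equal keys land in the same partition."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is deterministic across processes, unlike built-in hash() on str
        idx = crc32(str(row[key]).encode("utf-8")) % num_partitions
        parts[idx].append(row)
    return parts


def range_partition(rows, key, boundaries):
    """Hypothetical helper: partition i only holds keys below those in partition i+1.

    `boundaries` must be sorted; len(boundaries) + 1 partitions are produced.
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, row[key])].append(row)
    return parts


rows = [{"k": 7}, {"k": 1}, {"k": 7}, {"k": 4}, {"k": 9}]

# Hash repartition: both rows with k == 7 end up on the same "node".
by_hash = hash_partition(rows, "k", 3)

# Range repartition with assumed boundaries 3 and 6: keys <3, 3..5, >=6.
by_range = range_partition(rows, "k", [3, 6])

# Sorting each range partition locally and concatenating yields a global sort.
globally_sorted = [
    r["k"] for part in by_range for r in sorted(part, key=lambda r: r["k"])
]
print(globally_sorted)
```

In a real system the range boundaries would typically be derived from sampled key statistics rather than fixed by hand; that step is omitted here.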

== Core Developers ==

The core developers are a diverse group, many of whom are very experienced in open source and the Apache Hadoop ecosystem.

 * Eli Reisman ereisman AT apache DOT org
 * Henry Saputra hsaputra AT apache DOT org
 * Hyunsik Choi hyunsik AT apache DOT org
 * Jae Hwa Jung jhjung AT gruter DOT com
 * Jihoon Son ghoonson AT gmail DOT com
 * Jin Ho Kim jhkim AT gruter DOT com
 * Roshan Sumbaly rsumbaly AT gmail DOT com
 * Sangwook Kim swkim AT inervit DOT com
 * Yi A Liu yi DOT a DOT liu AT intel DOT com

== Alignment ==

Tajo employs Apache Hadoop YARN as a resource management platform for large clusters. It uses HDFS as a primary storage layer. It already supports Hadoop-related data formats (RCFile, Trevni) and will support ORC files. In addition, we plan to integrate Tajo with other products of the Hadoop ecosystem. Tajo's modules are well organized, and these modules can also be used for other projects.

= Known
Re: [VOTE] Accept Tajo into the Apache Incubator
+1 (binding)

Regards
JB

On 02/28/2013 09:13 PM, Henry Saputra wrote:
+1 (non-binding) - Henry
Re: [VOTE] Accept Tajo into the Apache Incubator
+1 (binding)

On Thu, Feb 28, 2013 at 11:52 AM, Matthias Friedrich m...@mafr.de wrote:
+1 (non-binding) Looks really interesting, good luck! Regards, Matthias
Re: [VOTE] Accept Tajo into the Apache Incubator
+1

Regards,
Alan
Re: [VOTE] Accept Tajo into the Apache Incubator
+1 (non-binding)
Re: [VOTE] Accept Tajo into the Apache Incubator
+ 1 (binding).

Happy Incubating,
Suresh
Re: [VOTE] Accept Tajo into the Apache Incubator
+1 (binding)

On Thu, Feb 28, 2013 at 10:34 PM, Suresh Marru sma...@apache.org wrote:
+ 1 (binding). Happy Incubating, Suresh