Re: [PROPOSAL] MRQL for the Apache Incubator
I think it's time to call for vote.

On Mon, Mar 4, 2013 at 9:25 PM, Tommaso Teofili tommaso.teof...@gmail.com wrote:
> Nice proposal indeed, I'd say having 3 mentors is usually better to avoid release headaches.
>
> Regards,
> Tommaso

--
Best Regards, Edward J. Yoon
@eddieyoon

To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org
Re: [PROPOSAL] MRQL for the Apache Incubator
I added myself as a mentor. Welcome aboard.

On Wed, Mar 6, 2013 at 9:02 AM, Edward J. Yoon edwardy...@apache.org wrote:
> I think it's time to call for vote.

--
Thanks
- Mohammad Nour

Life is like riding a bicycle. To keep your balance you must keep moving
- Albert Einstein
Re: [PROPOSAL] MRQL for the Apache Incubator
On Sat, Mar 2, 2013 at 7:12 AM, Leonidas Fegaras fega...@cse.uta.edu wrote:
> == Champion ==
> * Edward J. Yoon edwardyoon AT apache DOT org
>
> == Nominated Mentors ==
> * Alex Karasulu akarasulu AT apache DOT org
> ...

Is Edward going to stay on as a mentor as well? Two (active) mentors is the bare minimum IMO.

-Bertrand
Re: [PROPOSAL] MRQL for the Apache Incubator
On Mon, Mar 4, 2013 at 12:31 PM, Bertrand Delacretaz bdelacre...@apache.org wrote:
> Is Edward going to stay on as a mentor as well? Two (active) mentors is the bare minimum IMO.

I suspect so but let's hear from Edward himself.

Best Regards,
-- Alex
Re: [PROPOSAL] MRQL for the Apache Incubator
Sure I can. :)

Of course, we'll welcome more mentors from incubator IPMC if there're volunteers.

On Mon, Mar 4, 2013 at 7:34 PM, Alex Karasulu akaras...@apache.org wrote:
> I suspect so but let's hear from Edward himself.
>
> Best Regards,
> -- Alex

--
Best Regards, Edward J. Yoon
@eddieyoon
Re: [PROPOSAL] MRQL for the Apache Incubator
Nice proposal indeed, I'd say having 3 mentors is usually better to avoid release headaches.

Regards,
Tommaso

2013/3/4 Edward J. Yoon edwardy...@apache.org
> Sure I can. :)
>
> Of course, we'll welcome more mentors from incubator IPMC if there're volunteers.
[PROPOSAL] MRQL for the Apache Incubator
Dear ASF members,

We would like to propose a new project to the incubator, called MRQL. Edward J. Yoon has volunteered to be the champion for this project. The proposal draft is available at: http://wiki.apache.org/incubator/MRQLProposal

We are very excited about having this opportunity to work with ASF to create an incubator project. We are looking forward to your feedback and suggestions.

Best regards
Leonidas Fegaras

= Abstract =

MRQL is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop and Hama.

= Proposal =

MRQL (pronounced ''miracle'') is a query processing and optimization system for large-scale, distributed data analysis. MRQL (the MapReduce Query Language) is an SQL-like query language for large-scale data analysis on a cluster of computers. The MRQL query processing system can evaluate MRQL queries in two modes: in MapReduce mode on top of Apache Hadoop or in Bulk Synchronous Parallel (BSP) mode on top of Apache Hama.

The MRQL query language is powerful enough to express most common data analysis tasks over many forms of raw ''in-situ'' data, such as XML and JSON documents, binary files, and CSV documents. MRQL is more powerful than other current high-level MapReduce languages, such as Hive and PigLatin, since it can operate on more complex data and supports more powerful query constructs, thus eliminating the need for explicit MapReduce code. With MRQL, users will be able to express complex data analysis tasks, such as PageRank, k-means clustering, and matrix factorization, using SQL-like queries exclusively, while the MRQL query processing system will be able to compile these queries to efficient Java code.

= Background =

The initial code was developed at the University of Texas at Arlington (UTA) by a research team led by Leonidas Fegaras. The software was first released in May 2011.
The original goal of this project was to build a query processing system that translates SQL-like data analysis queries to efficient workflows of MapReduce jobs. A design goal was to use HDFS as the physical storage layer, without any indexing, data partitioning, or data normalization, and to use Hadoop (without extensions) as the run-time engine. The motivation behind this work was to build a platform to test new ideas on query processing and optimization techniques applicable to the MapReduce framework.

A year ago, MRQL was extended to run on Hama. The motivation for this extension was that Hadoop MapReduce jobs are required to read their input from HDFS and write their output to HDFS. This simplifies reliability and fault tolerance, but it imposes a high overhead on complex MapReduce workflows and graph algorithms, such as PageRank, which require repetitive jobs. In addition, Hadoop does not preserve data in memory across consecutive MapReduce jobs, so the data must be reread at every step, even when it is constant. BSP, on the other hand, does not suffer from this restriction and, under certain circumstances, allows complex repetitive algorithms to run entirely in the collective memory of a cluster.

Thus, the goal was to be able to run the same MRQL queries in both modes, MapReduce and BSP, without modifying the queries: if there are enough resources available, and low latency and speed are more important than resilience, queries may run in BSP mode; otherwise, the same queries may run in MapReduce mode. BSP evaluation was found to be a good choice when fault tolerance is not critical, data (both input and intermediate) can fit in the cluster memory, and data processing requires complex/repetitive steps.
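The three conditions above for preferring BSP can be written down as a small decision rule. The following is only an illustrative sketch in Python, not MRQL code: the `Workload` class, `choose_mode`, `input_bytes_read`, and the sizes used are all hypothetical, and the read-cost model simply assumes that MapReduce rereads the constant input at every step while BSP reads it once.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a query workload (not part of MRQL)."""
    input_bytes: int             # size of the constant input data
    intermediate_bytes: int      # peak size of intermediate results
    iterations: int              # number of repeated steps (1 = single pass)
    needs_fault_tolerance: bool  # True if resilience matters more than speed


def choose_mode(w: Workload, cluster_memory_bytes: int) -> str:
    """Return "BSP" or "MapReduce" using the three criteria from the text."""
    fits_in_memory = w.input_bytes + w.intermediate_bytes <= cluster_memory_bytes
    repetitive = w.iterations > 1
    if not w.needs_fault_tolerance and fits_in_memory and repetitive:
        return "BSP"
    return "MapReduce"


def input_bytes_read(w: Workload, mode: str) -> int:
    """Rough count of input bytes read from storage over all iterations.

    Assumption: MapReduce rereads the constant input at every step,
    while BSP reads it once and keeps it in cluster memory.
    """
    if mode == "BSP":
        return w.input_bytes
    return w.input_bytes * w.iterations


GIB = 1024 ** 3

# Made-up numbers for an iterative, PageRank-like job.
pagerank = Workload(input_bytes=10 * GIB, intermediate_bytes=2 * GIB,
                    iterations=20, needs_fault_tolerance=False)

mode = choose_mode(pagerank, cluster_memory_bytes=64 * GIB)
print(mode, input_bytes_read(pagerank, mode) // GIB)                # BSP 10
print("MapReduce", input_bytes_read(pagerank, "MapReduce") // GIB)  # MapReduce 200
```

With these made-up numbers the rule picks BSP, and the model reads the 10 GiB input once instead of 20 times. If the cluster memory were smaller than the combined input and intermediate data, the same workload would fall back to MapReduce mode.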
The research results of this ongoing work have already been published in conferences (WebDB'11, EDBT'12, and DataCloud'12), and the authors have received positive feedback from researchers in academia and industry who attended these conferences.

= Rationale =

 * MRQL will be the first general-purpose, SQL-like query language for data analysis based on BSP. Currently, many programmers prefer to code their MapReduce applications in a higher-level query language, rather than an algorithmic language. For instance, Pig is used for 60% of Yahoo MapReduce jobs, while Hive is used for 90% of Facebook MapReduce jobs. This, we believe, will also be the trend for BSP applications because, even though the BSP model is in principle very simple to understand, it is hard to develop, optimize, and maintain non-trivial BSP applications coded in a general-purpose programming language. Currently, there is no widely accepted declarative BSP query language, although there are a few special-purpose BSP systems: for graph analysis, such as Google Pregel and Apache Giraph; for machine learning, such as BSML; and for scientific data analysis.
 * MRQL can capture many complex data analysis algorithms in declarative form. Existing MapReduce query languages, such as HiveQL and PigLatin, provide a limited syntax for
Re: [PROPOSAL] MRQL for the Apache Incubator
Sounds awesome guys look forward to the VOTE.

Cheers,
Chris

On 3/2/13 7:12 AM, Leonidas Fegaras fega...@cse.uta.edu wrote:
> Dear ASF members,
>
> We would like to propose a new project to the incubator, called MRQL. Edward J. Yoon has volunteered to be the champion for this project. The proposal draft is available at: http://wiki.apache.org/incubator/MRQLProposal
> ...