Re: [PROPOSAL] MRQL for the Apache Incubator

2013-03-06 Thread Edward J. Yoon
I think it's time to call for a vote.

On Mon, Mar 4, 2013 at 9:25 PM, Tommaso Teofili
tommaso.teof...@gmail.com wrote:
 Nice proposal indeed, I'd say having 3 mentors is usually better to avoid
 release headaches.
 Regards,
 Tommaso


 2013/3/4 Edward J. Yoon edwardy...@apache.org

 Sure I can. :)

 Of course, we'll welcome more mentors from incubator IPMC if there're
 volunteers.

 On Mon, Mar 4, 2013 at 7:34 PM, Alex Karasulu akaras...@apache.org
 wrote:
  On Mon, Mar 4, 2013 at 12:31 PM, Bertrand Delacretaz 
 bdelacre...@apache.org
  wrote:
 
  On Sat, Mar 2, 2013 at 7:12 AM, Leonidas Fegaras fega...@cse.uta.edu
  wrote:
   == Champion ==
   * Edward J. Yoon edwardyoon AT apache DOT org
   == Nominated Mentors ==
   * Alex Karasulu akarasulu AT apache DOT org
  ...
 
  Is Edward going to stay on as a mentor as well?
 
  Two (active) mentors is the bare minimum IMO.
 
 
  I suspect so but let's hear from Edward himself.
 
  Best Regards,
  -- Alex



 --
 Best Regards, Edward J. Yoon
 @eddieyoon

 -
 To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
 For additional commands, e-mail: general-h...@incubator.apache.org





-- 
Best Regards, Edward J. Yoon
@eddieyoon




Re: [PROPOSAL] MRQL for the Apache Incubator

2013-03-06 Thread Mohammad Nour El-Din
I added myself as a mentor. Welcome aboard.


On Wed, Mar 6, 2013 at 9:02 AM, Edward J. Yoon edwardy...@apache.org wrote:

 I think it's time to call for a vote.

 On Mon, Mar 4, 2013 at 9:25 PM, Tommaso Teofili
 tommaso.teof...@gmail.com wrote:
  Nice proposal indeed, I'd say having 3 mentors is usually better to avoid
  release headaches.
  Regards,
  Tommaso
 
 
  2013/3/4 Edward J. Yoon edwardy...@apache.org
 
  Sure I can. :)
 
  Of course, we'll welcome more mentors from incubator IPMC if there're
  volunteers.
 
  On Mon, Mar 4, 2013 at 7:34 PM, Alex Karasulu akaras...@apache.org
  wrote:
   On Mon, Mar 4, 2013 at 12:31 PM, Bertrand Delacretaz 
  bdelacre...@apache.org
   wrote:
  
   On Sat, Mar 2, 2013 at 7:12 AM, Leonidas Fegaras 
 fega...@cse.uta.edu
   wrote:
== Champion ==
* Edward J. Yoon edwardyoon AT apache DOT org
== Nominated Mentors ==
* Alex Karasulu akarasulu AT apache DOT org
   ...
  
   Is Edward going to stay on as a mentor as well?
  
   Two (active) mentors is the bare minimum IMO.
  
  
   I suspect so but let's hear from Edward himself.
  
   Best Regards,
   -- Alex
 
 
 
  --
  Best Regards, Edward J. Yoon
  @eddieyoon
 
 
 



 --
 Best Regards, Edward J. Yoon
 @eddieyoon





-- 
Thanks
- Mohammad Nour

Life is like riding a bicycle. To keep your balance you must keep moving
- Albert Einstein


Re: [PROPOSAL] MRQL for the Apache Incubator

2013-03-04 Thread Bertrand Delacretaz
On Sat, Mar 2, 2013 at 7:12 AM, Leonidas Fegaras fega...@cse.uta.edu wrote:
 == Champion ==
 * Edward J. Yoon edwardyoon AT apache DOT org
 == Nominated Mentors ==
 * Alex Karasulu akarasulu AT apache DOT org
...

Is Edward going to stay on as a mentor as well?

Two (active) mentors is the bare minimum IMO.

-Bertrand




Re: [PROPOSAL] MRQL for the Apache Incubator

2013-03-04 Thread Alex Karasulu
On Mon, Mar 4, 2013 at 12:31 PM, Bertrand Delacretaz bdelacre...@apache.org
 wrote:

 On Sat, Mar 2, 2013 at 7:12 AM, Leonidas Fegaras fega...@cse.uta.edu
 wrote:
  == Champion ==
  * Edward J. Yoon edwardyoon AT apache DOT org
  == Nominated Mentors ==
  * Alex Karasulu akarasulu AT apache DOT org
 ...

 Is Edward going to stay on as a mentor as well?

 Two (active) mentors is the bare minimum IMO.


I suspect so but let's hear from Edward himself.

Best Regards,
-- Alex


Re: [PROPOSAL] MRQL for the Apache Incubator

2013-03-04 Thread Edward J. Yoon
Sure I can. :)

Of course, we'll welcome more mentors from incubator IPMC if there're
volunteers.

On Mon, Mar 4, 2013 at 7:34 PM, Alex Karasulu akaras...@apache.org wrote:
 On Mon, Mar 4, 2013 at 12:31 PM, Bertrand Delacretaz bdelacre...@apache.org
 wrote:

 On Sat, Mar 2, 2013 at 7:12 AM, Leonidas Fegaras fega...@cse.uta.edu
 wrote:
  == Champion ==
  * Edward J. Yoon edwardyoon AT apache DOT org
  == Nominated Mentors ==
  * Alex Karasulu akarasulu AT apache DOT org
 ...

 Is Edward going to stay on as a mentor as well?

 Two (active) mentors is the bare minimum IMO.


 I suspect so but let's hear from Edward himself.

 Best Regards,
 -- Alex



-- 
Best Regards, Edward J. Yoon
@eddieyoon




Re: [PROPOSAL] MRQL for the Apache Incubator

2013-03-04 Thread Tommaso Teofili
Nice proposal indeed, I'd say having 3 mentors is usually better to avoid
release headaches.
Regards,
Tommaso


2013/3/4 Edward J. Yoon edwardy...@apache.org

 Sure I can. :)

 Of course, we'll welcome more mentors from incubator IPMC if there're
 volunteers.

 On Mon, Mar 4, 2013 at 7:34 PM, Alex Karasulu akaras...@apache.org
 wrote:
  On Mon, Mar 4, 2013 at 12:31 PM, Bertrand Delacretaz 
 bdelacre...@apache.org
  wrote:
 
  On Sat, Mar 2, 2013 at 7:12 AM, Leonidas Fegaras fega...@cse.uta.edu
  wrote:
   == Champion ==
   * Edward J. Yoon edwardyoon AT apache DOT org
   == Nominated Mentors ==
   * Alex Karasulu akarasulu AT apache DOT org
  ...
 
  Is Edward going to stay on as a mentor as well?
 
  Two (active) mentors is the bare minimum IMO.
 
 
  I suspect so but let's hear from Edward himself.
 
  Best Regards,
  -- Alex



 --
 Best Regards, Edward J. Yoon
 @eddieyoon





[PROPOSAL] MRQL for the Apache Incubator

2013-03-02 Thread Leonidas Fegaras

Dear ASF members,

We would like to propose a new project to the incubator, called MRQL.
Edward J. Yoon has volunteered to be the champion for this project.
The proposal draft is available at:

http://wiki.apache.org/incubator/MRQLProposal

We are very excited about having this opportunity to work with ASF to
create an incubator project. We are looking forward to your feedback
and suggestions.
Best regards
Leonidas Fegaras


= Abstract =

MRQL is a query processing and optimization system for large-scale,
distributed data analysis, built on top of Apache Hadoop and Hama.

= Proposal =

MRQL (pronounced ''miracle'') is a query processing and optimization
system for large-scale, distributed data analysis. MRQL (the MapReduce
Query Language) is an SQL-like query language for large-scale data
analysis on a cluster of computers. The MRQL query processing system
can evaluate MRQL queries in two modes: in MapReduce mode on top of
Apache Hadoop or in Bulk Synchronous Parallel (BSP) mode on top of
Apache Hama. The MRQL query language is powerful enough to express
most common data analysis tasks over many forms of raw ''in-situ''
data, such as XML and JSON documents, binary files, and CSV
documents. MRQL is more powerful than other current high-level
MapReduce languages, such as Hive and PigLatin, since it can operate
on more complex data and supports more powerful query constructs, thus
eliminating the need for using explicit MapReduce code. With MRQL,
users will be able to express complex data analysis tasks, such as
PageRank, k-means clustering, matrix factorization, etc., using
SQL-like queries exclusively, while the MRQL query processing system
will be able to compile these queries to efficient Java code.

= Background =

The initial code was developed at the University of Texas at Arlington
(UTA) by a research team, led by Leonidas Fegaras. The software was
first released in May 2011. The original goal of this project was to
build a query processing system that translates SQL-like data analysis
queries to efficient workflows of MapReduce jobs. A design goal was to
use HDFS as the physical storage layer, without any indexing, data
partitioning, or data normalization, and to use Hadoop (without
extensions) as the run-time engine. The motivation behind this work
was to build a platform to test new ideas on query processing and
optimization techniques applicable to the MapReduce framework.

A year ago, MRQL was extended to run on Hama. The motivation for this
extension was that Hadoop MapReduce jobs were required to read their
input and write their output on HDFS. This simplifies reliability and
fault tolerance, but it imposes a high overhead on complex MapReduce
workflows and graph algorithms, such as PageRank, which require
repetitive jobs. In addition, Hadoop does not preserve data in memory
across consecutive MapReduce jobs. This restriction requires reading
data at every step, even when the data is constant. BSP, on the other
hand, does not suffer from this restriction, and, under certain
circumstances, allows complex repetitive algorithms to run entirely in
the collective memory of a cluster. Thus, the goal was to be able to
run the same MRQL queries in both modes, MapReduce and BSP, without
modifying the queries: If there are enough resources available, and
low latency and speed are more important than resilience, queries may
run in BSP mode; otherwise, the same queries may run in MapReduce
mode. BSP evaluation was found to be a good choice when fault
tolerance is not critical, data (both input and intermediate) can fit
in the cluster memory, and data processing requires complex/repetitive
steps.
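
The rule of thumb above for choosing an evaluation mode can be sketched
in a few lines of Python. This is only an illustration of the three
stated conditions, not MRQL code: the Workload class, its fields, and
the choose_mode function are hypothetical names introduced here, and a
real system would estimate data sizes rather than take them as given.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a query workload (not part of MRQL)."""
    fault_tolerance_critical: bool  # must the job survive node failures?
    data_bytes: int                 # input plus intermediate data
    cluster_memory_bytes: int       # collective memory of the cluster
    iterative: bool                 # complex/repetitive steps, e.g. PageRank


def choose_mode(w: Workload) -> str:
    """Return "BSP" only when all three conditions from the text hold."""
    fits_in_memory = w.data_bytes <= w.cluster_memory_bytes
    if not w.fault_tolerance_critical and fits_in_memory and w.iterative:
        return "BSP"
    # Otherwise fall back to MapReduce, which trades speed for resilience.
    return "MapReduce"


if __name__ == "__main__":
    pagerank = Workload(False, 50 * 2**30, 200 * 2**30, True)
    print(choose_mode(pagerank))
```

Since the same query is meant to run unchanged in either mode, a choice
like this would only affect how the query is evaluated, not how it is
written.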

The research results of this ongoing work have already been published
in conferences (WebDB'11, EDBT'12, and DataCloud'12) and the authors
have already received positive feedback from researchers in academia
and industry who were attending these conferences.

= Rationale =

* MRQL will be the first general-purpose, SQL-like query language for
data analysis based on BSP.
Currently, many programmers prefer to code their MapReduce
applications in a higher-level query language, rather than an
algorithmic language. For instance, Pig is used for 60% of Yahoo
MapReduce jobs, while Hive is used for 90% of Facebook MapReduce
jobs. This, we believe, will also be the trend for BSP applications,
because, even though, in principle, the BSP model is very simple to
understand, it is hard to develop, optimize, and maintain non-trivial
BSP applications coded in a general-purpose programming
language. Currently, there is no widely accepted declarative BSP
query language, although there are a few special-purpose BSP systems
for graph analysis, such as Google Pregel and Apache Giraph, for
machine learning, such as BSML, and for scientific data analysis.

* MRQL can capture many complex data analysis algorithms in
declarative form.
Existing MapReduce query languages, such as HiveQL and PigLatin,
provide a limited syntax for 

Re: [PROPOSAL] MRQL for the Apache Incubator

2013-03-02 Thread Mattmann, Chris A (388J)
Sounds awesome, guys. Looking forward to the VOTE.

Cheers,
Chris
