Re: [PROPOSAL] Apache AsterixDB Incubator

2015-01-21 Thread Mike Carey

Thanks, Steve!!  (We'd love to talk there, BTW; the challenge is doing
so w/a teaching day-job.  We'll see if we can find a volunteer who's
not schedule-conflicted that week...!)

On 1/21/15 2:44 AM, Steve Loughran wrote:

+1 for the proposal: I've a lot of respect for the team...I met some of
them at a workshop in Germany a few years back along with the (then)
Stratosphere project.

I would volunteer as a mentor except I'm fairly overcommitted with other
things (like the slider incubating project). If it does need rounding out
I'll add my name to the list.

Mike: note that you have until Feb 1 to get a proposal for a paper in for
ApacheCon: http://apachecon.com/
You might want to think about doing that, as it's a great way to get known
by the community.

-Steve


On 15 January 2015 at 02:21, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:


Hi Folks,

I am pleased to bring forth the Apache AsterixDB proposal to the
Apache Incubator as Champion, working in collaboration with the
team. Please find the wiki proposal here:

https://wiki.apache.org/incubator/AsterixDBProposal


Full text of the proposal is below. Please discuss and enjoy. I’ll
leave the discussion open for a week, and then look to call a VOTE
hopefully end of next week if all is well.

Cheers!
Chris Mattmann

=
Apache AsterixDB Proposal

Abstract

Apache AsterixDB is a scalable big data management system (BDMS) that
provides storage, management, and query capabilities for large
collections of semi-structured data.

Proposal

AsterixDB is a big data management system (BDMS) whose feature set
makes it well-suited to needs such as web data warehousing and social
data storage and analysis. Feature-wise, AsterixDB has:

* A NoSQL style data model (ADM) based on extending JSON with object
   database concepts.
* An expressive and declarative query language (AQL) for querying
   semi-structured data.
* A runtime query execution engine, Hyracks, for partitioned-parallel
   execution of query plans.
* Partitioned LSM-based data storage and indexing for efficient
   ingestion of newly arriving data.
* Support for querying and indexing external data (e.g., in HDFS) as
   well as data stored within AsterixDB.
* A rich set of primitive data types, including support for spatial,
   temporal, and textual data.
* Indexing options that include B+ trees, R trees, and inverted
   keyword index support.
* Basic transactional (concurrency and recovery) capabilities akin to
   those of a NoSQL store.


Background and Rationale

In the world of relational databases, the need to tackle data volumes
that exceed the capabilities of a single server led to the
development of “shared-nothing” parallel database systems several
decades ago. These systems spread data over a cluster based on a
partitioning strategy, such as hash partitioning, and queries are
processed by employing partitioned-parallel divide-and-conquer
techniques. Since these systems are fronted by a high-level,
declarative language (SQL), their users are shielded from the
complexities of parallel programming. Parallel database systems have
been an extremely successful application of parallel computing, and
quite a number of commercial products exist today.
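
A minimal sketch of the hash-partitioning idea described above, assuming
plain in-memory lists stand in for the nodes of a cluster. The function
names (partition_of, hash_partition, parallel_count_by) and the sample
records are hypothetical and are not taken from any of the systems
mentioned:

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    # Use a stable hash so a given key always lands in the same partition.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def hash_partition(records, key_field, num_partitions):
    # Spread records over partitions according to the hash of their key.
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        index = partition_of(record[key_field], num_partitions)
        partitions[index].append(record)
    return partitions


def parallel_count_by(partitions, key_field):
    # "Divide": each partition is aggregated on its own (here one after
    # another; a real system would run these on separate nodes).
    # "Conquer": the per-partition results are merged at the end.
    merged = Counter()
    for part in partitions:
        local = Counter(record[key_field] for record in part)
        merged.update(local)
    return dict(merged)


users = [{"city": "Irvine"}, {"city": "Riverside"}, {"city": "Irvine"}]
parts = hash_partition(users, "city", 4)
counts = parallel_count_by(parts, "city")
```

Because all records with the same key hash to the same partition, each
partition can be grouped and counted without looking at any other one.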

In the distributed systems world, the Web brought a need to index and
query its huge content. SQL and relational databases were not the
answer, though shared-nothing clusters again emerged as the hardware
platform of choice. Google developed the Google File System (GFS) and
the MapReduce programming model to allow programmers to store and process
Big Data by writing a few user-defined functions. The MapReduce
framework applies these functions in parallel to data instances in
distributed files (map) and to sorted groups of instances sharing a
common key (reduce) -- not unlike the partitioned parallelism in
parallel database systems. Apache's Hadoop MapReduce platform is the
most prominent implementation of this paradigm for the rest of the
Big Data community. On top of Hadoop and HDFS sit declarative
languages like Pig and Hive that each compile down to Hadoop
MapReduce jobs.
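
The map and reduce steps described above can be sketched on a single
machine as follows. This is only an illustration of the programming
model under simplifying assumptions (everything fits in memory, no
distribution or fault tolerance); map_reduce, word_map and word_reduce
are hypothetical names, not part of the Hadoop API:

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(records, map_fn, reduce_fn):
    # Map phase: apply the user-defined map function to every input
    # record; each call yields zero or more (key, value) pairs.
    pairs = [pair for record in records for pair in map_fn(record)]
    # Sort phase: order pairs by key so equal keys become adjacent.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: call the user-defined reduce function once per
    # group of values that share a common key.
    return {
        key: reduce_fn(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    }


def word_map(line):
    # Emit (word, 1) for every word in a line of text.
    for word in line.split():
        yield word.lower(), 1


def word_reduce(word, counts):
    # Sum the partial counts collected for one word.
    return sum(counts)


result = map_reduce(["big data", "Big Data systems"], word_map, word_reduce)
```

The programmer supplies only the two small functions; the framework
part (here the map_reduce helper) owns the iteration, sorting and
grouping.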

The big Web companies were also challenged by extreme user bases
(100s of millions of users) and needed fast simple lookups and
updates to very large keyed data sets like user profiles. SQL
databases were deemed either too expensive or not scalable, so the
“NoSQL movement” was born. The ASF now has HBase and Cassandra, two
popular key-value stores, in this space. MongoDB and Couchbase are
other open source alternatives (document stores).

It is evident from the rapidly growing popularity of NoSQL stores,
as well as the strong demand for Big Data analytics engines today,
that there is a strong (and growing!) need to store, process, *and*
query large volumes of semi-structured data in many application
areas. Until very recently, developers have had to “choose” between
using big data analytics engines like Apache Hive or 

Re: [PROPOSAL] Apache AsterixDB Incubator

2015-01-21 Thread Steve Loughran
+1 for the proposal: I've a lot of respect for the team...I met some of
them at a workshop in Germany a few years back along with the (then)
Stratosphere project.

I would volunteer as a mentor except I'm fairly overcommitted with other
things (like the slider incubating project). If it does need rounding out
I'll add my name to the list.

Mike: note that you have until Feb 1 to get a proposal for a paper in for
ApacheCon: http://apachecon.com/
You might want to think about doing that, as it's a great way to get known
by the community.

-Steve



Re: [PROPOSAL] Apache AsterixDB Incubator

2015-01-20 Thread Till Westmann
Hi Henry,

thanks! It’s great that you’ve seen (and liked) AsterixDB before.

Even if your time is very limited we would be very happy to have you on board 
as a mentor.
I’ll add you to the proposal.

Cheers,
Till

 On Jan 19, 2015, at 10:26 AM, Henry Saputra henry.sapu...@gmail.com wrote:
 
 +1 This is GREAT News!
 
 I was watching and trying AsterixDB last year, and it looked to be in awesome shape.
 
 I have my plate full but would love to help mentor this project to get
 it going to ASF if needed!
 
 - Henry
 
 On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
 chris.a.mattm...@jpl.nasa.gov wrote:
 Hi Folks,
 
 I am pleased to bring forth the Apache AsterixDB proposal to the
 Apache Incubator as Champion, working in collaboration with the
 team. Please find the wiki proposal here:
 
 https://wiki.apache.org/incubator/AsterixDBProposal
 
 
 Full text of the proposal is below. Please discuss and enjoy. I’ll
 leave the discussion open for a week, and then look to call a VOTE
 hopefully end of next week if all is well.
 
 Cheers!
 Chris Mattmann
 
 =
 Apache AsterixDB Proposal
 
 Abstract
 
 Apache AsterixDB is a scalable big data management system (BDMS) that
 provides storage, management, and query capabilities for large
 collections of semi-structured data.
 
 Proposal
 
 AsterixDB is a big data management system (BDMS) whose feature set
 makes it well-suited to needs such as web data warehousing and social
 data storage and analysis. Feature-wise, AsterixDB has:
 
 * A NoSQL style data model (ADM) based on extending JSON with object
  database concepts.
 * An expressive and declarative query language (AQL) for querying
  semi-structured data.
 * A runtime query execution engine, Hyracks, for partitioned-parallel
  execution of query plans.
 * Partitioned LSM-based data storage and indexing for efficient
  ingestion of newly arriving data.
 * Support for querying and indexing external data (e.g., in HDFS) as
  well as data stored within AsterixDB.
 * A rich set of primitive data types, including support for spatial,
  temporal, and textual data.
 * Indexing options that include B+ trees, R trees, and inverted
  keyword index support.
 * Basic transactional (concurrency and recovery) capabilities akin to
  those of a NoSQL store.
 
 
 Background and Rationale
 
 In the world of relational databases, the need to tackle data volumes
 that exceed the capabilities of a single server led to the
 development of “shared-nothing” parallel database systems several
 decades ago. These systems spread data over a cluster based on a
 partitioning strategy, such as hash partitioning, and queries are
 processed by employing partitioned-parallel divide-and-conquer
 techniques. Since these systems are fronted by a high-level,
 declarative language (SQL), their users are shielded from the
 complexities of parallel programming. Parallel database systems have
 been an extremely successful application of parallel computing, and
 quite a number of commercial products exist today.
 
 In the distributed systems world, the Web brought a need to index and
 query its huge content. SQL and relational databases were not the
 answer, though shared-nothing clusters again emerged as the hardware
 platform of choice. Google developed the Google File System (GFS) and
 the MapReduce programming model to allow programmers to store and process
 Big Data by writing a few user-defined functions. The MapReduce
 framework applies these functions in parallel to data instances in
 distributed files (map) and to sorted groups of instances sharing a
 common key (reduce) -- not unlike the partitioned parallelism in
 parallel database systems. Apache's Hadoop MapReduce platform is the
 most prominent implementation of this paradigm for the rest of the
 Big Data community. On top of Hadoop and HDFS sit declarative
 languages like Pig and Hive that each compile down to Hadoop
 MapReduce jobs.
 
 The big Web companies were also challenged by extreme user bases
 (100s of millions of users) and needed fast simple lookups and
 updates to very large keyed data sets like user profiles. SQL
 databases were deemed either too expensive or not scalable, so the
 “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
 popular key-value stores, in this space. MongoDB and Couchbase are
 other open source alternatives (document stores).
 
 It is evident from the rapidly growing popularity of NoSQL stores,
 as well as the strong demand for Big Data analytics engines today,
 that there is a strong (and growing!) need to store, process, *and*
 query large volumes of semi-structured data in many application
 areas. Until very recently, developers have had to “choose” between
 using big data analytics engines like Apache Hive or Apache Spark,
 which can do complex query processing and analysis over HDFS-resident
 files, and flexible but low-function data stores like MongoDB or
 Apache HBase. (The Apache Phoenix 

Re: [PROPOSAL] Apache AsterixDB Incubator

2015-01-20 Thread Till Westmann

 On Jan 19, 2015, at 11:34 AM, jan i j...@apache.org wrote:
 
 Looks like a really challenging project, and the proposal looks as if it has 
 already been through a couple of refinement rounds.
 
 Count on my +1, when it comes to voting.

Will do!

Thanks,
Till

 
 rgds
 jan i
 

Re: [PROPOSAL] Apache AsterixDB Incubator

2015-01-20 Thread Till Westmann
Thank you.
So far we’ve added 3 slots for mentors on the proposal - I hope that’ll be 
sufficient even for the relatively large number of new committers.

Till

 On Jan 19, 2015, at 8:17 PM, Henry Saputra henry.sapu...@gmail.com wrote:
 
 Thanks Till,
 
 Will try to solicit more mentors to help.
 Especially since most of the initial committers have not been exposed to
 contributing the Apache way.
 
 - Henry
 

Re: [PROPOSAL] Apache AsterixDB Incubator

2015-01-20 Thread Alan D. Cabrera
Should be fine.


Regards,
Alan

 On Jan 19, 2015, at 8:27 PM, Till Westmann t...@westmann.org wrote:
 
 Thank you.
 So far we’ve added 3 slots for mentors on the proposal - I hope that’ll be 
 sufficient even for the relatively large number of new committers.
 
 Till
 

Re: [PROPOSAL] Apache AsterixDB Incubator

2015-01-20 Thread Mike Carey

Excellent; thanks, Jochen!!
Cheers,
Mike

On 1/19/15 11:44 PM, Jochen Wiedmann wrote:

Hi, Chris,

I am interested in the proposal and (following up to my involvement
with VXQuery in the past) would like to offer myself as a mentor.

Jochen


On Thu, Jan 15, 2015 at 3:21 AM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:

Hi Folks,

I am pleased to bring forth the Apache AsterixDB proposal to the
Apache Incubator as Champion, working in collaboration with the
team. Please find the wiki proposal here:

https://wiki.apache.org/incubator/AsterixDBProposal


Full text of the proposal is below. Please discuss and enjoy. I’ll
leave the discussion open for a week, and then look to call a VOTE
hopefully end of next week if all is well.

Cheers!
Chris Mattmann

=
Apache AsterixDB Proposal

Abstract

Apache AsterixDB is a scalable big data management system (BDMS) that
provides storage, management, and query capabilities for large
collections of semi-structured data.

Proposal

AsterixDB is a big data management system (BDMS) that makes it
well-suited to needs such as web data warehousing and social data
storage and analysis. Feature-wise, AsterixDB has:

* A NoSQL style data model (ADM) based on extending JSON with object
   database concepts.
* An expressive and declarative query language (AQL) for querying
   semi-structured data.
* A runtime query execution engine, Hyracks, for partitioned-parallel
   execution of query plans.
* Partitioned LSM-based data storage and indexing for efficient
   ingestion of newly arriving data.
* Support for querying and indexing external data (e.g., in HDFS) as
   well as data stored within AsterixDB.
* A rich set of primitive data types, including support for spatial,
   temporal, and textual data.
* Indexing options that include B+ trees, R trees, and inverted
   keyword index support.
* Basic transactional (concurrency and recovery) capabilities akin to
   those of a NoSQL store.


Background and Rationale

In the world of relational databases, the need to tackle data volumes
that exceed the capabilities of a single server led to the
development of “shared-nothing” parallel database systems several
decades ago. These systems spread data over a cluster based on a
partitioning strategy, such as hash partitioning, and queries are
processed by employing partitioned-parallel divide-and-conquer
techniques. Since these systems are fronted by a high-level,
declarative language (SQL), their users are shielded from the
complexities of parallel programming. Parallel database systems have
been an extremely successful application of parallel computing, and
quite a number of commercial products exist today.
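
To make the partitioned-parallel idea above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how AsterixDB or any particular parallel database implements it: the names (partition_of, hash_partition, parallel_sum_by_key), the four-partition "cluster", and the use of CRC32 as the hash are all hypothetical, and threads stand in for separate servers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_PARTITIONS = 4  # hypothetical cluster size, chosen for illustration


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition using a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def hash_partition(records, num_partitions=NUM_PARTITIONS):
    """Spread (key, value) records over partitions by hashing the key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_of(key, num_partitions)].append((key, value))
    return partitions


def local_sum(partition):
    """Per-partition work: sum values by key using only local data."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


def parallel_sum_by_key(records, num_partitions=NUM_PARTITIONS):
    """Divide and conquer: aggregate each partition independently, then combine."""
    partitions = hash_partition(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(local_sum, partitions))
    combined = {}
    for partial in partials:
        # Equal keys always hash to the same partition, so keys never overlap.
        combined.update(partial)
    return combined
```

Because all records with the same key land in the same partition, each partition can compute its part of the answer without talking to the others, which is the property a declarative front end such as SQL hides from its users.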

In the distributed systems world, the Web brought a need to index and
query its huge content. SQL and relational databases were not the
answer, though shared-nothing clusters again emerged as the hardware
platform of choice. Google developed the Google File System (GFS) and
MapReduce programming model to allow programmers to store and process
Big Data by writing a few user-defined functions. The MapReduce
framework applies these functions in parallel to data instances in
distributed files (map) and to sorted groups of instances sharing a
common key (reduce) -- not unlike the partitioned parallelism in
parallel database systems. Apache's Hadoop MapReduce platform is the
most prominent implementation of this paradigm for the rest of the
Big Data community. On top of Hadoop and HDFS sit declarative
languages like Pig and Hive that each compile down to Hadoop
MapReduce jobs.
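
The map and reduce phases described above can be sketched in a few lines of single-process Python. This is a toy illustration only, not Hadoop's API: run_mapreduce, map_fn, and reduce_fn are hypothetical names, and the word-count task is an assumed example. A real framework would run the map and reduce calls in parallel over distributed files.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(line):
    """User-defined map: emit (word, 1) for each word in one input line."""
    for word in line.split():
        yield (word.lower(), 1)


def reduce_fn(key, values):
    """User-defined reduce: combine all values that share a key."""
    return (key, sum(values))


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Apply map to every instance, sort by key, then reduce each key group."""
    # Map phase: each input instance is processed independently.
    pairs = [pair for item in inputs for pair in map_fn(item)]
    # Shuffle phase: sorting makes instances with equal keys adjacent.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reduce call per sorted group of equal keys.
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]
```

The programmer writes only the two small functions; the sorting and grouping step in between is what the framework supplies.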

The big Web companies were also challenged by extreme user bases
(100s of millions of users) and needed fast simple lookups and
updates to very large keyed data sets like user profiles. SQL
databases were deemed either too expensive or not scalable, so the
“NoSQL movement” was born. The ASF now has HBase and Cassandra, two
popular key-value stores, in this space. MongoDB and Couchbase are
other open source alternatives (document stores).

It is evident from the rapidly growing popularity of NoSQL stores,
as well as the strong demand for Big Data analytics engines today,
that there is a strong (and growing!) need to store, process, *and*
query large volumes of semi-structured data in many application
areas. Until very recently, developers have had to “choose” between
using big data analytics engines like Apache Hive or Apache Spark,
which can do complex query processing and analysis over HDFS-resident
files, and flexible but low-function data stores like MongoDB or
Apache HBase. (The Apache Phoenix project,
http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
aims to bridge between these choices.)

AsterixDB is a highly scalable data management system that can store,
index, and manage semi-structured data, e.g., much like MongoDB, but
it also supports a full-power query language with the expressiveness
of SQL (and more). Unlike 

Re: [PROPOSAL] Apache AsterixDB Incubator

2015-01-20 Thread Ted Dunning
Added my name to the mentor list.



On Tue, Jan 20, 2015 at 8:37 AM, Mike Carey dtab...@gmail.com wrote:

  Wonderful; thanks, Ted!!
 Cheers,
 Mike

  On 1/19/15 11:29 PM, Ted Dunning wrote:


 Chris just asked me under separate cover.

  I am happy to help out as mentor.



 On Mon, Jan 19, 2015 at 8:17 PM, Henry Saputra henry.sapu...@gmail.com
 wrote:

 Thanks Till,

 Will try to solicit more mentors to help.
 Especially since the initial committers mostly have not been exposed to
 contributing the Apache way.

 - Henry

 On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann t...@westmann.org wrote:
  Hi Henry,
 
  thanks! It’s great that you’ve seen (and liked) AsterixDB before.
 
  Even if your time is very limited we would be very happy to have you on
 board as a mentor.
  I’ll add you to the proposal.
 
  Cheers,
  Till
 
  On Jan 19, 2015, at 10:26 AM, Henry Saputra henry.sapu...@gmail.com
 wrote:
 
  +1 This is GREAT News!
 
  Was watching and trying AsterixDB last year and looked in awesome
 shape.
 
  I have my plate full but would love to help mentor this project to get
  it going to ASF if needed!
 
  - Henry
 

Re: [PROPOSAL] Apache AsterixDB Incubator

2015-01-19 Thread Mike Carey

Ditto - thanks for the support!
Cheers,
Mike

On 1/19/15 5:39 PM, Till Westmann wrote:


On Jan 19, 2015, at 11:34 AM, jan i j...@apache.org wrote:


Looks like a real challenging project, and the proposal looks as if 
it has already been through a couple of refinement rounds.


Count on my +1, when it comes to voting.


Will do!

Thanks,
Till



rgds
jan i

On 19 January 2015 at 19:26, Henry Saputra henry.sapu...@gmail.com wrote:


+1 This is GREAT News!

Was watching and trying AsterixDB last year and looked in awesome
shape.

I have my plate full but would love to help mentor this project to get
it going to ASF if needed!

- Henry


Re: [PROPOSAL] Apache AsterixDB Incubator

2015-01-19 Thread Mike Carey

Indeed - thanks!!
Cheers,
Mike

On 1/19/15 5:28 PM, Till Westmann wrote:

Hi Henry,

thanks! It’s great that you’ve seen (and liked) AsterixDB before.

Even if your time is very limited we would be very happy to have you on board 
as a mentor.
I’ll add you to the proposal.

Cheers,
Till


On Jan 19, 2015, at 10:26 AM, Henry Saputra henry.sapu...@gmail.com wrote:

+1 This is GREAT News!

Was watching and trying AsterixDB last year and looked in awesome shape.

I have my plate full but would love to help mentor this project to get
it going to ASF if needed!

- Henry

Re: [PROPOSAL] Apache AsterixDB Incubator

2015-01-19 Thread Henry Saputra
Thanks Till,

Will try to solicit more mentors to help.
Especially since the initial committers mostly have not been exposed to
contributing the Apache way.

- Henry

On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann t...@westmann.org wrote:
 Hi Henry,

 thanks! It’s great that you’ve seen (and liked) AsterixDB before.

 Even if your time is very limited we would be very happy to have you on board 
 as a mentor.
 I’ll add you to the proposal.

 Cheers,
 Till

 On Jan 19, 2015, at 10:26 AM, Henry Saputra henry.sapu...@gmail.com wrote:

 +1 This is GREAT News!

 Was watching and trying AsterixDB last year and looked in awesome shape.

 I have my plate full but would love to help mentor this project to get
 it going to ASF if needed!

 - Henry


Re: [PROPOSAL] Apache AsterixDB Incubator

2015-01-19 Thread Ted Dunning
Chris just asked me under separate cover.

I am happy to help out as mentor.



On Mon, Jan 19, 2015 at 8:17 PM, Henry Saputra henry.sapu...@gmail.com
wrote:

 Thanks Till,

 Will try to solicit more mentors to help.
 Especially since the initial committers mostly have not been exposed to
 contributing the Apache way.

 - Henry

 On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann t...@westmann.org wrote:
  Hi Henry,
 
  thanks! It’s great that you’ve seen (and liked) AsterixDB before.
 
  Even if your time is very limited we would be very happy to have you on
 board as a mentor.
  I’ll add you to the proposal.
 
  Cheers,
  Till
 
  On Jan 19, 2015, at 10:26 AM, Henry Saputra henry.sapu...@gmail.com
 wrote:
 
  +1 This is GREAT News!
 
  Was watching and trying AsterixDB last year and looked in awesome shape.
 
  I have my plate full but would love to help mentor this project to get
  it going to ASF if needed!
 
  - Henry
 

Re: [PROPOSAL] Apache AsterixDB Incubator

2015-01-19 Thread jan i
Looks like a real challenging project, and the proposal looks as if it has
already been through a couple of refinement rounds.

Count on my +1, when it comes to voting.

rgds
jan i

On 19 January 2015 at 19:26, Henry Saputra henry.sapu...@gmail.com wrote:

 +1 This is GREAT News!

 Was watching and trying AsterixDB last year and looked in awesome shape.

 I have my plate full but would love to help mentor this project to get
 it going to ASF if needed!

 - Henry


Re: [PROPOSAL] Apache AsterixDB Incubator

2015-01-19 Thread Henry Saputra
+1 This is GREAT News!

I was watching and trying AsterixDB last year, and it looked to be in awesome shape.

I have my plate full but would love to help mentor this project to get
it going to ASF if needed!

- Henry

On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:

Re: [PROPOSAL] Apache AsterixDB Incubator

2015-01-14 Thread Till Westmann
Hi,

if you read the proposal all the way to the end you will see that - while we do 
have some community and code - we don’t have mentors.
So if you like the proposal, please volunteer.

Cheers,
Till

 On Jan 14, 2015, at 6:21 PM, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:
 

[PROPOSAL] Apache AsterixDB Incubator

2015-01-14 Thread Mattmann, Chris A (3980)
Hi Folks,

I am pleased to bring forth the Apache AsterixDB proposal to the
Apache Incubator as Champion, working in collaboration with the
team. Please find the wiki proposal here:

https://wiki.apache.org/incubator/AsterixDBProposal


Full text of the proposal is below. Please discuss and enjoy. I’ll
leave the discussion open for a week, and then look to call a VOTE
hopefully end of next week if all is well.

Cheers!
Chris Mattmann

=
Apache AsterixDB Proposal

Abstract

Apache AsterixDB is a scalable big data management system (BDMS) that
provides storage, management, and query capabilities for large
collections of semi-structured data.

Proposal

AsterixDB is a big data management system (BDMS) with a feature set
that makes it well-suited to needs such as web data warehousing and
social data storage and analysis. Specifically, AsterixDB has:

* A NoSQL style data model (ADM) based on extending JSON with object
  database concepts.
* An expressive and declarative query language (AQL) for querying
  semi-structured data.
* A runtime query execution engine, Hyracks, for partitioned-parallel
  execution of query plans.
* Partitioned LSM-based data storage and indexing for efficient
  ingestion of newly arriving data.
* Support for querying and indexing external data (e.g., in HDFS) as
  well as data stored within AsterixDB.
* A rich set of primitive data types, including support for spatial,
  temporal, and textual data.
* Indexing options that include B+ trees, R trees, and inverted
  keyword index support.
* Basic transactional (concurrency and recovery) capabilities akin to
  those of a NoSQL store.


Background and Rationale

In the world of relational databases, the need to tackle data volumes
that exceed the capabilities of a single server led to the
development of “shared-nothing” parallel database systems several
decades ago. These systems spread data over a cluster based on a
partitioning strategy, such as hash partitioning, and queries are
processed by employing partitioned-parallel divide-and-conquer
techniques. Since these systems are fronted by a high-level,
declarative language (SQL), their users are shielded from the
complexities of parallel programming. Parallel database systems have
been an extremely successful application of parallel computing, and
quite a number of commercial products exist today.
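
To make the partitioned-parallel idea concrete, here is a minimal
sketch in Python. It is only an illustration, not AsterixDB or Hyracks
code: the record layout, the function names, and the use of threads to
stand in for cluster nodes are all assumptions made for this example.
Records are spread over partitions by hashing a key, each partition
computes a local aggregate independently, and the partial results are
then merged.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition_of(key: str, num_partitions: int) -> int:
    # crc32 is stable across runs, unlike Python's built-in hash() on str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def hash_partition(records, key_field, num_partitions):
    # Route every record to exactly one partition based on its key.
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        target = partition_of(record[key_field], num_partitions)
        partitions[target].append(record)
    return partitions


def local_count(partition, group_field):
    # The "divide" step: each partition aggregates only its own data.
    return Counter(record[group_field] for record in partition)


def parallel_group_count(partitions, group_field):
    # Threads stand in for cluster nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(lambda p: local_count(p, group_field), partitions))
    # The "conquer" step: merge the per-partition partial aggregates.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Hypothetical sample data for illustration only.
users = [
    {"id": "u1", "city": "Irvine"},
    {"id": "u2", "city": "Riverside"},
    {"id": "u3", "city": "Irvine"},
    {"id": "u4", "city": "Pasadena"},
    {"id": "u5", "city": "Irvine"},
]

partitions = hash_partition(users, "id", 4)
city_counts = parallel_group_count(partitions, "city")
```

The caller only asks for a count per city; how the data is split and
how the partial results are combined stays hidden, which is the role
the declarative language plays in a real parallel database.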

In the distributed systems world, the Web brought a need to index and
query its huge content. SQL and relational databases were not the
answer, though shared-nothing clusters again emerged as the hardware
platform of choice. Google developed the Google File System (GFS) and
the MapReduce programming model to allow programmers to store and process
Big Data by writing a few user-defined functions. The MapReduce
framework applies these functions in parallel to data instances in
distributed files (map) and to sorted groups of instances sharing a
common key (reduce) -- not unlike the partitioned parallelism in
parallel database systems. Apache's Hadoop MapReduce platform is the
most prominent implementation of this paradigm for the rest of the
Big Data community. On top of Hadoop and HDFS sit declarative
languages like Pig and Hive that each compile down to Hadoop
MapReduce jobs.
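
As a rough illustration of the map and reduce steps described above,
here is a single-process Python sketch of a word count. It is not the
Hadoop API: run_mapreduce, map_fn, and reduce_fn are hypothetical names
chosen for this example, and the in-memory sort stands in for the
distributed shuffle that a real framework performs.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(line):
    # User-defined map: emit a (key, value) pair per word in one input line.
    for word in line.lower().split():
        yield word, 1


def reduce_fn(key, values):
    # User-defined reduce: combine all values that share a key.
    return key, sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input instance.
    intermediate = [pair for item in inputs for pair in map_fn(item)]
    # Shuffle phase: sort so that pairs with the same key become adjacent.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: call reduce_fn once per sorted group of values.
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(intermediate, key=itemgetter(0))
    ]


word_counts = run_mapreduce(["big data", "Big Data systems"], map_fn, reduce_fn)
```

The programmer supplies only map_fn and reduce_fn; the grouping by key
in the middle is what makes the reduce step resemble the partitioned
parallelism of a parallel database.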

The big Web companies were also challenged by extreme user bases
(100s of millions of users) and needed fast, simple lookups and
updates to very large keyed data sets like user profiles. SQL
databases were deemed either too expensive or not scalable, so the
“NoSQL movement” was born. The ASF now has HBase and Cassandra, two
popular key-value stores, in this space. MongoDB and Couchbase are
other open source alternatives (document stores).

It is evident from the rapidly growing popularity of NoSQL stores,
as well as the strong demand for Big Data analytics engines today,
that there is a strong (and growing!) need to store, process, *and*
query large volumes of semi-structured data in many application
areas. Until very recently, developers have had to “choose” between
using big data analytics engines like Apache Hive or Apache Spark,
which can do complex query processing and analysis over HDFS-resident
files, and flexible but low-function data stores like MongoDB or
Apache HBase. (The Apache Phoenix project,
http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
aims to bridge between these choices.)

AsterixDB is a highly scalable data management system that can store,
index, and manage semi-structured data, e.g., much like MongoDB, but
it also supports a full-power query language with the expressiveness
of SQL (and more). Unlike analytics engines like Hive or Spark, it
stores and manages data, so AsterixDB can exploit its knowledge of
data partitioning and the availability of indexes to avoid always
scanning data set(s) to process queries. Somewhat surprisingly, there
is no open source parallel database system (relational or otherwise)
available to developers today --