Re: [PROPOSAL] Apache AsterixDB Incubator
Thanks, Steve!! (We'd love to talk there, BTW; the challenge is doing so w/a teaching day-job. We'll see if we can find a volunteer who's not schedule-conflicted that week...!)

On 1/21/15 2:44 AM, Steve Loughran wrote:

+1 for the proposal: I've a lot of respect for the team...I met some of them at a workshop in Germany a few years back along with the (then) Stratosphere project.

I would volunteer as a mentor except I'm fairly overcommitted with other things (like the slider incubating project). If it does need rounding out I'll add my name to the list.

Mike: note that you have until Feb 1 to get a proposal for a paper in for ApacheCon: http://apachecon.com/ You might want to think about doing that, as it's a great way to get known by the community.

-Steve

On 15 January 2015 at 02:21, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Hi Folks,

I am pleased to bring forth the Apache AsterixDB proposal to the Apache Incubator as Champion, working in collaboration with the team.

Please find the wiki proposal here: https://wiki.apache.org/incubator/AsterixDBProposal

Full text of the proposal is below. Please discuss and enjoy. I’ll leave the discussion open for a week, and then look to call a VOTE hopefully end of next week if all is well.

Cheers!
Chris Mattmann

Apache AsterixDB Proposal

Abstract

Apache AsterixDB is a scalable big data management system (BDMS) that provides storage, management, and query capabilities for large collections of semi-structured data.

Proposal

AsterixDB is a big data management system (BDMS) that makes it well-suited to needs such as web data warehousing and social data storage and analysis. Feature-wise, AsterixDB has:

* A NoSQL style data model (ADM) based on extending JSON with object database concepts.
* An expressive and declarative query language (AQL) for querying semi-structured data.
* A runtime query execution engine, Hyracks, for partitioned-parallel execution of query plans.
* Partitioned LSM-based data storage and indexing for efficient ingestion of newly arriving data.
* Support for querying and indexing external data (e.g., in HDFS) as well as data stored within AsterixDB.
* A rich set of primitive data types, including support for spatial, temporal, and textual data.
* Indexing options that include B+ trees, R trees, and inverted keyword index support.
* Basic transactional (concurrency and recovery) capabilities akin to those of a NoSQL store.

Background and Rationale

In the world of relational databases, the need to tackle data volumes that exceed the capabilities of a single server led to the development of “shared-nothing” parallel database systems several decades ago. These systems spread data over a cluster based on a partitioning strategy, such as hash partitioning, and queries are processed by employing partitioned-parallel divide-and-conquer techniques. Since these systems are fronted by a high-level, declarative language (SQL), their users are shielded from the complexities of parallel programming. Parallel database systems have been an extremely successful application of parallel computing, and quite a number of commercial products exist today.

In the distributed systems world, the Web brought a need to index and query its huge content. SQL and relational databases were not the answer, though shared-nothing clusters again emerged as the hardware platform of choice. Google developed the Google File System (GFS) and MapReduce programming model to allow programmers to store and process Big Data by writing a few user-defined functions. The MapReduce framework applies these functions in parallel to data instances in distributed files (map) and to sorted groups of instances sharing a common key (reduce) -- not unlike the partitioned parallelism in parallel database systems. Apache's Hadoop MapReduce platform is the most prominent implementation of this paradigm for the rest of the Big Data community. On top of Hadoop and HDFS sit declarative languages like Pig and Hive that each compile down to Hadoop MapReduce jobs.

The big Web companies were also challenged by extreme user bases (100s of millions of users) and needed fast simple lookups and updates to very large keyed data sets like user profiles. SQL databases were deemed either too expensive or not scalable, so the “NoSQL movement” was born. The ASF now has HBase and Cassandra, two popular key-value stores, in this space. MongoDB and Couchbase are other open source alternatives (document stores).

It is evident from the rapidly growing popularity of NoSQL stores, as well as the strong demand for Big Data analytics engines today, that there is a strong (and growing!) need to store, process, *and* query large volumes of semi-structured data in many application areas. Until very recently, developers have had to “choose” between using big data analytics engines like Apache Hive or
Re: [PROPOSAL] Apache AsterixDB Incubator
+1 for the proposal: I've a lot of respect for the team...I met some of them at a workshop in Germany a few years back along with the (then) Stratosphere project. I would volunteer as a mentor except I'm fairly overcommitted with other things (like the slider incubating project). If it does need rounding out I'll add my name to the list. Mike: note that you have until Feb 1 to get a proposal for a paper in for ApacheCon: http://apachecon.com/ You might want to think about doing that, as it's a great way to get known by the community. -Steve On 15 January 2015 at 02:21, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I am pleased to bring forth the Apache AsterixDB proposal to the Apache Incubator as Champion, working in collaboration with the team. Please find the wiki proposal here: https://wiki.apache.org/incubator/AsterixDBProposal Full text of the proposal is below. Please discuss and enjoy. I’ll leave the discussion open for a week, and then look to call a VOTE hopefully end of next week if all is well. Cheers! Chris Mattmann = Apache AsterixDB Proposal Abstract Apache AsterixDB is a scalable big data management system (BDMS) that provides storage, management, and query capabilities for large collections of semi-structured data. Proposal AsterixDB is a big data management system (BDMS) that makes it well-suited to needs such as web data warehousing and social data storage and analysis. Feature-wise, AsterixDB has: * A NoSQL style data model (ADM) based on extending JSON with object database concepts. * An expressive and declarative query language (AQL) for querying semi-structured data. * A runtime query execution engine, Hyracks, for partitioned-parallel execution of query plans. * Partitioned LSM-based data storage and indexing for efficient ingestion of newly arriving data. * Support for querying and indexing external data (e.g., in HDFS) as well as data stored within AsterixDB. 
* A rich set of primitive data types, including support for spatial, temporal, and textual data. * Indexing options that include B+ trees, R trees, and inverted keyword index support. * Basic transactional (concurrency and recovery) capabilities akin to those of a NoSQL store. Background and Rationale In the world of relational databases, the need to tackle data volumes that exceed the capabilities of a single server led to the development of “shared-nothing” parallel database systems several decades ago. These systems spread data over a cluster based on a partitioning strategy, such as hash partitioning, and queries are processed by employing partitioned-parallel divide-and-conquer techniques. Since these systems are fronted by a high-level, declarative language (SQL), their users are shielded from the complexities of parallel programming. Parallel database systems have been an extremely successful application of parallel computing, and quite a number of commercial products exist today. In the distributed systems world, the Web brought a need to index and query its huge content. SQL and relational databases were not the answer, though shared-nothing clusters again emerged as the hardware platform of choice. Google developed the Google File System (GFS) and MapReduce programming model to allow programmers to store and process Big Data by writing a few user-defined functions. The MapReduce framework applies these functions in parallel to data instances in distributed files (map) and to sorted groups of instances sharing a common key (reduce) -- not unlike the partitioned parallelism in parallel database systems. Apache's Hadoop MapReduce platform is the most prominent implementation of this paradigm for the rest of the Big Data community. On top of Hadoop and HDFS sit declarative languages like Pig and Hive that each compile down to Hadoop MapReduce jobs. 
The big Web companies were also challenged by extreme user bases (100s of millions of users) and needed fast simple lookups and updates to very large keyed data sets like user profiles. SQL databases were deemed either too expensive or not scalable, so the “NoSQL movement” was born. The ASF now has HBase and Cassandra, two popular key-value stores, in this space. MongoDB and Couchbase are other open source alternatives (document stores). It is evident from the rapidly growing popularity of NoSQL stores, as well as the strong demand for Big Data analytics engines today, that there is a strong (and growing!) need to store, process, *and* query large volumes of semi-structured data in many application areas. Until very recently, developers have had to “choose” between using big data analytics engines like Apache Hive or Apache Spark, which can do complex query processing and analysis over HDFS-resident files, and flexible but low-function data stores like MongoDB or Apache
Re: [PROPOSAL] Apache AsterixDB Incubator
Hi Henry, thanks! It’s great that you’ve seen (and liked) AsterixDB before. Even if your time is very limited we would be very happy to have you on board as a mentor. I’ll add you to the proposal. Cheers, Till On Jan 19, 2015, at 10:26 AM, Henry Saputra henry.sapu...@gmail.com wrote: +1 This is GREAT News! Was watching and trying AsterixDB last year and looked in awesome shape. I have my plate full but would love to help mentor this project to get it going to ASF if needed! - Henry On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I am pleased to bring forth the Apache AsterixDB proposal to the Apache Incubator as Champion, working in collaboration with the team. Please find the wiki proposal here: https://wiki.apache.org/incubator/AsterixDBProposal Full text of the proposal is below. Please discuss and enjoy. I’ll leave the discussion open for a week, and then look to call a VOTE hopefully end of next week if all is well. Cheers! Chris Mattmann = Apache AsterixDB Proposal Abstract Apache AsterixDB is a scalable big data management system (BDMS) that provides storage, management, and query capabilities for large collections of semi-structured data. Proposal AsterixDB is a big data management system (BDMS) that makes it well-suited to needs such as web data warehousing and social data storage and analysis. Feature-wise, AsterixDB has: * A NoSQL style data model (ADM) based on extending JSON with object database concepts. * An expressive and declarative query language (AQL) for querying semi-structured data. * A runtime query execution engine, Hyracks, for partitioned-parallel execution of query plans. * Partitioned LSM-based data storage and indexing for efficient ingestion of newly arriving data. * Support for querying and indexing external data (e.g., in HDFS) as well as data stored within AsterixDB. * A rich set of primitive data types, including support for spatial, temporal, and textual data. 
* Indexing options that include B+ trees, R trees, and inverted keyword index support. * Basic transactional (concurrency and recovery) capabilities akin to those of a NoSQL store. Background and Rationale In the world of relational databases, the need to tackle data volumes that exceed the capabilities of a single server led to the development of “shared-nothing” parallel database systems several decades ago. These systems spread data over a cluster based on a partitioning strategy, such as hash partitioning, and queries are processed by employing partitioned-parallel divide-and-conquer techniques. Since these systems are fronted by a high-level, declarative language (SQL), their users are shielded from the complexities of parallel programming. Parallel database systems have been an extremely successful application of parallel computing, and quite a number of commercial products exist today. In the distributed systems world, the Web brought a need to index and query its huge content. SQL and relational databases were not the answer, though shared-nothing clusters again emerged as the hardware platform of choice. Google developed the Google File System (GFS) and MapReduce programming model to allow programmers to store and process Big Data by writing a few user-defined functions. The MapReduce framework applies these functions in parallel to data instances in distributed files (map) and to sorted groups of instances sharing a common key (reduce) -- not unlike the partitioned parallelism in parallel database systems. Apache's Hadoop MapReduce platform is the most prominent implementation of this paradigm for the rest of the Big Data community. On top of Hadoop and HDFS sit declarative languages like Pig and Hive that each compile down to Hadoop MapReduce jobs. The big Web companies were also challenged by extreme user bases (100s of millions of users) and needed fast simple lookups and updates to very large keyed data sets like user profiles. 
SQL databases were deemed either too expensive or not scalable, so the “NoSQL movement” was born. The ASF now has HBase and Cassandra, two popular key-value stores, in this space. MongoDB and Couchbase are other open source alternatives (document stores). It is evident from the rapidly growing popularity of NoSQL stores, as well as the strong demand for Big Data analytics engines today, that there is a strong (and growing!) need to store, process, *and* query large volumes of semi-structured data in many application areas. Until very recently, developers have had to “choose” between using big data analytics engines like Apache Hive or Apache Spark, which can do complex query processing and analysis over HDFS-resident files, and flexible but low-function data stores like MongoDB or Apache HBase. (The Apache Phoenix
Re: [PROPOSAL] Apache AsterixDB Incubator
On Jan 19, 2015, at 11:34 AM, jan i j...@apache.org wrote: Looks like a real challenging project, and the proposal looks as if it has already been through a couple of refinement rounds. Count on my +1, when it comes to voting. Will do! Thanks, Till rgds jan i On 19 January 2015 at 19:26, Henry Saputra henry.sapu...@gmail.com wrote: +1 This is GREAT News! Was watching and trying AsterixDB last year and looked in awesome shape. I have my plate full but would love to help mentor this project to get it going to ASF if needed! - Henry On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I am pleased to bring forth the Apache AsterixDB proposal to the Apache Incubator as Champion, working in collaboration with the team. Please find the wiki proposal here: https://wiki.apache.org/incubator/AsterixDBProposal Full text of the proposal is below. Please discuss and enjoy. I’ll leave the discussion open for a week, and then look to call a VOTE hopefully end of next week if all is well. Cheers! Chris Mattmann = Apache AsterixDB Proposal Abstract Apache AsterixDB is a scalable big data management system (BDMS) that provides storage, management, and query capabilities for large collections of semi-structured data. Proposal AsterixDB is a big data management system (BDMS) that makes it well-suited to needs such as web data warehousing and social data storage and analysis. Feature-wise, AsterixDB has: * A NoSQL style data model (ADM) based on extending JSON with object database concepts. * An expressive and declarative query language (AQL) for querying semi-structured data. * A runtime query execution engine, Hyracks, for partitioned-parallel execution of query plans. * Partitioned LSM-based data storage and indexing for efficient ingestion of newly arriving data. 
* Support for querying and indexing external data (e.g., in HDFS) as well as data stored within AsterixDB. * A rich set of primitive data types, including support for spatial, temporal, and textual data. * Indexing options that include B+ trees, R trees, and inverted keyword index support. * Basic transactional (concurrency and recovery) capabilities akin to those of a NoSQL store. Background and Rationale In the world of relational databases, the need to tackle data volumes that exceed the capabilities of a single server led to the development of “shared-nothing” parallel database systems several decades ago. These systems spread data over a cluster based on a partitioning strategy, such as hash partitioning, and queries are processed by employing partitioned-parallel divide-and-conquer techniques. Since these systems are fronted by a high-level, declarative language (SQL), their users are shielded from the complexities of parallel programming. Parallel database systems have been an extremely successful application of parallel computing, and quite a number of commercial products exist today. In the distributed systems world, the Web brought a need to index and query its huge content. SQL and relational databases were not the answer, though shared-nothing clusters again emerged as the hardware platform of choice. Google developed the Google File System (GFS) and MapReduce programming model to allow programmers to store and process Big Data by writing a few user-defined functions. The MapReduce framework applies these functions in parallel to data instances in distributed files (map) and to sorted groups of instances sharing a common key (reduce) -- not unlike the partitioned parallelism in parallel database systems. Apache's Hadoop MapReduce platform is the most prominent implementation of this paradigm for the rest of the Big Data community. On top of Hadoop and HDFS sit declarative languages like Pig and Hive that each compile down to Hadoop MapReduce jobs. 
The big Web companies were also challenged by extreme user bases (100s of millions of users) and needed fast simple lookups and updates to very large keyed data sets like user profiles. SQL databases were deemed either too expensive or not scalable, so the “NoSQL movement” was born. The ASF now has HBase and Cassandra, two popular key-value stores, in this space. MongoDB and Couchbase are other open source alternatives (document stores). It is evident from the rapidly growing popularity of NoSQL stores, as well as the strong demand for Big Data analytics engines today, that there is a strong (and growing!) need to store, process, *and* query large volumes of semi-structured data in many application areas. Until very recently, developers have had
Re: [PROPOSAL] Apache AsterixDB Incubator
Thank you. So far we’ve added 3 slots for mentors on the proposal - I hope that’ll be sufficient even for the relatively large number of new committers. Till On Jan 19, 2015, at 8:17 PM, Henry Saputra henry.sapu...@gmail.com wrote: Thanks Till, Will try to solicit more mentors to help. Especially with initial committers mostly have not been exposed to contributing the Apache way. - Henry On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann t...@westmann.org wrote: Hi Henry, thanks! It’s great that you’ve seen (and liked) AsterixDB before. Even if your time is very limited we would be very happy to have you on board as a mentor. I’ll add you to the proposal. Cheers, Till On Jan 19, 2015, at 10:26 AM, Henry Saputra henry.sapu...@gmail.com wrote: +1 This is GREAT News! Was watching and trying AsterixDB last year and looked in awesome shape. I have my plate full but would love to help mentor this project to get it going to ASF if needed! - Henry On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I am pleased to bring forth the Apache AsterixDB proposal to the Apache Incubator as Champion, working in collaboration with the team. Please find the wiki proposal here: https://wiki.apache.org/incubator/AsterixDBProposal Full text of the proposal is below. Please discuss and enjoy. I’ll leave the discussion open for a week, and then look to call a VOTE hopefully end of next week if all is well. Cheers! Chris Mattmann = Apache AsterixDB Proposal Abstract Apache AsterixDB is a scalable big data management system (BDMS) that provides storage, management, and query capabilities for large collections of semi-structured data. Proposal AsterixDB is a big data management system (BDMS) that makes it well-suited to needs such as web data warehousing and social data storage and analysis. Feature-wise, AsterixDB has: * A NoSQL style data model (ADM) based on extending JSON with object database concepts. 
* An expressive and declarative query language (AQL) for querying semi-structured data. * A runtime query execution engine, Hyracks, for partitioned-parallel execution of query plans. * Partitioned LSM-based data storage and indexing for efficient ingestion of newly arriving data. * Support for querying and indexing external data (e.g., in HDFS) as well as data stored within AsterixDB. * A rich set of primitive data types, including support for spatial, temporal, and textual data. * Indexing options that include B+ trees, R trees, and inverted keyword index support. * Basic transactional (concurrency and recovery) capabilities akin to those of a NoSQL store. Background and Rationale In the world of relational databases, the need to tackle data volumes that exceed the capabilities of a single server led to the development of “shared-nothing” parallel database systems several decades ago. These systems spread data over a cluster based on a partitioning strategy, such as hash partitioning, and queries are processed by employing partitioned-parallel divide-and-conquer techniques. Since these systems are fronted by a high-level, declarative language (SQL), their users are shielded from the complexities of parallel programming. Parallel database systems have been an extremely successful application of parallel computing, and quite a number of commercial products exist today. In the distributed systems world, the Web brought a need to index and query its huge content. SQL and relational databases were not the answer, though shared-nothing clusters again emerged as the hardware platform of choice. Google developed the Google File System (GFS) and MapReduce programming model to allow programmers to store and process Big Data by writing a few user-defined functions. 
The MapReduce framework applies these functions in parallel to data instances in distributed files (map) and to sorted groups of instances sharing a common key (reduce) -- not unlike the partitioned parallelism in parallel database systems. Apache's Hadoop MapReduce platform is the most prominent implementation of this paradigm for the rest of the Big Data community. On top of Hadoop and HDFS sit declarative languages like Pig and Hive that each compile down to Hadoop MapReduce jobs. The big Web companies were also challenged by extreme user bases (100s of millions of users) and needed fast simple lookups and updates to very large keyed data sets like user profiles. SQL databases were deemed either too expensive or not scalable, so the “NoSQL movement” was born. The ASF now has HBase and Cassandra, two popular key-value stores, in this space. MongoDB and Couchbase are other open source alternatives (document stores). It is evident from the rapidly growing popularity of NoSQL stores, as well as the strong demand for
Re: [PROPOSAL] Apache AsterixDB Incubator
Should be fine. Regards, Alan On Jan 19, 2015, at 8:27 PM, Till Westmann t...@westmann.org wrote: Thank you. So far we’ve added 3 slots for mentors on the proposal - I hope that’ll be sufficient even for the relatively large number of new committers. Till On Jan 19, 2015, at 8:17 PM, Henry Saputra henry.sapu...@gmail.com wrote: Thanks Till, Will try to solicit more mentors to help. Especially with initial committers mostly have not been exposed to contributing the Apache way. - Henry On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann t...@westmann.org wrote: Hi Henry, thanks! It’s great that you’ve seen (and liked) AsterixDB before. Even if your time is very limited we would be very happy to have you on board as a mentor. I’ll add you to the proposal. Cheers, Till On Jan 19, 2015, at 10:26 AM, Henry Saputra henry.sapu...@gmail.com wrote: +1 This is GREAT News! Was watching and trying AsterixDB last year and looked in awesome shape. I have my plate full but would love to help mentor this project to get it going to ASF if needed! - Henry On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I am pleased to bring forth the Apache AsterixDB proposal to the Apache Incubator as Champion, working in collaboration with the team. Please find the wiki proposal here: https://wiki.apache.org/incubator/AsterixDBProposal Full text of the proposal is below. Please discuss and enjoy. I’ll leave the discussion open for a week, and then look to call a VOTE hopefully end of next week if all is well. Cheers! Chris Mattmann = Apache AsterixDB Proposal Abstract Apache AsterixDB is a scalable big data management system (BDMS) that provides storage, management, and query capabilities for large collections of semi-structured data. Proposal AsterixDB is a big data management system (BDMS) that makes it well-suited to needs such as web data warehousing and social data storage and analysis. 
Feature-wise, AsterixDB has: * A NoSQL style data model (ADM) based on extending JSON with object database concepts. * An expressive and declarative query language (AQL) for querying semi-structured data. * A runtime query execution engine, Hyracks, for partitioned-parallel execution of query plans. * Partitioned LSM-based data storage and indexing for efficient ingestion of newly arriving data. * Support for querying and indexing external data (e.g., in HDFS) as well as data stored within AsterixDB. * A rich set of primitive data types, including support for spatial, temporal, and textual data. * Indexing options that include B+ trees, R trees, and inverted keyword index support. * Basic transactional (concurrency and recovery) capabilities akin to those of a NoSQL store. Background and Rationale In the world of relational databases, the need to tackle data volumes that exceed the capabilities of a single server led to the development of “shared-nothing” parallel database systems several decades ago. These systems spread data over a cluster based on a partitioning strategy, such as hash partitioning, and queries are processed by employing partitioned-parallel divide-and-conquer techniques. Since these systems are fronted by a high-level, declarative language (SQL), their users are shielded from the complexities of parallel programming. Parallel database systems have been an extremely successful application of parallel computing, and quite a number of commercial products exist today. In the distributed systems world, the Web brought a need to index and query its huge content. SQL and relational databases were not the answer, though shared-nothing clusters again emerged as the hardware platform of choice. Google developed the Google File System (GFS) and MapReduce programming model to allow programmers to store and process Big Data by writing a few user-defined functions. 
The MapReduce framework applies these functions in parallel to data instances in distributed files (map) and to sorted groups of instances sharing a common key (reduce) -- not unlike the partitioned parallelism in parallel database systems. Apache's Hadoop MapReduce platform is the most prominent implementation of this paradigm for the rest of the Big Data community. On top of Hadoop and HDFS sit declarative languages like Pig and Hive that each compile down to Hadoop MapReduce jobs. The big Web companies were also challenged by extreme user bases (100s of millions of users) and needed fast simple lookups and updates to very large keyed data sets like user profiles. SQL databases were deemed either too expensive or not scalable, so the “NoSQL movement” was born. The ASF now has HBase and Cassandra, two popular key-value stores, in this space. MongoDB and Couchbase are other open source alternatives (document
Re: [PROPOSAL] Apache AsterixDB Incubator
Excellent; thanks, Jochen!! Cheers, Mike On 1/19/15 11:44 PM, Jochen Wiedmann wrote: Hi, Chris, I am interested in the proposal and (following up to my involvement with VXQuery in the past) would like to offer myself as a mentor. Jochen On Thu, Jan 15, 2015 at 3:21 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I am pleased to bring forth the Apache AsterixDB proposal to the Apache Incubator as Champion, working in collaboration with the team. Please find the wiki proposal here: https://wiki.apache.org/incubator/AsterixDBProposal Full text of the proposal is below. Please discuss and enjoy. I’ll leave the discussion open for a week, and then look to call a VOTE hopefully end of next week if all is well. Cheers! Chris Mattmann = Apache AsterixDB Proposal Abstract Apache AsterixDB is a scalable big data management system (BDMS) that provides storage, management, and query capabilities for large collections of semi-structured data. Proposal AsterixDB is a big data management system (BDMS) that makes it well-suited to needs such as web data warehousing and social data storage and analysis. Feature-wise, AsterixDB has: * A NoSQL style data model (ADM) based on extending JSON with object database concepts. * An expressive and declarative query language (AQL) for querying semi-structured data. * A runtime query execution engine, Hyracks, for partitioned-parallel execution of query plans. * Partitioned LSM-based data storage and indexing for efficient ingestion of newly arriving data. * Support for querying and indexing external data (e.g., in HDFS) as well as data stored within AsterixDB. * A rich set of primitive data types, including support for spatial, temporal, and textual data. * Indexing options that include B+ trees, R trees, and inverted keyword index support. * Basic transactional (concurrency and recovery) capabilities akin to those of a NoSQL store. 
Background and Rationale In the world of relational databases, the need to tackle data volumes that exceed the capabilities of a single server led to the development of “shared-nothing” parallel database systems several decades ago. These systems spread data over a cluster based on a partitioning strategy, such as hash partitioning, and queries are processed by employing partitioned-parallel divide-and-conquer techniques. Since these systems are fronted by a high-level, declarative language (SQL), their users are shielded from the complexities of parallel programming. Parallel database systems have been an extremely successful application of parallel computing, and quite a number of commercial products exist today. In the distributed systems world, the Web brought a need to index and query its huge content. SQL and relational databases were not the answer, though shared-nothing clusters again emerged as the hardware platform of choice. Google developed the Google File System (GFS) and MapReduce programming model to allow programmers to store and process Big Data by writing a few user-defined functions. The MapReduce framework applies these functions in parallel to data instances in distributed files (map) and to sorted groups of instances sharing a common key (reduce) -- not unlike the partitioned parallelism in parallel database systems. Apache's Hadoop MapReduce platform is the most prominent implementation of this paradigm for the rest of the Big Data community. On top of Hadoop and HDFS sit declarative languages like Pig and Hive that each compile down to Hadoop MapReduce jobs. The big Web companies were also challenged by extreme user bases (100s of millions of users) and needed fast simple lookups and updates to very large keyed data sets like user profiles. SQL databases were deemed either too expensive or not scalable, so the “NoSQL movement” was born. The ASF now has HBase and Cassandra, two popular key-value stores, in this space. 
MongoDB and Couchbase are other open source alternatives (document stores). It is evident from the rapidly growing popularity of NoSQL stores, as well as the strong demand for Big Data analytics engines today, that there is a strong (and growing!) need to store, process, *and* query large volumes of semi-structured data in many application areas. Until very recently, developers have had to “choose” between using big data analytics engines like Apache Hive or Apache Spark, which can do complex query processing and analysis over HDFS-resident files, and flexible but low-function data stores like MongoDB or Apache HBase. (The Apache Phoenix project, http://phoenix.apache.org/, is a recent SQL-over-HBase effort that aims to bridge between these choices.) AsterixDB is a highly scalable data management system that can store, index, and manage semi-structured data, e.g., much like MongoDB, but it also supports a full-power query language with the expressiveness of SQL (and more). Unlike
Re: [PROPOSAL] Apache AsterixDB Incubator
Added my name to the mentor list. On Tue, Jan 20, 2015 at 8:37 AM, Mike Carey dtab...@gmail.com wrote: Wonderful; thanks, Ted!! Cheers, Mike On 1/19/15 11:29 PM, Ted Dunning wrote: Chris just asked me under separate cover. I am happy to help out as mentor. On Mon, Jan 19, 2015 at 8:17 PM, Henry Saputra henry.sapu...@gmail.com wrote: Thanks Till, Will try to solicit more mentors to help. Especially with initial committers mostly have not been exposed to contributing the Apache way. - Henry On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann t...@westmann.org wrote: Hi Henry, thanks! It’s great that you’ve seen (and liked) AsterixDB before. Even if your time is very limited we would be very happy to have you on board as a mentor. I’ll add you to the proposal. Cheers, Till On Jan 19, 2015, at 10:26 AM, Henry Saputra henry.sapu...@gmail.com wrote: +1 This is GREAT News! Was watching and trying AsterixDB last year and looked in awesome shape. I have my plate full but would love to help mentor this project to get it going to ASF if needed! - Henry On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I am pleased to bring forth the Apache AsterixDB proposal to the Apache Incubator as Champion, working in collaboration with the team. Please find the wiki proposal here: https://wiki.apache.org/incubator/AsterixDBProposal Full text of the proposal is below. Please discuss and enjoy. I’ll leave the discussion open for a week, and then look to call a VOTE hopefully end of next week if all is well. Cheers! Chris Mattmann = Apache AsterixDB Proposal Abstract Apache AsterixDB is a scalable big data management system (BDMS) that provides storage, management, and query capabilities for large collections of semi-structured data. Proposal AsterixDB is a big data management system (BDMS) that makes it well-suited to needs such as web data warehousing and social data storage and analysis. 
Re: [PROPOSAL] Apache AsterixDB Incubator
Ditto - thanks for the support! Cheers, Mike On 1/19/15 5:39 PM, Till Westmann wrote: On Jan 19, 2015, at 11:34 AM, jan i j...@apache.org wrote: Looks like a real challenging project, and the proposal looks as if it has already been through a couple of refinement rounds. Count on my +1, when it comes to voting. Will do! Thanks, Till rgds jan i On 19 January 2015 at 19:26, Henry Saputra henry.sapu...@gmail.com wrote: +1 This is GREAT News! Was watching and trying AsterixDB last year and looked in awesome shape. I have my plate full but would love to help mentor this project to get it going to ASF if needed! - Henry On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I am pleased to bring forth the Apache AsterixDB proposal to the Apache Incubator as Champion, working in collaboration with the team. Please find the wiki proposal here: https://wiki.apache.org/incubator/AsterixDBProposal Full text of the proposal is below. Please discuss and enjoy. I’ll leave the discussion open for a week, and then look to call a VOTE hopefully end of next week if all is well. Cheers! Chris Mattmann = Apache AsterixDB Proposal Abstract Apache AsterixDB is a scalable big data management system (BDMS) that provides storage, management, and query capabilities for large collections of semi-structured data. Proposal AsterixDB is a big data management system (BDMS) that makes it well-suited to needs such as web data warehousing and social data storage and analysis. Feature-wise, AsterixDB has: * A NoSQL style data model (ADM) based on extending JSON with object database concepts. * An expressive and declarative query language (AQL) for querying semi-structured data. * A runtime query execution engine, Hyracks, for partitioned-parallel execution of query plans.
Re: [PROPOSAL] Apache AsterixDB Incubator
Indeed - thanks!! Cheers, Mike On 1/19/15 5:28 PM, Till Westmann wrote: Hi Henry, thanks! It’s great that you’ve seen (and liked) AsterixDB before. Even if your time is very limited we would be very happy to have you on board as a mentor. I’ll add you to the proposal. Cheers, Till On Jan 19, 2015, at 10:26 AM, Henry Saputra henry.sapu...@gmail.com wrote: +1 This is GREAT News! Was watching and trying AsterixDB last year and looked in awesome shape. I have my plate full but would love to help mentor this project to get it going to ASF if needed! - Henry On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I am pleased to bring forth the Apache AsterixDB proposal to the Apache Incubator as Champion, working in collaboration with the team. Please find the wiki proposal here: https://wiki.apache.org/incubator/AsterixDBProposal Full text of the proposal is below. Please discuss and enjoy. I’ll leave the discussion open for a week, and then look to call a VOTE hopefully end of next week if all is well. Cheers! Chris Mattmann = Apache AsterixDB Proposal Abstract Apache AsterixDB is a scalable big data management system (BDMS) that provides storage, management, and query capabilities for large collections of semi-structured data. Proposal AsterixDB is a big data management system (BDMS) that makes it well-suited to needs such as web data warehousing and social data storage and analysis. Feature-wise, AsterixDB has: * A NoSQL style data model (ADM) based on extending JSON with object database concepts. * An expressive and declarative query language (AQL) for querying semi-structured data. * A runtime query execution engine, Hyracks, for partitioned-parallel execution of query plans. * Partitioned LSM-based data storage and indexing for efficient ingestion of newly arriving data. * Support for querying and indexing external data (e.g., in HDFS) as well as data stored within AsterixDB. 
Re: [PROPOSAL] Apache AsterixDB Incubator
Thanks Till, Will try to solicit more mentors to help. Especially with initial committers mostly have not been exposed to contributing the Apache way. - Henry On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann t...@westmann.org wrote: Hi Henry, thanks! It’s great that you’ve seen (and liked) AsterixDB before. Even if your time is very limited we would be very happy to have you on board as a mentor. I’ll add you to the proposal. Cheers, Till On Jan 19, 2015, at 10:26 AM, Henry Saputra henry.sapu...@gmail.com wrote: +1 This is GREAT News! Was watching and trying AsterixDB last year and looked in awesome shape. I have my plate full but would love to help mentor this project to get it going to ASF if needed! - Henry On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I am pleased to bring forth the Apache AsterixDB proposal to the Apache Incubator as Champion, working in collaboration with the team. Please find the wiki proposal here: https://wiki.apache.org/incubator/AsterixDBProposal Full text of the proposal is below. Please discuss and enjoy. I’ll leave the discussion open for a week, and then look to call a VOTE hopefully end of next week if all is well. Cheers! Chris Mattmann = Apache AsterixDB Proposal Abstract Apache AsterixDB is a scalable big data management system (BDMS) that provides storage, management, and query capabilities for large collections of semi-structured data. Proposal AsterixDB is a big data management system (BDMS) that makes it well-suited to needs such as web data warehousing and social data storage and analysis. Feature-wise, AsterixDB has: * A NoSQL style data model (ADM) based on extending JSON with object database concepts. * An expressive and declarative query language (AQL) for querying semi-structured data. * A runtime query execution engine, Hyracks, for partitioned-parallel execution of query plans. 
Re: [PROPOSAL] Apache AsterixDB Incubator
Chris just asked me under separate cover. I am happy to help out as mentor. On Mon, Jan 19, 2015 at 8:17 PM, Henry Saputra henry.sapu...@gmail.com wrote: Thanks Till, Will try to solicit more mentors to help. Especially with initial committers mostly have not been exposed to contributing the Apache way. - Henry On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann t...@westmann.org wrote: Hi Henry, thanks! It’s great that you’ve seen (and liked) AsterixDB before. Even if your time is very limited we would be very happy to have you on board as a mentor. I’ll add you to the proposal. Cheers, Till On Jan 19, 2015, at 10:26 AM, Henry Saputra henry.sapu...@gmail.com wrote: +1 This is GREAT News! Was watching and trying AsterixDB last year and looked in awesome shape. I have my plate full but would love to help mentor this project to get it going to ASF if needed! - Henry On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I am pleased to bring forth the Apache AsterixDB proposal to the Apache Incubator as Champion, working in collaboration with the team. Please find the wiki proposal here: https://wiki.apache.org/incubator/AsterixDBProposal Full text of the proposal is below. Please discuss and enjoy. I’ll leave the discussion open for a week, and then look to call a VOTE hopefully end of next week if all is well. Cheers! Chris Mattmann = Apache AsterixDB Proposal Abstract Apache AsterixDB is a scalable big data management system (BDMS) that provides storage, management, and query capabilities for large collections of semi-structured data. Proposal AsterixDB is a big data management system (BDMS) that makes it well-suited to needs such as web data warehousing and social data storage and analysis. Feature-wise, AsterixDB has: * A NoSQL style data model (ADM) based on extending JSON with object database concepts. * An expressive and declarative query language (AQL) for querying semi-structured data. 
Re: [PROPOSAL] Apache AsterixDB Incubator
Looks like a real challenging project, and the proposal looks as if it has already been through a couple of refinement rounds. Count on my +1, when it comes to voting. rgds jan i On 19 January 2015 at 19:26, Henry Saputra henry.sapu...@gmail.com wrote: +1 This is GREAT News! Was watching and trying AsterixDB last year and looked in awesome shape. I have my plate full but would love to help mentor this project to get it going to ASF if needed! - Henry On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I am pleased to bring forth the Apache AsterixDB proposal to the Apache Incubator as Champion, working in collaboration with the team. Please find the wiki proposal here: https://wiki.apache.org/incubator/AsterixDBProposal Full text of the proposal is below. Please discuss and enjoy. I’ll leave the discussion open for a week, and then look to call a VOTE hopefully end of next week if all is well. Cheers! Chris Mattmann = Apache AsterixDB Proposal Abstract Apache AsterixDB is a scalable big data management system (BDMS) that provides storage, management, and query capabilities for large collections of semi-structured data. Proposal AsterixDB is a big data management system (BDMS) that makes it well-suited to needs such as web data warehousing and social data storage and analysis. Feature-wise, AsterixDB has: * A NoSQL style data model (ADM) based on extending JSON with object database concepts. * An expressive and declarative query language (AQL) for querying semi-structured data. * A runtime query execution engine, Hyracks, for partitioned-parallel execution of query plans. * Partitioned LSM-based data storage and indexing for efficient ingestion of newly arriving data. * Support for querying and indexing external data (e.g., in HDFS) as well as data stored within AsterixDB. * A rich set of primitive data types, including support for spatial, temporal, and textual data. 
Re: [PROPOSAL] Apache AsterixDB Incubator
+1 This is GREAT News! Was watching and trying AsterixDB last year and looked in awesome shape. I have my plate full but would love to help mentor this project to get it going to ASF if needed! - Henry On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I am pleased to bring forth the Apache AsterixDB proposal to the Apache Incubator as Champion, working in collaboration with the team. Please find the wiki proposal here: https://wiki.apache.org/incubator/AsterixDBProposal Full text of the proposal is below. Please discuss and enjoy. I’ll leave the discussion open for a week, and then look to call a VOTE hopefully end of next week if all is well. Cheers! Chris Mattmann = Apache AsterixDB Proposal Abstract Apache AsterixDB is a scalable big data management system (BDMS) that provides storage, management, and query capabilities for large collections of semi-structured data. Proposal AsterixDB is a big data management system (BDMS) that makes it well-suited to needs such as web data warehousing and social data storage and analysis. Feature-wise, AsterixDB has: * A NoSQL style data model (ADM) based on extending JSON with object database concepts. * An expressive and declarative query language (AQL) for querying semi-structured data. * A runtime query execution engine, Hyracks, for partitioned-parallel execution of query plans. * Partitioned LSM-based data storage and indexing for efficient ingestion of newly arriving data. * Support for querying and indexing external data (e.g., in HDFS) as well as data stored within AsterixDB. * A rich set of primitive data types, including support for spatial, temporal, and textual data. * Indexing options that include B+ trees, R trees, and inverted keyword index support. * Basic transactional (concurrency and recovery) capabilities akin to those of a NoSQL store. 
Re: [PROPOSAL] Apache AsterixDB Incubator
Hi, if you read the proposal all the way to the end you will see that - while we do have some community and code - we don’t have mentors. So if you like the proposal, please volunteer. Cheers, Till
[PROPOSAL] Apache AsterixDB Incubator
Hi Folks,

I am pleased to bring forth the Apache AsterixDB proposal to the Apache Incubator as Champion, working in collaboration with the team. Please find the wiki proposal here:

https://wiki.apache.org/incubator/AsterixDBProposal

Full text of the proposal is below. Please discuss and enjoy. I’ll leave the discussion open for a week, and then look to call a VOTE hopefully end of next week if all is well.

Cheers!
Chris Mattmann

Apache AsterixDB Proposal

Abstract

Apache AsterixDB is a scalable big data management system (BDMS) that provides storage, management, and query capabilities for large collections of semi-structured data.

Proposal

AsterixDB is a big data management system (BDMS) whose feature set makes it well-suited to needs such as web data warehousing and social data storage and analysis. Feature-wise, AsterixDB has:

 * A NoSQL style data model (ADM) based on extending JSON with object database concepts.
 * An expressive and declarative query language (AQL) for querying semi-structured data.
 * A runtime query execution engine, Hyracks, for partitioned-parallel execution of query plans.
 * Partitioned LSM-based data storage and indexing for efficient ingestion of newly arriving data.
 * Support for querying and indexing external data (e.g., in HDFS) as well as data stored within AsterixDB.
 * A rich set of primitive data types, including support for spatial, temporal, and textual data.
 * Indexing options that include B+ trees, R trees, and inverted keyword index support.
 * Basic transactional (concurrency and recovery) capabilities akin to those of a NoSQL store.

Background and Rationale

In the world of relational databases, the need to tackle data volumes that exceed the capabilities of a single server led to the development of “shared-nothing” parallel database systems several decades ago. These systems spread data over a cluster based on a partitioning strategy, such as hash partitioning, and queries are processed by employing partitioned-parallel divide-and-conquer techniques. Since these systems are fronted by a high-level, declarative language (SQL), their users are shielded from the complexities of parallel programming. Parallel database systems have been an extremely successful application of parallel computing, and quite a number of commercial products exist today.

In the distributed systems world, the Web brought a need to index and query its huge content. SQL and relational databases were not the answer, though shared-nothing clusters again emerged as the hardware platform of choice. Google developed the Google File System (GFS) and the MapReduce programming model to allow programmers to store and process Big Data by writing a few user-defined functions. The MapReduce framework applies these functions in parallel to data instances in distributed files (map) and to sorted groups of instances sharing a common key (reduce) -- not unlike the partitioned parallelism in parallel database systems. Apache's Hadoop MapReduce platform is the most prominent implementation of this paradigm for the rest of the Big Data community. On top of Hadoop and HDFS sit declarative languages like Pig and Hive that each compile down to Hadoop MapReduce jobs.

The big Web companies were also challenged by extreme user bases (hundreds of millions of users) and needed fast, simple lookups and updates to very large keyed data sets like user profiles. SQL databases were deemed either too expensive or not scalable, so the “NoSQL movement” was born. The ASF now has HBase and Cassandra, two popular key-value stores, in this space. MongoDB and Couchbase are other open source alternatives (document stores).

It is evident from the rapidly growing popularity of NoSQL stores, as well as the strong demand for Big Data analytics engines today, that there is a strong (and growing!) need to store, process, *and* query large volumes of semi-structured data in many application areas. Until very recently, developers have had to “choose” between using big data analytics engines like Apache Hive or Apache Spark, which can do complex query processing and analysis over HDFS-resident files, and flexible but low-function data stores like MongoDB or Apache HBase. (The Apache Phoenix project, http://phoenix.apache.org/, is a recent SQL-over-HBase effort that aims to bridge between these choices.)

AsterixDB is a highly scalable data management system that can store, index, and manage semi-structured data, e.g., much like MongoDB, but it also supports a full-power query language with the expressiveness of SQL (and more). Unlike analytics engines like Hive or Spark, it stores and manages data, so AsterixDB can exploit its knowledge of data partitioning and the availability of indexes to avoid always scanning data set(s) to process queries. Somewhat surprisingly, there is no open source parallel database system (relational or otherwise) available to developers today --