I am not sure why it isn't mirrored to git.a.o yet, but Jim's answer is correct.
On Mon, Sep 3, 2012 at 5:49 PM, Jim Donofrio <[email protected]> wrote:

> use
> https://git-wip-us.apache.org/repos/asf/incubator-drill.git
>
> On 09/03/2012 05:22 PM, Michael Hausenblas wrote:
>
>> Ted,
>>
>>> First commit
>>
>> Cool ;)
>>
>> Tried to clone and got:
>>
>> git clone git://git-wip-us.apache.org/repos/asf?p=incubator-drill.git repo
>> Cloning into repo...
>> git-wip-us.apache.org[0: 140.211.11.121]: errno=Operation timed out
>> fatal: unable to connect a socket (Operation timed out)
>>
>> Also, it seems to not been listed on http://git.apache.org/ yet - could
>> that be the reason for me not being able to clone it?
>>
>> Cheers,
>> Michael
>>
>> --
>> Michael Hausenblas
>> Ireland, Europe
>> http://mhausenblas.info/
>>
>> On 3 Sep 2012, at 22:09, [email protected] wrote:
>>
>>> Updated Branches:
>>> refs/heads/master [created] 9229caa45
>>>
>>>
>>> First commit
>>>
>>> Project: http://git-wip-us.apache.org/repos/asf/incubator-drill/repo
>>> Commit: http://git-wip-us.apache.org/repos/asf/incubator-drill/commit/9229caa4
>>> Tree: http://git-wip-us.apache.org/repos/asf/incubator-drill/tree/9229caa4
>>> Diff: http://git-wip-us.apache.org/repos/asf/incubator-drill/diff/9229caa4
>>>
>>> Branch: refs/heads/master
>>> Commit: 9229caa45a32dc06625f2443b6a5d84ab0a4df10
>>> Parents:
>>> Author: Ted Dunning <[email protected]>
>>> Authored: Mon Sep 3 13:21:32 2012 -0700
>>> Committer: Ted Dunning <[email protected]>
>>> Committed: Mon Sep 3 13:21:32 2012 -0700
>>>
>>> ----------------------------------------------------------------------
>>> README.md | 127 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 files changed, 127 insertions(+), 0 deletions(-)
>>> ----------------------------------------------------------------------
>>>
>>>
>>> http://git-wip-us.apache.org/repos/asf/incubator-drill/blob/9229caa4/README.md
>>> ----------------------------------------------------------------------
>>> diff --git a/README.md b/README.md
>>> new file mode 100644
>>> index 0000000..51772a9
>>> --- /dev/null
>>> +++ b/README.md
>>> @@ -0,0 +1,127 @@
>>> += Drill =
>>> +
>>> +This is a copy of the original proposal for Drill, for now. Please edit and update as appropriate.
>>> +
>>> +== Abstract ==
>>> +Drill is a distributed system for interactive analysis of large-scale datasets, inspired by [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
>>> +
>>> +== Proposal ==
>>> +Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds.
>>> +
>>> +== Background ==
>>> +Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called "Dremel: Interactive Analysis of Web-Scale Datasets," describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel.
>>> +
>>> +== Rationale ==
>>> +There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel.
>>> +
>>> +In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need.
>>> +
>>> +It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees.
>>> +
>>> +Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat data formats, such as CSV files, are naturally supported as a special case of nested data.
>>> +
>>> +The Drill architecture consists of four key components/layers:
>>> + * Query languages: This layer is responsible for parsing the user's query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel and [[https://developers.google.com/bigquery/docs/query-reference|Google BigQuery]], which we call DrQL. However, Drill is designed to support other languages and programming models, such as the [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query Language]], [[http://www.cascading.org/|Cascading]] or [[https://github.com/tdunning/Plume|Plume]].
>>> + * Low-latency distributed execution engine: This layer is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill's execution engine is based on research in distributed execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and can be extended with additional operators and connectors.
>>> + * Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less formats such as JSON, BSON or YAML. In addition, it is designed to support column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A particular distinction with Drill is that the execution engine is flexible enough to support column-based processing as well as row-based processing. This is important because column-based processing can be much more efficient when the data is stored in a column-based format, but many large data assets are stored in a row-based format that would require conversion before use.
>>> + * Scalable data sources: This layer is responsible for supporting various data sources. The initial focus is to leverage Hadoop as a data source.
>>> +
>>> +It is worth noting that no open source project has successfully replicated the capabilities of Dremel, nor have any taken on the broader goals of flexibility (eg, pluggable query languages, data formats, data sources and execution engine operators/connectors) that are part of Drill.
>>> +
>>> +== Initial Goals ==
>>> +The initial goals for this project are to specify the detailed requirements and architecture, and then develop the initial implementation including the execution engine and DrQL.
>>> +Like Apache Hadoop, which was built to support multiple storage systems (through the FileSystem API) and file formats (through the InputFormat/OutputFormat APIs), Drill will be built to support multiple query languages, data formats and data sources. The initial implementation of Drill will support the DrQL and a column-based format similar to Dremel.
>>> +
>>> +== Current Status ==
>>> +Significant work has been completed to identify the initial requirements and define the overall system architecture. The next step is to implement the four components described in the Rationale section, and we intend to do that development as an Apache project.
>>> +
>>> +=== Meritocracy ===
>>> +We plan to invest in supporting a meritocracy. We will discuss the requirements in an open forum. Several companies have already expressed interest in this project, and we intend to invite additional developers to participate. We will encourage and monitor community participation so that privileges can be extended to those that contribute. Also, Drill has an extensible/pluggable architecture that encourages developers to contribute various extensions, such as query languages, data formats, data sources and execution engine operators and connectors. While some companies will surely develop commercial extensions, we also anticipate that some companies and individuals will want to contribute such extensions back to the project, and we look forward to fostering a rich ecosystem of extensions.
>>> +
>>> +=== Community ===
>>> +The need for a system for interactive analysis of large datasets in the open source is tremendous, so there is a potential for a very large community. We believe that Drill's extensible architecture will further encourage community participation. Also, related Apache projects (eg, Hadoop) have very large and active communities, and we expect that over time Drill will also attract a large community.
>>> +
>>> +=== Core Developers ===
>>> +The developers on the initial committers list include experienced distributed systems engineers:
>>> + * Tomer Shiran has experience developing distributed execution engines. He developed Parallel DataSeries, a data-parallel version of the open source [[http://tesla.hpl.hp.com/opensource/|DataSeries]] system. He is also the author of Applying Idealized Lower-bound Runtime Models to Understand Inefficiencies in Data-intensive Computing (SIGMETRICS 2011). Tomer worked as a software developer and researcher at IBM Research, Microsoft and HP Labs, and is now at MapR Technologies. He has been active in the Hadoop community since 2009.
>>> + * Jason Frantz was at Clustrix, where he designed and developed the first scale-out SQL database based on MySQL. Jason developed the distributed query optimizer that powered Clustrix. He is now a software engineer and architect at MapR Technologies.
>>> + * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout, and has a history of over 30 years of contributions to open source. He is now at MapR Technologies. Ted has been very active in the Hadoop community since the project's early days.
>>> + * MC Srivas is the co-founder and CTO of MapR Technologies. While at Google he worked on Google's scalable search infrastructure. MC Srivas has been active in the Hadoop community since 2009.
>>> + * Chris Wensel is the founder and CEO of Concurrent. Prior to founding Concurrent, he developed Cascading, an Apache-licensed open source application framework enabling Java developers to quickly and easily develop robust Data Analytics and Data Management applications on Apache Hadoop. Chris has been involved in the Hadoop community since the project's early days.
>>> + * Keys Botzum was at IBM, where he worked on security and distributed systems, and is currently at MapR Technologies.
>>> + * Gera Shegalov was at Oracle, where he worked on networking, storage and database kernels, and is currently at MapR Technologies.
>>> + * Ryan Rawson is the VP Engineering of Drawn to Scale where he developed Spire, a real-time operational database for Hadoop. He is also a committer and PMC member for Apache HBase, and has a long history of contributions to open source. Ryan has been involved in the Hadoop community since the project's early days.
>>> +
>>> +We realize that additional employer diversity is needed, and we will work aggressively to recruit developers from additional companies.
>>> +
>>> +=== Alignment ===
>>> +The initial committers strongly believe that a system for interactive analysis of large-scale datasets will gain broader adoption as an open source, community driven project, where the community can contribute not only to the core components, but also to a growing collection of query languages and optimizers, data formats, data formats, and execution engine operators and connectors. Drill will integrate closely with Apache Hadoop. First, the data will live in Hadoop. That is, Drill will support Hadoop FileSystem implementations and HBase. Second, Hadoop-related data formats will be supported (eg, Apache Avro, RCFile). Third, MapReduce-based tools will be provided to produce column-based formats. Fourth, Drill tables can be registered in HCatalog. Finally, Hive is being considered as the basis of the DrQL implementation.
>>> +
>>> +== Known Risks ==
>>> +
>>> +=== Orphaned Products ===
>>> +The contributors are leading vendors in this space, with significant open source experience, so the risk of being orphaned is relatively low. The project could be at risk if vendors decided to change their strategies in the market. In such an event, the current committers plan to continue working on the project on their own time, though the progress will likely be slower. We plan to mitigate this risk by recruiting additional committers.
>>> +
>>> +=== Inexperience with Open Source ===
>>> +The initial committers include veteran Apache members (committers and PMC members) and other developers who have varying degrees of experience with open source projects. All have been involved with source code that has been released under an open source license, and several also have experience developing code with an open source development process.
>>> +
>>> +=== Homogenous Developers ===
>>> +The initial committers are employed by a number of companies, including MapR Technologies, Concurrent and Drawn to Scale. We are committed to recruiting additional committers from other companies.
>>> +
>>> +=== Reliance on Salaried Developers ===
>>> +It is expected that Drill development will occur on both salaried time and on volunteer time, after hours. The majority of initial committers are paid by their employer to contribute to this project. However, they are all passionate about the project, and we are confident that the project will continue even if no salaried developers contribute to the project. We are committed to recruiting additional committers including non-salaried developers.
>>> +
>>> +=== Relationships with Other Apache Products ===
>>> +As mentioned in the Alignment section, Drill is closely integrated with Hadoop, Avro, Hive and HBase in a numerous ways. For example, Drill data lives inside a Hadoop environment (Drill operates on in situ data). We look forward to collaborating with those communities, as well as other Apache communities.
>>> +
>>> +=== An Excessive Fascination with the Apache Brand ===
>>> +Drill solves a real problem that many organizations struggle with, and has been proven within Google to be of significant value. The architecture is based on academic and industry research. Our rationale for developing Drill as an Apache project is detailed in the Rationale section. We believe that the Apache brand and community process will help us attract more contributors to this project, and help establish ubiquitous APIs. In addition, establishing consensus among users and developers of a Dremel-like tool is a key requirement for success of the project.
>>> +
>>> +== Documentation ==
>>> +Drill is inspired by Google's Dremel. Google has published a [[http://research.google.com/pubs/pub36632.html|paper]] highlighting Dremel's innovative nested column-based data format and execution engine.
>>> +
>>> +== Initial Source ==
>>> +The requirement and design documents are currently stored in MapR Technologies' source code repository. They will be checked in as part of the initial code dump. Check out the [[attachment:Drill slides.pdf|attached slides]].
>>> +
>>> +== Cryptography ==
>>> +Drill will eventually support encryption on the wire. This is not one of the initial goals, and we do not expect Drill to be a controlled export item due to the use of encryption.
>>> +
>>> +== Required Resources ==
>>> +
>>> +=== Mailing List ===
>>> + * drill-private
>>> + * drill-dev
>>> + * drill-user
>>> +
>>> +=== Subversion Directory ===
>>> +Git is the preferred source control system: git://git.apache.org/drill
>>> +
>>> +=== Issue Tracking ===
>>> +JIRA Drill (DRILL)
>>> +
>>> +== Initial Committers ==
>>> + * Tomer Shiran <tshiran at maprtech dot com>
>>> + * Ted Dunning <tdunning at apache dot org>
>>> + * Jason Frantz <jfrantz at maprtech dot com>
>>> + * MC Srivas <mcsrivas at maprtech dot com>
>>> + * Chris Wensel <chris and concurrentinc dot com>
>>> + * Keys Botzum <kbotzum at maprtech dot com>
>>> + * Gera Shegalov <gshegalov at maprtech dot com>
>>> + * Ryan Rawson <ryan at drawntoscale dot com>
>>> +
>>> +== Affiliations ==
>>> +The initial committers are employees of MapR Technologies, Drawn to Scale and Concurrent. The nominated mentors are employees of MapR Technologies, Lucid Imagination and Nokia.
>>> +
>>> +== Sponsors ==
>>> +
>>> +=== Champion ===
>>> +Ted Dunning (tdunning at apache dot org)
>>> +
>>> +=== Nominated Mentors ===
>>> + * Ted Dunning <tdunning at apache dot org> – Chief Application Architect at MapR Technologies, Committer for Lucene, Mahout and ZooKeeper.
>>> + * Grant Ingersoll <grant at lucidimagination dot com> – Chief Scientist at Lucid Imagination, Committer for Lucene, Mahout and other projects.
>>> + * Isabel Drost <isabel at apache dot org> – Software Developer at Nokia Gate 5 GmbH, Committer for Lucene, Mahout and other projects.
>>> +
>>> +=== Sponsoring Entity ===
>>> +Incubator
>>> +
