Yes. This was just a test to see if git was working. On Mon, Sep 3, 2012 at 6:41 PM, 周大 <[email protected]> wrote:
> Is there only README? > > > Da Zhou (周大) > ------------------------------------------------ > China Mobile Research Institute > No.32 Xuanwumen West Street, > XiCheng District > Beijing 100053, P.R. China > Mobile: +86 13910309724 > Office: +86 15801696688 - 35989 > ------------------------------------------------- > > -----邮件原件----- > 发件人: Michael Hausenblas [mailto:[email protected]] > 发送时间: 2012年9月4日 5:23 > 收件人: [email protected] > 主题: Re: git commit: First commit > > Ted, > > > First commit > > Cool ;) > > Tried to clone and got: > > git clone git://git-wip-us.apache.org/repos/asf?p=incubator-drill.gitrepo > Cloning into repo... > git-wip-us.apache.org[0: 140.211.11.121]: errno=Operation timed out > fatal: unable to connect a socket (Operation timed out) > > Also, it seems to not been listed on http://git.apache.org/ yet - could > that > be the reason for me not being able to clone it? > > Cheers, > Michael > > -- > Michael Hausenblas > Ireland, Europe > http://mhausenblas.info/ > > On 3 Sep 2012, at 22:09, [email protected] wrote: > > > Updated Branches: > > refs/heads/master [created] 9229caa45 > > > > > > First commit > > > > Project: http://git-wip-us.apache.org/repos/asf/incubator-drill/repo > > Commit: > http://git-wip-us.apache.org/repos/asf/incubator-drill/commit/9229caa4 > > Tree: > http://git-wip-us.apache.org/repos/asf/incubator-drill/tree/9229caa4 > > Diff: > http://git-wip-us.apache.org/repos/asf/incubator-drill/diff/9229caa4 > > > > Branch: refs/heads/master > > Commit: 9229caa45a32dc06625f2443b6a5d84ab0a4df10 > > Parents: > > Author: Ted Dunning <[email protected]> > > Authored: Mon Sep 3 13:21:32 2012 -0700 > > Committer: Ted Dunning <[email protected]> > > Committed: Mon Sep 3 13:21:32 2012 -0700 > > > > ---------------------------------------------------------------------- > > README.md | 127 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > 1 files changed, 127 insertions(+), 0 deletions(-) > > ---------------------------------------------------------------------- > > > > > > > http://git-wip-us.apache.org/repos/asf/incubator-drill/blob/9229caa4/README > . > md > > ---------------------------------------------------------------------- > > diff --git a/README.md b/README.md > > new file mode 100644 > > index 0000000..51772a9 > > --- /dev/null > > +++ b/README.md > > @@ -0,0 +1,127 @@ > > += Drill = > > + > > +This is a copy of the original proposal for Drill, for now. Please edit > and update as appropriate. > > + > > +== Abstract == > > +Drill is a distributed system for interactive analysis of large-scale > datasets, inspired by > [[http://research.google.com/pubs/pub36632.html|Google's Dremel]]. > > + > > +== Proposal == > > +Drill is a distributed system for interactive analysis of large-scale > datasets. Drill is similar to Google's Dremel, with the additional > flexibility needed to support a broader range of query languages, data > formats and data sources. It is designed to efficiently process nested > data. > It is a design goal to scale to 10,000 servers or more and to be able to > process petabyes of data and trillions of records in seconds. > > + > > +== Background == > > +Many organizations have the need to run data-intensive applications, > including batch processing, stream processing and interactive analysis. In > recent years open source systems have emerged to address the need for > scalable batch processing (Apache Hadoop) and stream processing (Storm, > Apache S4). In 2010 Google published a paper called "Dremel: Interactive > Analysis of Web-Scale Datasets," describing a scalable system used > internally for interactive analysis of nested data. No open source project > has successfully replicated the capabilities of Dremel. > > + > > +== Rationale == > > +There is a strong need in the market for low-latency interactive > analysis > of large-scale datasets, including nested data (eg, JSON, Avro, Protocol > Buffers). This need was identified by Google and addressed internally with > a > system called Dremel. > > + > > +In recent years open source systems have emerged to address the need for > scalable batch processing (Apache Hadoop) and stream processing (Storm, > Apache S4). Apache Hadoop, originally inspired by Google's internal > MapReduce system, is used by thousands of organizations processing > large-scale datasets. Apache Hadoop is designed to achieve very high > throughput, but is not designed to achieve the sub-second latency needed > for > interactive data analysis and exploration. Drill, inspired by Google's > internal Dremel system, is intended to address this need. > > + > > +It is worth noting that, as explained by Google in the original paper, > Dremel complements MapReduce-based computing. Dremel is not intended as a > replacement for MapReduce and is often used in conjunction with it to > analyze outputs of MapReduce pipelines or rapidly prototype larger > computations. Indeed, Dremel and MapReduce are both used by thousands of > Google employees. > > + > > +Like Dremel, Drill supports a nested data model with data encoded in a > number of formats such as JSON, Avro or Protocol Buffers. In many > organizations nested data is the standard, so supporting a nested data > model > eliminates the need to normalize the data. With that said, flat data > formats, such as CSV files, are naturally supported as a special case of > nested data. > > + > > +The Drill architecture consists of four key components/layers: > > + * Query languages: This layer is responsible for parsing the user's > query and constructing an execution plan. The initial goal is to support > the SQL-like language used by Dremel and > [[https://developers.google.com/bigquery/docs/query-reference|Google > BigQuery]], which we call DrQL. However, Drill is designed to support other > languages and programming models, such as the > [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query > Language]], [[http://www.cascading.org/|Cascading]] or > [[https://github.com/tdunning/Plume|Plume]]. > > + * Low-latency distributed execution engine: This layer is responsible > for executing the physical plan. It provides the scalability and fault > tolerance needed to efficiently query petabytes of data on 10,000 servers. > Drill's execution engine is based on research in distributed execution > engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar > storage, and can be extended with additional operators and connectors. > > + * Nested data formats: This layer is responsible for supporting various > data formats. The initial goal is to support the column-based format used > by > Dremel. Drill is designed to support schema-based formats such as Protocol > Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less formats such > as JSON, BSON or YAML. In addition, it is designed to support column-based > formats such as Dremel, AVRO-806/Trevni and RCFile, and row-based formats > such as Protocol Buffers, Avro, JSON, BSON and CSV. A particular > distinction > with Drill is that the execution engine is flexible enough to support > column-based processing as well as row-based processing. This is important > because column-based processing can be much more efficient when the data is > stored in a column-based format, but many large data assets are stored in a > row-based format that would require conversion before use. > > + * Scalable data sources: This layer is responsible for supporting > various data sources. The initial focus is to leverage Hadoop as a data > source. > > + > > +It is worth noting that no open source project has successfully > replicated the capabilities of Dremel, nor have any taken on the broader > goals of flexibility (eg, pluggable query languages, data formats, data > sources and execution engine operators/connectors) that are part of Drill. > > + > > +== Initial Goals == > > +The initial goals for this project are to specify the detailed > requirements and architecture, and then develop the initial implementation > including the execution engine and DrQL. > > +Like Apache Hadoop, which was built to support multiple storage systems > (through the FileSystem API) and file formats (through the > InputFormat/OutputFormat APIs), Drill will be built to support multiple > query languages, data formats and data sources. The initial implementation > of Drill will support the DrQL and a column-based format similar to Dremel. > > + > > +== Current Status == > > +Significant work has been completed to identify the initial requirements > and define the overall system architecture. The next step is to implement > the four components described in the Rationale section, and we intend to do > that development as an Apache project. > > + > > +=== Meritocracy === > > +We plan to invest in supporting a meritocracy. We will discuss the > requirements in an open forum. Several companies have already expressed > interest in this project, and we intend to invite additional developers to > participate. We will encourage and monitor community participation so that > privileges can be extended to those that contribute. Also, Drill has an > extensible/pluggable architecture that encourages developers to contribute > various extensions, such as query languages, data formats, data sources and > execution engine operators and connectors. While some companies will surely > develop commercial extensions, we also anticipate that some companies and > individuals will want to contribute such extensions back to the project, > and > we look forward to fostering a rich ecosystem of extensions. > > + > > +=== Community === > > +The need for a system for interactive analysis of large datasets in the > open source is tremendous, so there is a potential for a very large > community. We believe that Drill's extensible architecture will further > encourage community participation. Also, related Apache projects (eg, > Hadoop) have very large and active communities, and we expect that over > time > Drill will also attract a large community. > > + > > +=== Core Developers === > > +The developers on the initial committers list include experienced > distributed systems engineers: > > + * Tomer Shiran has experience developing distributed execution engines. > He developed Parallel DataSeries, a data-parallel version of the open > source > [[http://tesla.hpl.hp.com/opensource/|DataSeries]] system. He is also the > author of Applying Idealized Lower-bound Runtime Models to Understand > Inefficiencies in Data-intensive Computing (SIGMETRICS 2011). Tomer worked > as a software developer and researcher at IBM Research, Microsoft and HP > Labs, and is now at MapR Technologies. He has been active in the Hadoop > community since 2009. > > + * Jason Frantz was at Clustrix, where he designed and developed the > first scale-out SQL database based on MySQL. Jason developed the > distributed > query optimizer that powered Clustrix. He is now a software engineer and > architect at MapR Technologies. > > + * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout, > and has a history of over 30 years of contributions to open source. He is > now at MapR Technologies. Ted has been very active in the Hadoop community > since the project's early days. > > + * MC Srivas is the co-founder and CTO of MapR Technologies. While at > Google he worked on Google's scalable search infrastructure. MC Srivas has > been active in the Hadoop community since 2009. > > + * Chris Wensel is the founder and CEO of Concurrent. Prior to founding > Concurrent, he developed Cascading, an Apache-licensed open source > application framework enabling Java developers to quickly and easily > develop > robust Data Analytics and Data Management applications on Apache Hadoop. > Chris has been involved in the Hadoop community since the project's early > days. > > + * Keys Botzum was at IBM, where he worked on security and distributed > systems, and is currently at MapR Technologies. > > + * Gera Shegalov was at Oracle, where he worked on networking, storage > and database kernels, and is currently at MapR Technologies. > > + * Ryan Rawson is the VP Engineering of Drawn to Scale where he > developed > Spire, a real-time operational database for Hadoop. He is also a committer > and PMC member for Apache HBase, and has a long history of contributions to > open source. Ryan has been involved in the Hadoop community since the > project's early days. > > + > > +We realize that additional employer diversity is needed, and we will > work > aggressively to recruit developers from additional companies. > > + > > +=== Alignment === > > +The initial committers strongly believe that a system for interactive > analysis of large-scale datasets will gain broader adoption as an open > source, community driven project, where the community can contribute not > only to the core components, but also to a growing collection of query > languages and optimizers, data formats, data formats, and execution engine > operators and connectors. Drill will integrate closely with Apache Hadoop. > First, the data will live in Hadoop. That is, Drill will support Hadoop > FileSystem implementations and HBase. Second, Hadoop-related data formats > will be supported (eg, Apache Avro, RCFile). Third, MapReduce-based tools > will be provided to produce column-based formats. Fourth, Drill tables can > be registered in HCatalog. Finally, Hive is being considered as the basis > of > the DrQL implementation. > > + > > +== Known Risks == > > + > > +=== Orphaned Products === > > +The contributors are leading vendors in this space, with significant > open > source experience, so the risk of being orphaned is relatively low. The > project could be at risk if vendors decided to change their strategies in > the market. In such an event, the current committers plan to continue > working on the project on their own time, though the progress will likely > be > slower. We plan to mitigate this risk by recruiting additional committers. > > + > > +=== Inexperience with Open Source === > > +The initial committers include veteran Apache members (committers and > PMC > members) and other developers who have varying degrees of experience with > open source projects. All have been involved with source code that has been > released under an open source license, and several also have experience > developing code with an open source development process. > > + > > +=== Homogenous Developers === > > +The initial committers are employed by a number of companies, including > MapR Technologies, Concurrent and Drawn to Scale. We are committed to > recruiting additional committers from other companies. > > + > > +=== Reliance on Salaried Developers === > > +It is expected that Drill development will occur on both salaried time > and on volunteer time, after hours. The majority of initial committers are > paid by their employer to contribute to this project. However, they are all > passionate about the project, and we are confident that the project will > continue even if no salaried developers contribute to the project. We are > committed to recruiting additional committers including non-salaried > developers. > > + > > +=== Relationships with Other Apache Products === > > +As mentioned in the Alignment section, Drill is closely integrated with > Hadoop, Avro, Hive and HBase in a numerous ways. For example, Drill data > lives inside a Hadoop environment (Drill operates on in situ data). We look > forward to collaborating with those communities, as well as other Apache > communities. > > + > > +=== An Excessive Fascination with the Apache Brand === > > +Drill solves a real problem that many organizations struggle with, and > has been proven within Google to be of significant value. The architecture > is based on academic and industry research. Our rationale for developing > Drill as an Apache project is detailed in the Rationale section. We believe > that the Apache brand and community process will help us attract more > contributors to this project, and help establish ubiquitous APIs. In > addition, establishing consensus among users and developers of a > Dremel-like > tool is a key requirement for success of the project. > > + > > +== Documentation == > > +Drill is inspired by Google's Dremel. Google has published a > [[http://research.google.com/pubs/pub36632.html|paper]] highlighting > Dremel's innovative nested column-based data format and execution engine. > > + > > +== Initial Source == > > +The requirement and design documents are currently stored in MapR > Technologies' source code repository. They will be checked in as part of > the > initial code dump. Check out the [[attachment:Drill slides.pdf|attached > slides]]. > > + > > +== Cryptography == > > +Drill will eventually support encryption on the wire. This is not one of > the initial goals, and we do not expect Drill to be a controlled export > item > due to the use of encryption. > > + > > +== Required Resources == > > + > > +=== Mailing List === > > + * drill-private > > + * drill-dev > > + * drill-user > > + > > +=== Subversion Directory === > > +Git is the preferred source control system: git://git.apache.org/drill > > + > > +=== Issue Tracking === > > +JIRA Drill (DRILL) > > + > > +== Initial Committers == > > + * Tomer Shiran <tshiran at maprtech dot com> > > + * Ted Dunning <tdunning at apache dot org> > > + * Jason Frantz <jfrantz at maprtech dot com> > > + * MC Srivas <mcsrivas at maprtech dot com> > > + * Chris Wensel <chris and concurrentinc dot com> > > + * Keys Botzum <kbotzum at maprtech dot com> > > + * Gera Shegalov <gshegalov at maprtech dot com> > > + * Ryan Rawson <ryan at drawntoscale dot com> > > + > > +== Affiliations == > > +The initial committers are employees of MapR Technologies, Drawn to > Scale > and Concurrent. The nominated mentors are employees of MapR Technologies, > Lucid Imagination and Nokia. > > + > > +== Sponsors == > > + > > +=== Champion === > > +Ted Dunning (tdunning at apache dot org) > > + > > +=== Nominated Mentors === > > + * Ted Dunning <tdunning at apache dot org> – Chief Application > Architect at MapR Technologies, Committer for Lucene, Mahout and ZooKeeper. > > + * Grant Ingersoll <grant at lucidimagination dot com> – Chief Scientist > at Lucid Imagination, Committer for Lucene, Mahout and other projects. > > + * Isabel Drost <isabel at apache dot org> – Software Developer at Nokia > Gate 5 GmbH, Committer for Lucene, Mahout and other projects. > > + > > +=== Sponsoring Entity === > > +Incubator > > + > > > >
