Re: git commit: First commit

Ted Dunning Mon, 03 Sep 2012 15:11:20 -0700

I am not sure why it isn't mirrored to git.a.o yet, but Jim's answer is
correct.


On Mon, Sep 3, 2012 at 5:49 PM, Jim Donofrio <[email protected]> wrote:

> use https://git-wip-us.apache.org/repos/asf/incubator-drill.git
>
> On 09/03/2012 05:22 PM, Michael Hausenblas wrote:
>
>> Ted,
>>
>>> First commit
>>
>> Cool ;)
>>
>> Tried to clone and got:
>>
>>   git clone git://git-wip-us.apache.org/repos/asf?p=incubator-drill.git repo
>>   Cloning into repo...
>>   git-wip-us.apache.org[0: 140.211.11.121]: errno=Operation timed out
>>   fatal: unable to connect a socket (Operation timed out)
>>
>> Also, it seems not to be listed on http://git.apache.org/ yet - could
>> that be the reason I am not able to clone it?
>>
>> Cheers,
>>            Michael
>>
>> --
>> Michael Hausenblas
>> Ireland, Europe
>> http://mhausenblas.info/
>>
>> On 3 Sep 2012, at 22:09, [email protected] wrote:
>>
>>> Updated Branches:
>>>   refs/heads/master [created] 9229caa45
>>>
>>>
>>> First commit
>>>
>>> Project: http://git-wip-us.apache.org/repos/asf/incubator-drill/repo
>>> Commit: http://git-wip-us.apache.org/repos/asf/incubator-drill/commit/9229caa4
>>> Tree: http://git-wip-us.apache.org/repos/asf/incubator-drill/tree/9229caa4
>>> Diff: http://git-wip-us.apache.org/repos/asf/incubator-drill/diff/9229caa4
>>>
>>> Branch: refs/heads/master
>>> Commit: 9229caa45a32dc06625f2443b6a5d84ab0a4df10
>>> Parents:
>>> Author: Ted Dunning <[email protected]>
>>> Authored: Mon Sep 3 13:21:32 2012 -0700
>>> Committer: Ted Dunning <[email protected]>
>>> Committed: Mon Sep 3 13:21:32 2012 -0700
>>>
>>> ----------------------------------------------------------------------
>>> README.md |  127 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 files changed, 127 insertions(+), 0 deletions(-)
>>> ----------------------------------------------------------------------
>>>
>>>
>>> http://git-wip-us.apache.org/repos/asf/incubator-drill/blob/9229caa4/README.md
>>> ----------------------------------------------------------------------
>>> diff --git a/README.md b/README.md
>>> new file mode 100644
>>> index 0000000..51772a9
>>> --- /dev/null
>>> +++ b/README.md
>>> @@ -0,0 +1,127 @@
>>> += Drill =
>>> +
>>> +This is a copy of the original proposal for Drill, for now.  Please
>>> edit and update as appropriate.
>>> +
>>> +== Abstract ==
>>> +Drill is a distributed system for interactive analysis of large-scale
>>> datasets, inspired by
>>> [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
>>> +
>>> +== Proposal ==
>>> +Drill is a distributed system for interactive analysis of large-scale
>>> datasets. Drill is similar to Google's Dremel, with the additional
>>> flexibility needed to support a broader range of query languages, data
>>> formats and data sources. It is designed to efficiently process nested
>>> data. It is a design goal to scale to 10,000 servers or more and to be able
>>> to process petabytes of data and trillions of records in seconds.
>>> +
>>> +== Background ==
>>> +Many organizations have the need to run data-intensive applications,
>>> including batch processing, stream processing and interactive analysis. In
>>> recent years open source systems have emerged to address the need for
>>> scalable batch processing (Apache Hadoop) and stream processing (Storm,
>>> Apache S4). In 2010 Google published a paper called "Dremel: Interactive
>>> Analysis of Web-Scale Datasets," describing a scalable system used
>>> internally for interactive analysis of nested data. No open source project
>>> has successfully replicated the capabilities of Dremel.
>>> +
>>> +== Rationale ==
>>> +There is a strong need in the market for low-latency interactive
>>> analysis of large-scale datasets, including nested data (eg, JSON, Avro,
>>> Protocol Buffers). This need was identified by Google and addressed
>>> internally with a system called Dremel.
>>> +
>>> +In recent years open source systems have emerged to address the need
>>> for scalable batch processing (Apache Hadoop) and stream processing (Storm,
>>> Apache S4). Apache Hadoop, originally inspired by Google's internal
>>> MapReduce system, is used by thousands of organizations processing
>>> large-scale datasets. Apache Hadoop is designed to achieve very high
>>> throughput, but is not designed to achieve the sub-second latency needed
>>> for interactive data analysis and exploration. Drill, inspired by Google's
>>> internal Dremel system, is intended to address this need.
>>> +
>>> +It is worth noting that, as explained by Google in the original paper,
>>> Dremel complements MapReduce-based computing. Dremel is not intended as a
>>> replacement for MapReduce and is often used in conjunction with it to
>>> analyze outputs of MapReduce pipelines or rapidly prototype larger
>>> computations. Indeed, Dremel and MapReduce are both used by thousands of
>>> Google employees.
>>> +
>>> +Like Dremel, Drill supports a nested data model with data encoded in a
>>> number of formats such as JSON, Avro or Protocol Buffers. In many
>>> organizations nested data is the standard, so supporting a nested data
>>> model eliminates the need to normalize the data. With that said, flat data
>>> formats, such as CSV files, are naturally supported as a special case of
>>> nested data.
>>> +
>>> +The Drill architecture consists of four key components/layers:
>>> + * Query languages: This layer is responsible for parsing the user's
>>> query and constructing an execution plan.  The initial goal is to support
>>> the SQL-like language used by Dremel and
>>> [[https://developers.google.com/bigquery/docs/query-reference|Google BigQuery]],
>>> which we call DrQL. However, Drill is designed to support other
>>> languages and programming models, such as the
>>> [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query Language]],
>>> [[http://www.cascading.org/|Cascading]] or
>>> [[https://github.com/tdunning/Plume|Plume]].
>>> + * Low-latency distributed execution engine: This layer is responsible
>>> for executing the physical plan. It provides the scalability and fault
>>> tolerance needed to efficiently query petabytes of data on 10,000 servers.
>>> Drill's execution engine is based on research in distributed execution
>>> engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar
>>> storage, and can be extended with additional operators and connectors.
>>> + * Nested data formats: This layer is responsible for supporting
>>> various data formats. The initial goal is to support the column-based
>>> format used by Dremel. Drill is designed to support schema-based formats
>>> such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and
>>> schema-less formats such as JSON, BSON or YAML. In addition, it is designed
>>> to support column-based formats such as Dremel, AVRO-806/Trevni and RCFile,
>>> and row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A
>>> particular distinction with Drill is that the execution engine is flexible
>>> enough to support column-based processing as well as row-based processing.
>>> This is important because column-based processing can be much more
>>> efficient when the data is stored in a column-based format, but many large
>>> data assets are stored in a row-based format that would require conversion
>>> before use.
>>> + * Scalable data sources: This layer is responsible for supporting
>>> various data sources. The initial focus is to leverage Hadoop as a data
>>> source.
>>> +
>>> +It is worth noting that no open source project has successfully
>>> replicated the capabilities of Dremel, nor have any taken on the broader
>>> goals of flexibility (eg, pluggable query languages, data formats, data
>>> sources and execution engine operators/connectors) that are part of Drill.
>>> +
>>> +== Initial Goals ==
>>> +The initial goals for this project are to specify the detailed
>>> requirements and architecture, and then develop the initial implementation
>>> including the execution engine and DrQL.
>>> +Like Apache Hadoop, which was built to support multiple storage systems
>>> (through the FileSystem API) and file formats (through the
>>> InputFormat/OutputFormat APIs), Drill will be built to support multiple
>>> query languages, data formats and data sources. The initial implementation
>>> of Drill will support DrQL and a column-based format similar to Dremel's.
>>> +
>>> +== Current Status ==
>>> +Significant work has been completed to identify the initial
>>> requirements and define the overall system architecture. The next step is
>>> to implement the four components described in the Rationale section, and we
>>> intend to do that development as an Apache project.
>>> +
>>> +=== Meritocracy ===
>>> +We plan to invest in supporting a meritocracy. We will discuss the
>>> requirements in an open forum. Several companies have already expressed
>>> interest in this project, and we intend to invite additional developers to
>>> participate. We will encourage and monitor community participation so that
>>> privileges can be extended to those that contribute. Also, Drill has an
>>> extensible/pluggable architecture that encourages developers to contribute
>>> various extensions, such as query languages, data formats, data sources and
>>> execution engine operators and connectors. While some companies will surely
>>> develop commercial extensions, we also anticipate that some companies and
>>> individuals will want to contribute such extensions back to the project,
>>> and we look forward to fostering a rich ecosystem of extensions.
>>> +
>>> +=== Community ===
>>> +The need for an open source system for interactive analysis of large
>>> datasets is tremendous, so there is potential for a very large
>>> community. We believe that Drill's extensible architecture will further
>>> encourage community participation. Also, related Apache projects (eg,
>>> Hadoop) have very large and active communities, and we expect that over
>>> time Drill will also attract a large community.
>>> +
>>> +=== Core Developers ===
>>> +The developers on the initial committers list include experienced
>>> distributed systems engineers:
>>> + * Tomer Shiran has experience developing distributed execution
>>> engines. He developed Parallel DataSeries, a data-parallel version of the
>>> open source [[http://tesla.hpl.hp.com/opensource/|DataSeries]]
>>> system. He is also the author of Applying Idealized Lower-bound Runtime
>>> Models to Understand Inefficiencies in Data-intensive Computing (SIGMETRICS
>>> 2011). Tomer worked as a software developer and researcher at IBM Research,
>>> Microsoft and HP Labs, and is now at MapR Technologies. He has been active
>>> in the Hadoop community since 2009.
>>> + * Jason Frantz was at Clustrix, where he designed and developed the
>>> first scale-out SQL database based on MySQL. Jason developed the
>>> distributed query optimizer that powered Clustrix. He is now a software
>>> engineer and architect at MapR Technologies.
>>> + * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout,
>>> and has a history of over 30 years of contributions to open source. He is
>>> now at MapR Technologies. Ted has been very active in the Hadoop community
>>> since the project's early days.
>>> + * MC Srivas is the co-founder and CTO of MapR Technologies. While at
>>> Google he worked on Google's scalable search infrastructure. MC Srivas has
>>> been active in the Hadoop community since 2009.
>>> + * Chris Wensel is the founder and CEO of Concurrent. Prior to founding
>>> Concurrent, he developed Cascading, an Apache-licensed open source
>>> application framework enabling Java developers to quickly and easily
>>> develop robust Data Analytics and Data Management applications on Apache
>>> Hadoop. Chris has been involved in the Hadoop community since the project's
>>> early days.
>>> + * Keys Botzum was at IBM, where he worked on security and distributed
>>> systems, and is currently at MapR Technologies.
>>> + * Gera Shegalov was at Oracle, where he worked on networking, storage
>>> and database kernels, and is currently at MapR Technologies.
>>> + * Ryan Rawson is the VP Engineering of Drawn to Scale where he
>>> developed Spire, a real-time operational database for Hadoop. He is also a
>>> committer and PMC member for Apache HBase, and has a long history of
>>> contributions to open source. Ryan has been involved in the Hadoop
>>> community since the project's early days.
>>> +
>>> +We realize that additional employer diversity is needed, and we will
>>> work aggressively to recruit developers from additional companies.
>>> +
>>> +=== Alignment ===
>>> +The initial committers strongly believe that a system for interactive
>>> analysis of large-scale datasets will gain broader adoption as an open
>>> source, community driven project, where the community can contribute not
>>> only to the core components, but also to a growing collection of query
>>> languages and optimizers, data formats, data sources, and execution engine
>>> operators and connectors. Drill will integrate closely with Apache Hadoop.
>>> First, the data will live in Hadoop. That is, Drill will support Hadoop
>>> FileSystem implementations and HBase. Second, Hadoop-related data formats
>>> will be supported (eg, Apache Avro, RCFile). Third, MapReduce-based tools
>>> will be provided to produce column-based formats. Fourth, Drill tables can
>>> be registered in HCatalog. Finally, Hive is being considered as the basis
>>> of the DrQL implementation.
>>> +
>>> +== Known Risks ==
>>> +
>>> +=== Orphaned Products ===
>>> +The contributors are leading vendors in this space, with significant
>>> open source experience, so the risk of being orphaned is relatively low.
>>> The project could be at risk if vendors decided to change their strategies
>>> in the market. In such an event, the current committers plan to continue
>>> working on the project on their own time, though the progress will likely
>>> be slower. We plan to mitigate this risk by recruiting additional
>>> committers.
>>> +
>>> +=== Inexperience with Open Source ===
>>> +The initial committers include veteran Apache members (committers and
>>> PMC members) and other developers who have varying degrees of experience
>>> with open source projects. All have been involved with source code that has
>>> been released under an open source license, and several also have
>>> experience developing code with an open source development process.
>>> +
>>> +=== Homogenous Developers ===
>>> +The initial committers are employed by a number of companies, including
>>> MapR Technologies, Concurrent and Drawn to Scale. We are committed to
>>> recruiting additional committers from other companies.
>>> +
>>> +=== Reliance on Salaried Developers ===
>>> +It is expected that Drill development will occur on both salaried time
>>> and on volunteer time, after hours. The majority of initial committers are
>>> paid by their employer to contribute to this project. However, they are all
>>> passionate about the project, and we are confident that the project will
>>> continue even if no salaried developers contribute to the project. We are
>>> committed to recruiting additional committers including non-salaried
>>> developers.
>>> +
>>> +=== Relationships with Other Apache Products ===
>>> +As mentioned in the Alignment section, Drill is closely integrated with
>>> Hadoop, Avro, Hive and HBase in numerous ways. For example, Drill data
>>> lives inside a Hadoop environment (Drill operates on in situ data). We look
>>> forward to collaborating with those communities, as well as other Apache
>>> communities.
>>> +
>>> +=== An Excessive Fascination with the Apache Brand ===
>>> +Drill solves a real problem that many organizations struggle with, and
>>> has been proven within Google to be of significant value. The architecture
>>> is based on academic and industry research. Our rationale for developing
>>> Drill as an Apache project is detailed in the Rationale section. We believe
>>> that the Apache brand and community process will help us attract more
>>> contributors to this project, and help establish ubiquitous APIs. In
>>> addition, establishing consensus among users and developers of a
>>> Dremel-like tool is a key requirement for success of the project.
>>> +
>>> +== Documentation ==
>>> +Drill is inspired by Google's Dremel. Google has published a
>>> [[http://research.google.com/pubs/pub36632.html|paper]]
>>> highlighting Dremel's innovative nested column-based data format and
>>> execution engine.
>>> +
>>> +== Initial Source ==
>>> +The requirement and design documents are currently stored in MapR
>>> Technologies' source code repository. They will be checked in as part of
>>> the initial code dump. Check out the [[attachment:Drill slides.pdf|attached
>>> slides]].
>>> +
>>> +== Cryptography ==
>>> +Drill will eventually support encryption on the wire. This is not one
>>> of the initial goals, and we do not expect Drill to be a controlled export
>>> item due to the use of encryption.
>>> +
>>> +== Required Resources ==
>>> +
>>> +=== Mailing List ===
>>> + * drill-private
>>> + * drill-dev
>>> + * drill-user
>>> +
>>> +=== Subversion Directory ===
>>> +Git is the preferred source control system: git://git.apache.org/drill
>>> +
>>> +=== Issue Tracking ===
>>> +JIRA Drill (DRILL)
>>> +
>>> +== Initial Committers ==
>>> + * Tomer Shiran <tshiran at maprtech dot com>
>>> + * Ted Dunning <tdunning at apache dot org>
>>> + * Jason Frantz <jfrantz at maprtech dot com>
>>> + * MC Srivas <mcsrivas at maprtech dot com>
>>> + * Chris Wensel <chris at concurrentinc dot com>
>>> + * Keys Botzum <kbotzum at maprtech dot com>
>>> + * Gera Shegalov <gshegalov at maprtech dot com>
>>> + * Ryan Rawson <ryan at drawntoscale dot com>
>>> +
>>> +== Affiliations ==
>>> +The initial committers are employees of MapR Technologies, Drawn to
>>> Scale and Concurrent. The nominated mentors are employees of MapR
>>> Technologies, Lucid Imagination and Nokia.
>>> +
>>> +== Sponsors ==
>>> +
>>> +=== Champion ===
>>> +Ted Dunning (tdunning at apache dot org)
>>> +
>>> +=== Nominated Mentors ===
>>> + * Ted Dunning <tdunning at apache dot org> – Chief Application
>>> Architect at MapR Technologies, Committer for Lucene, Mahout and ZooKeeper.
>>> + * Grant Ingersoll <grant at lucidimagination dot com> – Chief
>>> Scientist at Lucid Imagination, Committer for Lucene, Mahout and other
>>> projects.
>>> + * Isabel Drost <isabel at apache dot org> – Software Developer at
>>> Nokia Gate 5 GmbH, Committer for Lucene, Mahout and other projects.
>>> +
>>> +=== Sponsoring Entity ===
>>> +Incubator
>>> +
>>>
>>>
>>
>
