Re: [VOTE] Accept Zeppelin into the Apache Incubator

2014-12-20 Thread Ate Douma

+1 (binding)

On 2014-12-19 06:29, Roman Shaposhnik wrote:

Following the discussion earlier:
 http://s.apache.org/kTp

I would like to call a VOTE for accepting
Zeppelin as a new Incubator project.

The proposal is available at:
 https://wiki.apache.org/incubator/ZeppelinProposal
and is also attached to the end of this email.

Vote is open until at least Sunday, 21st December 2014,
23:59:00 PST

[ ] +1 Accept Zeppelin into the Incubator
[ ] ±0 Indifferent to the acceptance of Zeppelin
[ ] -1 Do not accept Zeppelin because ...

Thanks,
Roman.

== Abstract ==
Zeppelin is a collaborative data analytics and visualization tool for
distributed, general-purpose data processing systems such as Apache
Spark, Apache Flink, etc.

== Proposal ==
Zeppelin is a modern web-based tool for data scientists to
collaborate on large-scale data exploration and visualization
projects. It is a notebook-style interpreter that enables users to
share collaborative analysis sessions. Zeppelin is independent of
the execution framework itself. The current version runs on top of
Apache Spark, but it has pluggable interpreter APIs to support other
data processing systems. More execution frameworks could be added at
a later date, e.g. Apache Flink and Crunch, as well as SQL-like
backends such as Hive, Tajo, and MRQL.

We have a strong preference for the project to be called Zeppelin. In
case that may not be feasible, alternative names could be: “Mir”,
“Yuga” or “Sora”.

== Background ==
A large-scale data analysis workflow includes multiple steps such as
data acquisition, pre-processing, visualization, etc., and may involve
the interoperation of multiple different tools and technologies. With
the widespread adoption of open source general-purpose data processing
systems like Spark, there is a lack of open source, modern,
user-friendly tools that combine the strengths of an interpreted
language for data analysis with new in-browser visualization libraries
and collaborative capabilities.

Zeppelin initially started as a GUI tool for a diverse set of
SQL-over-Hadoop systems like Hive, Presto, Shark, etc. It has been open
source since its inception in Sep 2013. Later, it became clear that
there was a need for a broader web-based tool for data scientists to
collaborate on data exploration over large-scale projects, not
limited to SQL. So Zeppelin integrated full support of Apache Spark
while adding a collaborative environment with the ability to run and
share interpreter sessions in-browser.

== Rationale ==
There are no open source alternatives for a collaborative
notebook-based interpreter with support of multiple distributed data
processing systems.

As the number of companies adopting and contributing back to Zeppelin
is growing, we think that having a long-term home at the Apache
Software Foundation would be a great fit for the project, ensuring that
processes and procedures are in place to keep the project and community
“healthy” and free of any commercial, political or legal faults.

== Initial Goals ==
The initial goals will be to move the existing codebase to Apache and
integrate with the Apache development process. This includes moving
all infrastructure that we currently maintain, such as a website, a
mailing list, an issue tracker and a Jenkins CI, as mentioned in the
“Required Resources” section of the current proposal.
Once this is accomplished, we plan for incremental development and
releases that follow the Apache guidelines.
To increase adoption, the major goal for the project would be to
provide integration with as many projects from the Apache data
ecosystem as possible, including new interpreters for Apache Hive and
Apache Drill, and adding a Zeppelin distribution to Apache Bigtop.
On the community building side, the main goal is to attract a diverse
set of contributors by promoting Zeppelin to a wide variety of
engineers, starting Zeppelin user groups around the globe and
engaging with other existing Apache project communities online.


== Current Status ==
Currently, Zeppelin has 4 released versions and is used in production
at a number of companies across the globe, mentioned in the Affiliation
section. The current implementation status is pre-release, with the
public API not yet finalized. The current main and default backend
processing engine is Apache Spark, with consistent support of SparkSQL.
Zeppelin is distributed as a binary package which includes an embedded
webserver, the application itself, a set of libraries and
startup/shutdown scripts. No platform-specific installation packages
are provided yet, but it is something we are looking to provide as part
of the Apache Bigtop integration.
The project codebase is currently hosted at github.com, which will form
the basis of the Apache git repository.

=== Meritocracy ===
Zeppelin is an open source project that already leverages meritocracy
principles. It was started by a handful of people and now it has
multiple contributors, although as the number of contributions grows we
want to build a diverse developer and user community that is governed
b

Re: Votes for git repos - commit id vs tag

2014-12-20 Thread Bertrand Delacretaz
On Sat, Dec 20, 2014 at 7:16 AM, Niclas Hedhman wrote:
> ...Releases are the tarball(s) prepared by the release manager, not a pointer
> into the source control system

Agreed. I also agree with Brane about the pointer into source code
control system being useful for PMC members to check that the released
code is what they expect, but as you say long-term it's only the
signed release tarball that matters.

> ...So, to make this clear to the community, I would discourage publishing
> the commit ID in the vote request, and only provide the URL link to the
> tarball(s)

The way we work in Sling is that the tarball's name points to a
well-known svn tag URL. This matches your idea of having the commit ID
or equivalent somewhere else, but easily accessible. I like that.

OTOH I also like to include the tarball archive's digest (sha1 or
equivalent) in the archived vote thread as that's a long term (*)
guarantee that what you got is what was voted on.

-Bertrand

(*) As long as the digest algorithm is not broken, that is.
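
For illustration, a minimal Python sketch of the digest check described
above: compute the digest of a downloaded tarball and compare it with
the value quoted in the archived vote thread. The function names, the
chunk size and the SHA-256 default are assumptions made for this sketch;
the thread only says "sha1 or equivalent".

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large tarballs need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_voted_digest(path, voted_digest, algorithm="sha256"):
    """Compare a tarball's digest with the one quoted in the vote thread."""
    # Hex digests are case-insensitive; normalize the quoted value.
    return file_digest(path, algorithm) == voted_digest.strip().lower()
```

Passing algorithm="sha1" reproduces the digest type named in the thread;
the algorithm is a parameter precisely because of the caveat in the
footnote.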

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



Re: Incubator report sign-off

2014-12-20 Thread Branko Čibej
On 19 December 2014 at 18:10, Rich Bowen wrote:
> I certainly don't expect that every mentor has their full attention on a
> podling every month, but I do expect that a podling that cares about its
> incubation will seek out that mentor sign-off, and that the mentors who have
> committed to help a podling into the family will have a few moments every
> few months to look in and approve a report.


I have to disagree. If someone volunteers to be a mentor, they should
commit to checking the podling's progress on a daily basis, not just
once every few months. There are some people on the IPMC who are
mentors to a plethora of podlings; I can never understand how they
expect to do their job, and the fact that there are so many absent
mentors tends to suggest that they don't.

Certainly, we're all volunteers here. But being a volunteer does not
imply that one doesn't have to take their task seriously. If someone
runs out of time, the least I'd expect would be notifying the IPMC and
the podling about that and arranging for a new mentor. Otherwise, I
expect not only monthly (or quarterly) sign-offs, but regular
oversight and moderately active involvement in the community. Because
after all, how are people supposed to learn how things are done
hereabouts, if not from active mentors?

On that note ... I'd propose that, if a mentor does not sign off on a
report, said mentor should be reminded once; if nothing changes, they
should be removed and replaced by someone else.

-- Brane




Re: Votes for git repos - commit id vs tag

2014-12-20 Thread Branko Čibej
On 20.12.2014 07:16, Niclas Hedhman wrote:
> Tags are at best a convenience, and nothing else. But so are commit id,
> since in the long-term, GIT may not prevail and the commit id is in effect
> an internal artifact of Git itself, not the concept of version control
> systems. Compare how commit numbers from Subversion are imported to Git
> repositories, or not... But tags are imported, if the ttb structure in
> subversion is used.

Any release is cut from a current canonical repository, which is always
hosted on ASF infrastructure. The point is that current releases should
be identifiable in the current repository, because anyone who votes on a
release /should/ verify that the tarball matches some state in the repo;
otherwise they don't know what they're signing, and the release isn't
repeatable; that would sort of negate the whole point of version control.
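
As a rough sketch of what "verify that the tarball matches some state in
the repo" could look like, the Python below compares per-file SHA-256
digests of a release tarball against a working copy checked out at the
commit or tag under vote. Everything here is an assumption for
illustration: the function names, the single top-level directory inside
the tarball, and the ignored metadata directories.

```python
import hashlib
import os
import tarfile


def tarball_digests(tar_path, strip_components=1):
    """Map each regular file in the tarball to its SHA-256 hex digest.

    `strip_components` drops leading path elements; the default assumes
    the tarball wraps everything in one top-level directory.
    """
    digests = {}
    with tarfile.open(tar_path) as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            parts = member.name.split("/")[strip_components:]
            if not parts:
                continue
            data = archive.extractfile(member).read()
            digests["/".join(parts)] = hashlib.sha256(data).hexdigest()
    return digests


def checkout_digests(root, ignore=(".git", ".svn")):
    """Map each file under `root` to its SHA-256 hex digest."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune version-control metadata directories in place.
        dirnames[:] = [d for d in dirnames if d not in ignore]
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, "rb") as handle:
                digests[rel] = hashlib.sha256(handle.read()).hexdigest()
    return digests


def compare(tar_path, checkout_root):
    """Report files that differ between the tarball and the checkout."""
    released = tarball_digests(tar_path)
    expected = checkout_digests(checkout_root)
    common = set(released) & set(expected)
    return {
        "only_in_tarball": sorted(set(released) - set(expected)),
        "only_in_checkout": sorted(set(expected) - set(released)),
        "different": sorted(p for p in common if released[p] != expected[p]),
    }
```

An empty report from compare() means the two trees agree file for file.
A non-empty one is not automatically a problem, since a release build
may add or omit files on purpose, but each entry is something a voter
would want explained.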

In the case of Git, the commit-id is the most stable global identifier
for a particular state of the repository. (I say "most stable" because,
in general, Git history is mutable ... sigh).

If at some future date the repository is imported in some shiny new
version control system, that new system is bound to have some kind of
global state identifier, mutable or not; and commit-ids may or may not
be accurately represented by it; but that's completely irrelevant for
current releases. It's marginally relevant for reproducing past
releases, but that can be solved by archiving the whole "old" repository.

-- Brane
