DataSourceV2 sync notes - 24 July 2019

2019-08-06 Thread Ryan Blue
Here are my notes from the last DSv2 sync. Sorry it's a bit late!

*Attendees*:

Ryan Blue
John Zhuge
Raymond McCollum
Terry Kim
Gengliang Wang
Jose Torres
Wenchen Fan
Priyanka Gomatam
Matt Cheah
Russell Spitzer
Burak Yavuz

*Topics*:

   - Check in on blockers
      - Remove SaveMode
      - Reorganize code - waiting for INSERT INTO?
      - Write docs - should be done after 3.0 branching
   - Open PRs
      - V2 session catalog config:
        https://github.com/apache/spark/pull/25104
      - DESCRIBE TABLE: https://github.com/apache/spark/pull/25040
      - INSERT INTO: https://github.com/apache/spark/pull/24832
      - SupportsNamespaces: https://github.com/apache/spark/pull/24560
      - SHOW TABLES: https://github.com/apache/spark/pull/25247
      - DELETE FROM: https://github.com/apache/spark/pull/21308 and
        https://github.com/apache/spark/pull/25115
   - DELETE FROM approach
   - Filter push-down and stats - move to optimizer?
   - Use v2 ALTER TABLE implementations for v1 tables
   - CatalogPlugin changes
   - Reuse the existing Parquet readers?

*Discussion*:

   - Blockers
      - Remove SaveMode from file sources: Blocked by
        TableProvider/CatalogPlugin changes. Doesn’t work with all of the
        USING clauses from v1, like JDBC. Working on a CatalogPlugin fix.
      - Reorganize packages: Blocked by outstanding INSERT INTO PRs
      - Docs: Ryan: docs can be written after branching, so focus should be
        on stability right now
      - Any other blockers? Please send them to Ryan to track
   - V2 session catalog config PR:
      - Wenchen: this will be included in CatalogPlugin changes
   - DESCRIBE TABLE PR:
      - Matt: waiting for review
      - Burak: partitioning is strange, uses “Part 0” instead of names
      - Ryan: there are no names for transform partitions (identity
        partitions use column names)
      - Conclusion: not a big problem since there is no required schema, we
        can update later if better ideas come up
   - INSERT INTO PR:
      - Ryan: ready for another review, DataFrameWriter.insertInto PR will
        follow
   - SupportsNamespaces PR:
      - Ryan: ready for another review
   - SHOW TABLES PR:
      - Terry: there are open questions: what is the current database for
        v2?
      - Ryan: there should be a current namespace in the SessionState. This
        could be per catalog?
      - Conclusion: do not track current namespace per catalog. Reset to a
        catalog default when current catalog changes
      - Ryan: will add a SupportsNamespaces method for the default namespace
        to initialize current.
      - Burak: USE foo.bar could set both
      - What if SupportsNamespaces is not implemented? Default to Seq.empty
      - Terry: should listing methods support search patterns?
      - Ryan: this adds complexity that should be handled by Spark instead
        of complicating the API. There isn’t a performance need to push this
        down because we don’t expect high cardinality for a namespace level.
      - Conclusion: implement in SHOW TABLES exec
      - Terry: how should temporary tables be handled?
      - Wenchen: temporary table is an alias for temporary view. SHOW TABLES
        does list temporary views, v2 should implement the same behavior.
      - Terry: support EXTENDED?
      - Ryan: This can be done later.
   - DELETE FROM PR:
      - Wenchen: DELETE FROM just passes filters to the data source to
        delete
      - Ryan: Instead of a complicated builder, let’s solve just the simple
        case (filters) and not the row-level delete case. If we do that,
        then we can use a simple SupportsDelete interface and put off
        row-level delete design
      - Consensus was to add a SupportsDelete interface for Table and not a
        new builder
   - Stats push-down fix:
      - Ryan: briefly looked into it and this can probably be done earlier,
        in the optimizer by creating a scan early and a special logical plan
        to wrap a scan. This isn’t a good long-term solution but would fix
        stats for the release. Write side would not change.
      - Ryan will submit a PR with the implementation
   - Using ALTER TABLE implementations for v1
      - Burak: Took a stab at this, but ran into problems. Would be nice if
        all DDL for v1 were supported through v2 API
      - DDL doesn’t work with v1 for custom data sources - if the source of
        truth is not Hive
      - Matt: v2 should be used to change the source of truth. v1 behavior
        is to only change the session catalog (e.g., Hive).
      - Matt: is v1 deprecated?
      - Wenchen: not until stable
      - Burak: can’t deprecate yet
      - Burak: CTAS and RTAS could also call v1
      - Ryan: We could build a v2 implementation that calls v1, but only
        append and read could be supported because v1 overwrite behavior is
        unreliable across sources.
   - Ran out of time
      - Wenchen’s CatalogPlugin changes can be discussed next time
      - Ryan will follow up with Raymo
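
The SHOW TABLES conclusions above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the Catalog and
Session classes, their method names, and the use of shell-style globs for
the pattern are hypothetical stand-ins, not Spark's CatalogPlugin or
SupportsNamespaces API. It shows a single current namespace that resets to
the catalog default when the catalog changes, an empty default for a catalog
without namespace support, and pattern matching done on the Spark side
rather than pushed into the catalog.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Sequence, Tuple


class Catalog:
    """Hypothetical stand-in for a v2 catalog plugin (not Spark's API)."""

    def __init__(self, name: str,
                 tables: Dict[Tuple[str, ...], List[str]],
                 default_namespace: Sequence[str] = ()):
        self.name = name
        self.tables = tables
        # An empty tuple models the Seq.empty fallback for catalogs that
        # do not implement namespace support.
        self.default_namespace = tuple(default_namespace)

    def list_tables(self, namespace: Tuple[str, ...]) -> List[str]:
        # No pattern argument: the catalog only lists, the caller filters.
        return list(self.tables.get(namespace, []))


class Session:
    """Tracks one current catalog and one current namespace."""

    def __init__(self, catalogs: Dict[str, Catalog], initial: str):
        self.catalogs = catalogs
        self.use_catalog(initial)

    def use_catalog(self, name: str) -> None:
        # Changing the catalog resets the namespace to that catalog's
        # default; nothing is remembered per catalog.
        self.current_catalog = self.catalogs[name]
        self.current_namespace = self.current_catalog.default_namespace

    def use_namespace(self, namespace: Sequence[str]) -> None:
        self.current_namespace = tuple(namespace)

    def show_tables(self, pattern: str = "*") -> List[str]:
        # Pattern matching happens here, on the "Spark" side. Globs are an
        # assumption for illustration, not Spark's LIKE pattern syntax.
        names = self.current_catalog.list_tables(self.current_namespace)
        return sorted(n for n in names if fnmatchcase(n, pattern))
```

Keeping list_tables free of a pattern argument mirrors the reasoning in the
notes: a namespace level is not expected to have high cardinality, so
filtering after listing keeps the catalog interface small.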

DISCUSS [SPARK-27495] SPIP: Support Stage level resource configuration and scheduling

2019-08-06 Thread Tom Graves
Hey everyone,
I have been working on a proposal for supporting stage-level resource
configuration and scheduling.  The basic idea is to let the user specify
executor and task resource requirements for each stage, so that resources can
be controlled at a finer granularity. One good example is doing some ETL to
preprocess your data in one stage and then feeding that data into an ML
algorithm (like TensorFlow) that runs as a separate stage.  The ETL stage
could need totally different executor/task resources than the ML stage does.
If you are interested, please take a look at the SPIP and give me feedback.
The text for the SPIP is in the JIRA description:
https://issues.apache.org/jira/browse/SPARK-27495
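
To make the ETL-then-ML example concrete, here is a rough Python sketch of
what per-stage requirements could look like as plain data. Every name in it
(StageResources, its fields, tasks_per_executor) is hypothetical and is not
taken from the SPIP or the design doc, and the slot arithmetic is an
assumption made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageResources:
    """Hypothetical per-stage requirements; not an API from the SPIP."""
    executor_cores: int
    executor_memory_gb: int
    task_cpus: int = 1
    task_gpus: int = 0
    executor_gpus: int = 0

    def tasks_per_executor(self) -> int:
        # Assumed rule: concurrency is limited by the scarcest resource.
        slots = self.executor_cores // self.task_cpus
        if self.task_gpus > 0:
            slots = min(slots, self.executor_gpus // self.task_gpus)
        return slots


# ETL stage: many small CPU-only tasks per executor.
etl = StageResources(executor_cores=8, executor_memory_gb=16)

# ML stage: one large GPU task per executor.
ml = StageResources(executor_cores=4, executor_memory_gb=64,
                    task_cpus=4, task_gpus=1, executor_gpus=1)
```

With a single application-wide setting, both stages would have to share one
of these shapes; describing them separately is the finer-grained control the
proposal is after.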

I split the API and Design parts into a Google doc that is linked from the
JIRA.
Thanks,
Tom

Re: Recognizing non-code contributions

2019-08-06 Thread Sean Owen
On Tue, Aug 6, 2019 at 11:45 AM Myrle Krantz  wrote:
> I had understood your position to be that you would be willing to make at 
> least some non-coding contributors to committers but that your "line" is 
> somewhat different than my own.   My response to you assumed that position on 
> your part.  I do not think it's good for a project to accept absolutely no 
> non-code committers.  If nothing else, it violates my sense of fairness, both 
> towards those contributors, and also towards the ASF which relies on a 
> pipeline of non-code contributors who come to us through the projects.

Oh OK, I thought this argument was made repeatedly: someone who has
not and evidently will not ever commit anything to a project doesn't
seem to need the commit bit. Agree to disagree. That was the
'non-code' definition?

Someone who contributes docs to the project? Sure. We actually have
done this, albeit for build and config contributions. Agree.

Pardon a complicated analogy to explain my thinking, but: let's say
the space of acceptable decisions on adding committers at the ASF
ranges from 1 (Super Aggressive) to 10 (Very Conservative). Most
project decisions probably fall in, say, 3 to 7. Here we're debating
whether a project should theoretically at times go all the way to 1,
or at most 2, and I think that's just not that important. We're pretty
much agreeing 2 is not out of the question, 1 we agree to disagree.

Spark decisions here are probably 5-7 on average. I'd like it to be more
like 4-6 personally. I suspect the real inbound argument is: all projects
should be making all decisions in 1-3 or else it isn't The Apache Way.
I accept anecdotes that projects function well in that range, but,
Spark and Hadoop don't seem to (nor evidently Cassandra). I have a
hard time rationalizing this. These are, after all, some of the
biggest and most successful projects at Apache. At times it sounds
like concern trolling, to 'help' these projects not fall apart.

If so, you read correctly that there is a significant difference of
opinion here, but that's what it is. Not the theoretical debate above.

Spark should shift, but equally, so should this viewpoint from some at
the ASF, as I find my caricature of it equally suboptimal.
Shred that analogy as you like, but it explains what's in my head.


> For more documentation on the definition of a committer at Apache, read here: 
> https://community.apache.org/contributors/  "Being a committer does not 
> necessarily mean you commit code, it means you are committed to the project 
> and are productively contributing to its success."

Per above, I just don't think this statement should be in the canon,
and would prefer to clarify it, but hey it is there and I accept it.
Still: what's committed? I'd define committed to the project, as,
well, working on the project's output. It just punts the question.


> I also don't yet see a "serious offense" here.  My e-mail to board@ is simply 
> a heads up, which I do owe the rest of the board when I'm interacting with 
> one of our projects.  Here are my exact words: "Most of that discussion is 
> fairly harmless.  Some of it, I have found concerning."  Right now, I'm still 
> trying to approach Spark's position with a learning-and-teaching mindset.

I'm nitpicking your words, which are by themselves reasonable. I think
learning-and-teaching is just the right attitude.
But have you heard different ideas here that are valid or merely "not
harmful"? Are the ideas you don't share just not your choice or
"concerning"?

I'm afraid it primes people to drive by to feel good delivering the
safe, conventional mom-and-apple-pie ideals: what are you afraid of?
what is your problem with openness? why do you hate freedom and The
Apache Way? We'll have another round of throw-the-bums-out,
shut-it-all-down threads. These aren't wrong ideals. It just generates
no useful discussion, and is patronizing. I find it hard to dissent
reasonably.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Recognizing non-code contributions

2019-08-06 Thread Myrle Krantz
On Tue, Aug 6, 2019 at 6:11 PM Sean Owen  wrote:

> On Tue, Aug 6, 2019 at 10:46 AM Myrle Krantz  wrote:
> >> You can tell there's a range of opinions here. I'm probably less
> >> 'conservative' about adding committers than most on the PMC, right or
> >> wrong, but more conservative than some at the ASF. I think there's
> >> room to inch towards the middle ground here and this is good
> >> discussion informing the thinking.
> >
> >
> > That's not actually my current reading of the Spark community.  My
> current reading based on the responses of Hyukjin, and Jungtaek, is that
> your community wouldn't take a non-coding committer no matter how clear
> their contributions are to the community, and that by extension such a
> person could never become a PMC member.
> >
> > If my reading is correct (and the sample size *is* still quite small,
> and only includes one PMC member), I see that as a serious problem.
>
> Again if "non-code" means "no interaction with the project repo", no I
> do not hear support for making said person a committer for all the
> reasons you've heard here. I don't support it.
>
> Wait, didn't we just get done agreeing that's a reasonable position if
> not one you hold? I'm quite confused.
> It's fine to invite the board, members to come participate here as you
> have just done separately, but you're now portraying this as a serious
> offense, despite your comments here?
>

I think both representations of my position are inaccurate.

I had understood your position to be that you would be willing to make at
least some non-coding contributors to committers but that your "line" is
somewhat different than my own.   My response to you assumed that position
on your part.  I do not think it's good for a project to accept absolutely
no non-code committers.  If nothing else, it violates my sense of fairness,
both towards those contributors, and also towards the ASF which relies on a
pipeline of non-code contributors who come to us through the projects.

For more documentation on the definition of a committer at Apache, read
here: https://community.apache.org/contributors/  "Being a committer does
not necessarily mean you commit code, it means you are committed to the
project and are productively contributing to its success."

I also don't yet see a "serious offense" here.  My e-mail to board@ is
simply a heads up, which I do owe the rest of the board when I'm
interacting with one of our projects.  Here are my exact words: "Most of
that discussion is fairly harmless.  Some of it, I have found concerning."
Right now, I'm still trying to approach Spark's position with a
learning-and-teaching mindset.

Does that make it clearer?

Best Regards,
Myrle


Re: Recognizing non-code contributions

2019-08-06 Thread Sean Owen
On Tue, Aug 6, 2019 at 10:46 AM Myrle Krantz  wrote:
>> You can tell there's a range of opinions here. I'm probably less
>> 'conservative' about adding committers than most on the PMC, right or
>> wrong, but more conservative than some at the ASF. I think there's
>> room to inch towards the middle ground here and this is good
>> discussion informing the thinking.
>
>
> That's not actually my current reading of the Spark community.  My current 
> reading based on the responses of Hyukjin, and Jungtaek, is that your 
> community wouldn't take a non-coding committer no matter how clear their 
> contributions are to the community, and that by extension such a person could 
> never become a PMC member.
>
> If my reading is correct (and the sample size *is* still quite small, and 
> only includes one PMC member), I see that as a serious problem.

Again if "non-code" means "no interaction with the project repo", no I
do not hear support for making said person a committer for all the
reasons you've heard here. I don't support it.

Wait, didn't we just get done agreeing that's a reasonable position if
not one you hold? I'm quite confused.
It's fine to invite the board, members to come participate here as you
have just done separately, but you're now portraying this as a serious
offense, despite your comments here?




Re: Recognizing non-code contributions

2019-08-06 Thread Hyukjin Kwon
> I wonder which projects nominate non-coding-only committers, but I at least
know multiple projects. They all have that serious problem then.

I mean, I know multiple projects that don't do that and, according to what
you said, they all have that serious problem.



Re: Recognizing non-code contributions

2019-08-06 Thread Hyukjin Kwon
Well, actually I am rather less conservative about adding committers. There
are multiple people who are active in both non-coding and coding activities.
As an example, I am one of the Korean meetup admins, and my main focus was
managing JIRA and, in addition, reviewing PRs that were not being reviewed.
As I said at the very beginning, I think committers should ideally be
familiar with development to some degree, as the primary criterion. Other
contributions should be counted too.

I wonder which projects nominate non-coding-only committers, but I at least
know multiple projects. They all have that serious problem then.



Re: Recognizing non-code contributions

2019-08-06 Thread Holden Karau
So I’d like to add non-coding committers; I think there is great value in
both recognizing them and eventually having a broader PMC (e.g., maybe
someone who’s put a lot of time into teaching Spark has important things to
say about a proposed release, perhaps important enough for a binding vote).

That being said I think that my view is not aligned with the rest of the
PMC and I believe it is important that we work together so I’m ok with
exploring things like Spark VIP (I think in Kafka land there is an MVP of
the year concept but I need to talk with folks over there more for ideas on
how they run it).

I like to think the Apache way is broad enough to allow for variations like
this to be explored in different projects while sharing what has worked and
not worked historically so that we can all build healthy OSS communities.

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Recognizing non-code contributions

2019-08-06 Thread Myrle Krantz
On Tue, Aug 6, 2019 at 5:36 PM Sean Owen  wrote:

> You can tell there's a range of opinions here. I'm probably less
> 'conservative' about adding committers than most on the PMC, right or
> wrong, but more conservative than some at the ASF. I think there's
> room to inch towards the middle ground here and this is good
> discussion informing the thinking.
>

That's not actually my current reading of the Spark community.  My current
reading based on the responses of Hyukjin, and Jungtaek, is that your
community wouldn't take a non-coding committer no matter how clear their
contributions are to the community, and that by extension such a person
could never become a PMC member.

If my reading is correct (and the sample size *is* still quite small, and
only includes one PMC member), I see that as a serious problem.

How do the other PMC members and community members see this?

Best Regards,
Myrle


Re: Recognizing non-code contributions

2019-08-06 Thread Sean Owen
On Tue, Aug 6, 2019 at 1:14 AM Myrle Krantz  wrote:
> If someone makes a commit who you are not expecting to make a commit, or in 
> an area you weren't expecting changes in, you'll notice that, right?

Not counterarguments, but just more color on the hesitation:

- Probably, but it's less obvious on a big project than a small one!
- More likely: person commits in an area they know, and it breaks
something elsewhere unexpectedly. Tests can catch most but not all of
this. That's a risk everywhere though.
- Or: most commits aren't _authored_ by committers here (I think?) but
_merged_ by committers. It's still possible to watch for this, but
harder.
- It's harder to retroactively review commits and revert if needed,
not impossible

Honestly I think most of Spark is run with the attitude you describe.
It's the core and SQL modules that generate the worry, because
correctness and semantics are much more sensitive there.

I know at one time we tried the notion of 'maintainers' for areas of
the code, where it was strongly suggested that a change to module X be
reviewed by one of a few experts on that part of the code. It was
never really enforced and so dropped. I recall it was a little
controversial at the time: why are you trying to create committers
with more/less power over the code? Yet I think that's the substance
of the 'what harm can it do?' argument: just make sure committers
stick to their appropriate area and then there's really no worry.


> In order to do that, you'd need to create this kind of in-between status 
> Apache-wide.  I would be very much opposed to doing that for a couple of 
> reasons:
> * It adds complexity for infra and events with no clear benefits to the 
> projects.
> * The risk of creating a second-class status at the ASF is just too high for 
> my comfort.

This is a good practical argument. I'd still push back on the idea
that there is no benefit (cf. supra), but it may not be worth the
headache or simple admin overhead.
I don't buy the second-class citizen argument so much. We already have
a board, PMC, committer statuses. It's just that we're used to these
tiers.
I don't think you're being dogmatic about it, but, some are, some of
the same people railing against squashing new ideas because they're
different!


> Documents are IP.  You're better off if you have an ICLA for that stuff, 
> regardless of where it lands in your project content.  And the most natural 
> point in time to request an ICLA is at committer invitation.

Side point: ICLAs are always nice-to-have but not required, no? Most
contributions don't come from those with an ICLA nor would we want to
block contributions on signing one. It's just orthogonal. I take your
point that a regular contributor probably should sign an ICLA, and a
regular contributor should be a committer, just not that they should
be a committer because they should sign an ICLA.


You can tell there's a range of opinions here. I'm probably less
'conservative' about adding committers than most on the PMC, right or
wrong, but more conservative than some at the ASF. I think there's
room to inch towards the middle ground here and this is good
discussion informing the thinking.




CVE-2019-10099: Apache Spark unencrypted data on local disk

2019-08-06 Thread Imran Rashid
Severity: Important

Vendor: The Apache Software Foundation

Versions affected:
All Spark 1.x, Spark 2.0.x, Spark 2.1.x, and 2.2.x versions
Spark 2.3.0 to 2.3.2


Description:
Prior to Spark 2.3.3, in certain situations Spark would write user data to
local disk unencrypted, even if spark.io.encryption.enabled=true.  This
includes cached blocks that are fetched to disk (controlled by
spark.maxRemoteBlockSizeFetchToMem); in SparkR, using parallelize; in
PySpark, using broadcast and parallelize; and use of Python UDFs.


Mitigation:
1.x, 2.0.x, 2.1.x, 2.2.x, and 2.3.x users should upgrade to 2.3.3 or newer,
including 2.4.x.
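
For operators auditing a fleet, the affected range above can be turned into
a small check. The sketch below is not an official tool: the function names
are made up for illustration, and it assumes version strings begin with
major.minor.patch (a suffix such as -SNAPSHOT is ignored).

```python
import re
from typing import Tuple

# First release containing the fix, per the advisory text above.
FIXED_IN = (2, 3, 3)


def parse_version(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from the start of a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized Spark version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def is_affected(version: str) -> bool:
    """True for versions the advisory lists as affected (before 2.3.3)."""
    return parse_version(version) < FIXED_IN
```

Tuple comparison covers every listed range at once: all 1.x, 2.0.x, 2.1.x,
and 2.2.x releases and 2.3.0 through 2.3.2 sort below (2, 3, 3), while 2.3.3
and 2.4.x do not.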

Credit:
This issue was reported by Thomas Graves of NVIDIA.

References:
https://spark.apache.org/security.html
https://issues.apache.org/jira/browse/SPARK-28626


Re: Recognizing non-code contributions

2019-08-06 Thread Jungtaek Lim
My 2 cents as just one of the contributors to the Apache Spark project.

The question is: what is the benefit, for both contributors and PMC members,
of granting committership to non-code contributors? I'd rather say someone is
a good candidate to be invited as a committer, to co-maintain a part of the
code repository, if their non-code contributions (like documentation) have
been happening in the code repository. Assuming we grant committership to
major contributors to documentation, they would maintain only the doc area
unless they are confident in the area they are reviewing. In many cases,
major contributions to documentation require a substantial technical
understanding of the project (I assume we're not talking about fixing
typos), which also demonstrates knowledge of that area.

On the other side, if we are talking about non-code contributors who
contribute "outside" of the repository, I'd say there is less (or even no)
benefit in granting committership. In that case, granting write privilege
doesn't make their contributions any easier. There is no benefit for PMC
members either. For me, the original meaning of "committership" is just
"write privilege on the repository". While committership in the ASF carries
more roles and responsibilities, as well as more benefits, I'd rather
reconsider what the real value is if the reason for granting committership
doesn't match the original meaning.

If we would like to use "committership" as recognition of major
contributions to the project in any form, I'd love to see some other
approach (like VIP? I'm actually not sure what it meant in the previous
mail) to do so. Let's focus on the original meaning of "committership", and
not couple it with providing an Apache email address or giving access to the
various benefits that ASF committers enjoy. I hope there's another way to
provide these benefits without granting "unnecessary" privilege.

-Jungtaek Lim (HeartSaVioR)



Re: Recognizing non-code contributions

2019-08-06 Thread Hyukjin Kwon
I usually make such judgements about the commit bit based on community
activity in coding and reviewing.
If somebody has no such activity, I would have no way to know about this
person. I simply can't make a judgement about coding activity based on
non-coding activity.

Those bugs and commits are pretty critical in this project, as I described.
I would rather try to decrease that possibility, not increase it, especially
when such a "commit bit" is unnecessary.

We have found and discussed other, nicer ways to recognise them, for
instance, listing them somewhere else on the Spark website.
Once they are on that list, I suspect it's easier and closer to
committership to, say, get an Apache email if it matters.

Shall we avoid such possibilities altogether and go for these other, safer
ways? I think you also accept that the commit bit is unnecessary in this
case. So let's not give it out unnecessarily, given how critical it is in
this project.

> Based on this argumentation you will never invite any committers or even
merge any pull requests.
BTW, how did you reach that conclusion? I want somebody who can review PRs
and fix such bugs, rather than someone who is more likely to make such
mistakes.


On Tue, Aug 6, 2019 at 7:26 PM, Myrle Krantz wrote:

> Hey Hyukjin,
>
> Apologies for sending this to you twice.  : o)
>
> On Tue, Aug 6, 2019 at 9:55 AM Hyukjin Kwon  wrote:
>
>> Myrle,
>>
>> > We need to balance two sets of risks here.  But in the case of access
>> > to our software artifacts, the risk is very small, and already has
>> > *multiple* mitigating factors, from the fact that all changes are tracked
>> > to an individual, to the fact that there are notifications sent when
>> > changes are made, (and I'm going to stop listing the benefits of a modern
>> > source control system here, because I know you are aware of them), on
>> > through the fact that you have automated tests, and continuing through the
>> > fact that there is a release process during which artifacts get checked
>> > again.
>> > If someone makes a commit who you are not expecting to make a commit,
>> > or in an area you weren't expecting changes in, you'll notice that, right?
>> > What you're talking about here is your security model for your source
>> > repository.  But restricting access isn't really the right security model
>> > for an open source project.
>>
>> I don't quite get the argument about commit bit. I _strongly_ disagree
>> about "the risk is very small,".
>> Not all of committers track all the changes. There are so many changes in
>> the upstream and it's already overhead to check all.
>> Do you know how many bugs Spark faces due to such lack of reviews that
>> entirely blocks the release sometimes, and how much it takes time to fix up
>> such commits?
>> We need expertise and familiarity to Spark.
>>
>
> Let's unroll that a bit.  Say that you invite a non-coding contributor to
> be a committer.  To make an inappropriate commit two things would have to
> happen: this person would have to decide to make the commit, and this
> person would have to set up access to the git repository, either by
> enabling gitbox integration, or accessing the apache git repository
> directly.  Before you invite them you make an estimation of the probability
> that they would do the first: that is decide to make an inappropriate
> commit.  You decide that that is fairly unlikely.  But for a non-coding
> contributor the chances of them actually going through the mechanics of
> making a commit is even more unlikely.  I think we can safely assume that
> the chance of someone who you've determined is committed to the community
> and knows their limits of doing this is simply 00.00%.
>
> That leaves the question of what the chance is that this person will leak
> their credentials to a malicious third party intent on introducing bugs
> into Spark code.  Do you believe there are such malicious third parties?
> How many attacks have there been on Spark committer credentials?  I believe
> the likelihood of this happening is 00.00% (but I am willing to be swayed
> by evidence otherwise -- should probably be discussed on the private@
> list though if it's out there.: o).
>
> But let's say I'm wrong about both of those probabilities.  Let's say the
> combined probability of one of those two things happening is actually
> 0.01%.  This is where the advantages of modern source control and tests
> come in.  Even if there's only a 50% chance that watching commits will
> catch the error, and only a further 50% chance that tests will catch the
> error, and only a further 50% chance that the error will be caught in
> release testing, those chances multiply out at 00.00125%.
>
> Based on those guestimates the risk is somewhere between 00.00% and
> 00.00125%.  The risk is very small.  You take bigger risks every day in
> order to move your project forward.
>
>
>> It virtually means we will add some more overhead to audit each commit,
>> even for committers'. Why should we bother add such overhead 

Re: Recognizing non-code contributions

2019-08-06 Thread Myrle Krantz
Hey Hyukjin,

Apologies for sending this to you twice.  : o)

On Tue, Aug 6, 2019 at 9:55 AM Hyukjin Kwon  wrote:

> Myrle,
>
> > We need to balance two sets of risks here.  But in the case of access to
> > our software artifacts, the risk is very small, and already has *multiple*
> > mitigating factors, from the fact that all changes are tracked to an
> > individual, to the fact that there are notifications sent when changes are
> > made, (and I'm going to stop listing the benefits of a modern source
> > control system here, because I know you are aware of them), on through the
> > fact that you have automated tests, and continuing through the fact that
> > there is a release process during which artifacts get checked again.
> > If someone makes a commit who you are not expecting to make a commit, or
> > in an area you weren't expecting changes in, you'll notice that, right?
> > What you're talking about here is your security model for your source
> > repository.  But restricting access isn't really the right security model
> > for an open source project.
>
> I don't quite get the argument about commit bit. I _strongly_ disagree
> about "the risk is very small,".
> Not all of committers track all the changes. There are so many changes in
> the upstream and it's already overhead to check all.
> Do you know how many bugs Spark faces due to such lack of reviews that
> entirely blocks the release sometimes, and how much it takes time to fix up
> such commits?
> We need expertise and familiarity to Spark.
>

Let's unroll that a bit.  Say that you invite a non-coding contributor to
be a committer.  To make an inappropriate commit two things would have to
happen: this person would have to decide to make the commit, and this
person would have to set up access to the git repository, either by
enabling gitbox integration, or accessing the apache git repository
directly.  Before you invite them you make an estimation of the probability
that they would do the first: that is decide to make an inappropriate
commit.  You decide that that is fairly unlikely.  But for a non-coding
contributor the chance of them actually going through the mechanics of
making a commit is even lower.  I think we can safely assume that the
chance of this being done by someone who you've determined is committed
to the community and knows their limits is simply 00.00%.

That leaves the question of what the chance is that this person will leak
their credentials to a malicious third party intent on introducing bugs
into Spark code.  Do you believe there are such malicious third parties?
How many attacks have there been on Spark committer credentials?  I believe
the likelihood of this happening is 00.00% (but I am willing to be swayed
by evidence otherwise -- should probably be discussed on the private@ list
though if it's out there.: o).

But let's say I'm wrong about both of those probabilities.  Let's say the
combined probability of one of those two things happening is actually
0.01%.  This is where the advantages of modern source control and tests
come in.  Even if there's only a 50% chance that watching commits will
catch the error, and only a further 50% chance that tests will catch the
error, and only a further 50% chance that the error will be caught in
release testing, those chances multiply out to 00.00125% (0.01% x 0.5 x
0.5 x 0.5).

Based on those guesstimates the risk is somewhere between 00.00% and
00.00125%.  The risk is very small.  You take bigger risks every day in
order to move your project forward.
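
As a rough check of the arithmetic above, here is a minimal sketch. It
assumes the three checks miss independently of each other, and the names
used are purely illustrative (they are not from any Spark or ASF tooling):

```python
from fractions import Fraction

# Assumed inputs, taken from the guesstimates in the message above.
p_bad_commit = Fraction(1, 10000)        # 0.01% chance a bad commit is attempted
p_miss_per_check = [Fraction(1, 2)] * 3  # commit watching, tests, release testing


def residual_risk(p_event, miss_probabilities):
    """Probability that the event happens and every check misses it,
    assuming the checks fail independently."""
    risk = p_event
    for p_miss in miss_probabilities:
        risk *= p_miss
    return risk


risk = residual_risk(p_bad_commit, p_miss_per_check)
risk_percent = float(risk * 100)
print(f"{risk_percent:.5f}%")  # prints 0.00125%
```

Exact fractions are used so the result is not blurred by floating-point
rounding; the value is only converted to a float for printing.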


> It virtually means we will add some more overhead to audit each commit,
> even for committers'. Why should we bother add such overhead to harm the
> project?
> To me, this is the most important fact. I don't think we should just count
> the number of positive and negative ones.
>

Based on this argumentation you will never invite any committers or even
merge any pull requests.

But you do invite committers and you do merge pull requests because it's
good for your project.  Because the risk of doing nothing is greater.


> For other reasons, we can just add or discuss about the "this kind of
> in-between status Apache-wide", which is a bigger scope than here. You can
> ask it to ASF and discuss further.
>

I can say with considerable confidence: There will be no "in-between"
status Apache-wide.  But if you disagree, and want to start a discussion to
suggest that, d...@community.apache.org is a good place to go with it.

Best Regards,
Myrle

>


Re: Recognizing non-code contributions

2019-08-06 Thread Hyukjin Kwon
So, here's my thought:

1. Back to the original point, for recognition of such people, I think we
can simply list such people somewhere on the Spark website. For instance,

  Person A: Spark Book
  Person B: Meetup leader

I don't know if the ASF allows this. Someone needs to check it.


2. If we need the in-between status officially (e.g. Apache email or
something), it should be raised and discussed at the ASF, not in a single
project here.


On Tue, Aug 6, 2019 at 4:55 PM, Hyukjin Kwon wrote:

> Myrle,
>
> > We need to balance two sets of risks here.  But in the case of access to
> > our software artifacts, the risk is very small, and already has *multiple*
> > mitigating factors, from the fact that all changes are tracked to an
> > individual, to the fact that there are notifications sent when changes are
> > made, (and I'm going to stop listing the benefits of a modern source
> > control system here, because I know you are aware of them), on through the
> > fact that you have automated tests, and continuing through the fact that
> > there is a release process during which artifacts get checked again.
> > If someone makes a commit who you are not expecting to make a commit, or
> > in an area you weren't expecting changes in, you'll notice that, right?
> > What you're talking about here is your security model for your source
> > repository.  But restricting access isn't really the right security model
> > for an open source project.
>
> I don't quite get the argument about commit bit. I _strongly_ disagree
> about "the risk is very small,".
> Not all of committers track all the changes. There are so many changes in
> the upstream and it's already overhead to check all.
> Do you know how many bugs Spark faces due to such lack of reviews that
> entirely blocks the release sometimes, and how much it takes time to fix up
> such commits?
> We need expertise and familiarity to Spark.
>
> It virtually means we will add some more overhead to audit each commit,
> even for committers'. Why should we bother add such overhead to harm the
> project?
> To me, this is the most important fact. I don't think we should just count
> the number of positive and negative ones.
>
> For other reasons, we can just add or discuss about the "this kind of
> in-between status Apache-wide", which is a bigger scope than here. You can
> ask it to ASF and discuss further.
>
>
> On Tue, Aug 6, 2019 at 3:14 PM, Myrle Krantz wrote:
>
>> Hey Sean,
>>
>> Even though we are discussing our differences, on the whole I don't think
>> we're that far apart in our positions.  Still the differences are where the
>> conversation is actually interesting, so here goes:
>>
>> On Mon, Aug 5, 2019 at 3:55 PM Sean Owen  wrote:
>>
>>> On Mon, Aug 5, 2019 at 3:50 AM Myrle Krantz  wrote:
>>> > So... events coordinators?  I'd still make them committers.  I guess
>>> > I'm still struggling to understand what problem making people VIP's without
>>> > giving them committership is trying to solve.
>>>
>>> We may just agree to disagree, which is fine, but I think the argument
>>> is clear enough: such a person has zero need for the commit bit.
>>> Turning it around, what are we trying to accomplish by giving said
>>> person a commit bit? I know people say there's no harm, but I think
>>> there is at least _some_ downside. We're widening access to change
>>> software artifacts, the main thing that we put ASF process and checks
>>> around for liability reasons. I know the point is trust, and said
>>> person is likely to understand to never use the commit bit, but it
>>> brings us back to the same place. I don't wish to convince anyone else
>>> of my stance, though I do find it more logical, just that it's
>>> reasonable within The Apache Way.
>>>
>>
>> We need to balance two sets of risks here.  But in the case of access to
>> our software artifacts, the risk is very small, and already has *multiple*
>> mitigating factors, from the fact that all changes are tracked to an
>> individual, to the fact that there are notifications sent when changes are
>> made, (and I'm going to stop listing the benefits of a modern source
>> control system here, because I know you are aware of them), on through the
>> fact that you have automated tests, and continuing through the fact that
>> there is a release process during which artifacts get checked again.
>>
>> If someone makes a commit who you are not expecting to make a commit, or
>> in an area you weren't expecting changes in, you'll notice that, right?
>>
>> What you're talking about here is your security model for your source
>> repository.  But restricting access isn't really the right security model
>> for an open source project.
>>
>>
>>> > It also just occurred to me this morning: There are actually other
>>> > privileges which go along with the "commit-bit" other than the ability to
>>> > commit at will to the project's repos: people who are committers get an
>>> > Apache e-mail address, and they get discounted entry to ApacheCon.  People
>>> > who are committers also get added

Re: Recognizing non-code contributions

2019-08-06 Thread Hyukjin Kwon
Myrle,

> We need to balance two sets of risks here.  But in the case of access to
> our software artifacts, the risk is very small, and already has *multiple*
> mitigating factors, from the fact that all changes are tracked to an
> individual, to the fact that there are notifications sent when changes are
> made, (and I'm going to stop listing the benefits of a modern source
> control system here, because I know you are aware of them), on through the
> fact that you have automated tests, and continuing through the fact that
> there is a release process during which artifacts get checked again.
> If someone makes a commit who you are not expecting to make a commit, or
> in an area you weren't expecting changes in, you'll notice that, right?
> What you're talking about here is your security model for your source
> repository.  But restricting access isn't really the right security model
> for an open source project.

I don't quite get the argument about the commit bit. I _strongly_ disagree
that "the risk is very small".
Not all committers track all the changes. There are so many changes
upstream that checking them all is already an overhead.
Do you know how many bugs Spark faces due to such lack of review, which
sometimes block a release entirely, and how much time it takes to fix up
such commits?
We need expertise and familiarity with Spark.

It effectively means we will add more overhead to audit each commit, even
committers' commits. Why should we bother adding such overhead, which harms
the project?
To me, this is the most important fact. I don't think we should just count
the number of positives and negatives.

For the other reasons, we can propose or discuss "this kind of in-between
status Apache-wide", which is a bigger scope than this project. You can
raise it with the ASF and discuss it further.


On Tue, Aug 6, 2019 at 3:14 PM, Myrle Krantz wrote:

> Hey Sean,
>
> Even though we are discussing our differences, on the whole I don't think
> we're that far apart in our positions.  Still the differences are where the
> conversation is actually interesting, so here goes:
>
> On Mon, Aug 5, 2019 at 3:55 PM Sean Owen  wrote:
>
>> On Mon, Aug 5, 2019 at 3:50 AM Myrle Krantz  wrote:
>> > So... events coordinators?  I'd still make them committers.  I guess
>> > I'm still struggling to understand what problem making people VIP's without
>> > giving them committership is trying to solve.
>>
>> We may just agree to disagree, which is fine, but I think the argument
>> is clear enough: such a person has zero need for the commit bit.
>> Turning it around, what are we trying to accomplish by giving said
>> person a commit bit? I know people say there's no harm, but I think
>> there is at least _some_ downside. We're widening access to change
>> software artifacts, the main thing that we put ASF process and checks
>> around for liability reasons. I know the point is trust, and said
>> person is likely to understand to never use the commit bit, but it
>> brings us back to the same place. I don't wish to convince anyone else
>> of my stance, though I do find it more logical, just that it's
>> reasonable within The Apache Way.
>>
>
> We need to balance two sets of risks here.  But in the case of access to
> our software artifacts, the risk is very small, and already has *multiple*
> mitigating factors, from the fact that all changes are tracked to an
> individual, to the fact that there are notifications sent when changes are
> made, (and I'm going to stop listing the benefits of a modern source
> control system here, because I know you are aware of them), on through the
> fact that you have automated tests, and continuing through the fact that
> there is a release process during which artifacts get checked again.
>
> If someone makes a commit who you are not expecting to make a commit, or
> in an area you weren't expecting changes in, you'll notice that, right?
>
> What you're talking about here is your security model for your source
> repository.  But restricting access isn't really the right security model
> for an open source project.
>
>
>> > It also just occurred to me this morning: There are actually other
>> > privileges which go along with the "commit-bit" other than the ability to
>> > commit at will to the project's repos: people who are committers get an
>> > Apache e-mail address, and they get discounted entry to ApacheCon.  People
>> > who are committers also get added to our committers mailing list, and are
>> > thus a little easier to integrate into our foundation-wide efforts.
>> >
>> > To apply this to the example above, the Apache e-mail address can make
>> > it a tad easier for an event coordinator to conduct official business for a
>> > project.
>>
>> Great points. Again, if I'm making it up, a "VIP" should get an Apache
>> email address and discounts. Sure, why not put them on a committers@
>> list too for visibility.
>>
>
> In order to do that, you'd need to create this kind of in-between status
> Apache-wide.  I would be very much opposed to doing that fo