Re: Spark 2.4.2

2019-04-18 Thread Wenchen Fan
I've cut RC1. If people think we must upgrade Jackson in 2.4, I can cut RC2
shortly.

Thanks,
Wenchen

On Fri, Apr 19, 2019 at 3:32 AM Felix Cheung 
wrote:

> Re shading - same argument I’ve made earlier today in a PR...
>
> (Context: in many cases Spark has light or indirect dependencies, but
> bringing them into the process easily breaks users' code)
>
>
> --
> *From:* Michael Heuer 
> *Sent:* Thursday, April 18, 2019 6:41 AM
> *To:* Reynold Xin
> *Cc:* Sean Owen; Michael Armbrust; Ryan Blue; Spark Dev List; Wenchen
> Fan; Xiao Li
> *Subject:* Re: Spark 2.4.2
>
> +100
>
>
> On Apr 18, 2019, at 1:48 AM, Reynold Xin  wrote:
>
> We should have shaded all Spark’s dependencies :(
>
> On Wed, Apr 17, 2019 at 11:47 PM Sean Owen  wrote:
>
>> For users that would inherit Jackson and use it directly, or whose
>> dependencies do. Spark itself (with modifications) should be OK with
>> the change.
>> It's risky and normally I wouldn't backport, except that I've heard a
>> few times about concerns about CVEs affecting jackson-databind, so I'm
>> wondering who else out there might have an opinion. I'm not pushing for
>> it necessarily.
>>
>> On Wed, Apr 17, 2019 at 6:18 PM Reynold Xin  wrote:
>> >
>> > For Jackson - are you worrying about JSON parsing for users or internal
>> Spark functionality breaking?
>> >
>> > On Wed, Apr 17, 2019 at 6:02 PM Sean Owen  wrote:
>> >>
>> >> There's only one other item on my radar, which is considering updating
>> >> Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come up
>> >> a few times now that there are a number of CVEs open for 2.6.7. Cons:
>> >> not clear they affect Spark, and Jackson 2.6->2.9 does change Jackson
>> >> behavior non-trivially. That said back-porting the update PR to 2.4
>> >> worked out OK locally. Any strong opinions on this one?
>> >>
>> >> On Wed, Apr 17, 2019 at 7:49 PM Wenchen Fan 
>> wrote:
>> >> >
>> >> > I volunteer to be the release manager for 2.4.2, as I was also going
>> to propose 2.4.2 because of the reverting of SPARK-25250. Are there any
>> other ongoing bug fixes we want to include in 2.4.2? If not, I'd like to
>> start the release process today (CST).
>> >> >
>> >> > Thanks,
>> >> > Wenchen
>> >> >
>> >> > On Thu, Apr 18, 2019 at 3:44 AM Sean Owen  wrote:
>> >> >>
>> >> >> I think the 'only backport bug fixes to branches' principle remains
>> sound. But what's a bug fix? Something that changes behavior to match what
>> is explicitly supposed to happen, or implicitly supposed to happen --
>> implied by what other similar things do, by reasonable user expectations,
>> or simply how it worked previously.
>> >> >>
>> >> >> Is this a bug fix? I guess the criteria that matches is that
>> behavior doesn't match reasonable user expectations? I don't know enough to
>> have a strong opinion. I also don't think there is currently an objection
>> to backporting it, whatever it's called.
>> >> >>
>> >> >>
>> >> >> Is the question whether this needs a new release? There's no harm
>> in another point release, other than needing a volunteer release manager.
>> One could say, wait a bit longer to see what more info comes in about
>> 2.4.1. But given that 2.4.1 took like 2 months, it's reasonable to move
>> towards a release cycle again. I don't see objection to that either (?)
>> >> >>
>> >> >>
>> >> >> The meta question remains: is a 'bug fix' definition even agreed,
>> and being consistently applied? There aren't correct answers, only best
>> guesses from each person's own experience, judgment and priorities. These
>> can differ even when applied in good faith.
>> >> >>
>> >> >> Sometimes the variance of opinion comes because people have
>> different info that needs to be surfaced. Here, maybe it's best to share
>> what about that offline conversation was convincing, for example.
>> >> >>
>> >> >> I'd say it's also important to separate what one would prefer from
>> what one can't live with(out). Assuming one trusts the intent and
>> experience of the handful of others with an opinion, I'd defer to someone
>> who wants X and will own it, even if I'm moderately against it. Otherwise
>> we'd get little done.
>> >> >>
>> >> >> In that light, it seems like both of the PRs at issue here are not
>> _wrong_ to backport. This is a good pair that highlights why, when there
>> isn't a clear reason to do / not do something (e.g. obvious errors,
>> breaking public APIs) we give benefit-of-the-doubt in order to get it later.
>> >> >>
>> >> >>
>> >> >> On Wed, Apr 17, 2019 at 12:09 PM Ryan Blue <
>> rb...@netflix.com.invalid> wrote:
>> >> >>>
>> >> >>> Sorry, I should be more clear about what I'm trying to say here.
>> >> >>>
>> >> >>> In the past, Xiao has taken the opposite stance. A good example is
>> PR #21060 that was a very similar situation: behavior didn't match what was
>> expected and there was low risk. There was a long argument and the patch
>> didn't make it into 2.3 (to my knowledge).
>> >> >>>
>> >> >>> What we call these low-risk behavior fixes doesn't matter. I called
>> it a bug on #21060 but I'm applying Xiao's previous definition here to make
>> a point. Whatever term we use, we clearly have times when we want to allow
>> a patch because it is low risk and helps someone. Let's just be clear that
>> that's perfectly fine.

[VOTE] Release Apache Spark 2.4.2

2019-04-18 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version
2.4.2.

The vote is open until April 23 PST and passes if a majority of +1 PMC votes
are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.2
[ ] -1 Do not release this package because ...
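The vote rule above (a majority of +1 PMC votes, with at least three +1s) can be sketched in a few lines of Python. The function name and the +1/-1 integer encoding are illustrative only; this is not part of any Apache tooling.

```python
def vote_passes(pmc_votes):
    """pmc_votes: list of +1 / -1 integers cast by PMC members."""
    plus = sum(1 for v in pmc_votes if v == +1)
    minus = sum(1 for v in pmc_votes if v == -1)
    # Needs at least three +1 votes, and more +1s than -1s among votes cast.
    return plus >= 3 and plus > minus

print(vote_passes([+1, +1, +1, -1]))
```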

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.2-rc1 (commit
a44880ba74caab7a987128cb09c4bee41617770a):
https://github.com/apache/spark/tree/v2.4.2-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1322/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-docs/

The list of bug fixes going into 2.4.2 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12344996

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks; in Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
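As a minimal sketch of the PySpark check described above, using only the stdlib `venv` module. The env path and the install commands in the comments are placeholders; the actual RC tarball lives under the dist.apache.org URL given earlier.

```python
import os
import tempfile
import venv

# Create a throwaway virtual env to isolate the RC from any system PySpark.
env_dir = os.path.join(tempfile.mkdtemp(), "spark-rc-env")
venv.create(env_dir, with_pip=False)  # pass with_pip=True to bootstrap pip

# Inside the env you would then install and smoke-test the RC, e.g.:
#   <env>/bin/pip install <path-to-downloaded-pyspark-rc-tarball>
#   <env>/bin/python -c "import pyspark; print(pyspark.__version__)"
print(os.path.isdir(env_dir))
```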

===
What should happen to JIRA tickets still targeting 2.4.2?
===

The current list of open tickets targeted at 2.4.2 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.4.2

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: [SPARK-25079] moving from python 3.4 to python 3.6.8, impacts all active branches

2019-04-18 Thread Bryan Cutler
Great work, thanks Shane!

On Thu, Apr 18, 2019 at 2:46 PM shane knapp  wrote:

> alrighty folks, the future is here and we'll be moving to python 3.6
> monday!
>
> all three PRs are green!
> master PR:  https://github.com/apache/spark/pull/24266
> 2.4 PR:  https://github.com/apache/spark/pull/24379
> 2.3 PR:  https://github.com/apache/spark/pull/24380
>
> more detailed email coming out this afternoon about the upgrade.
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


[SPARK-25079][build system] the future of python3.6 is upon us!

2019-04-18 Thread shane knapp
well, upon us on monday.  :)

firstly, an important note:  if you have an open PR, please check to see if
you need to rebase it on my changes before testing.

monday @ 11am PST, i will begin.  in order:

0) jenkins enters quiet mode, running PRB builds cancelled

1)  existing p3k env on all workers will be updated to python3.6  [1]
1a)  spot-check for the random 'us/pacific-new' bug

2)  remove the TODOs from the three PRs and merge

3)  jenkins exits quiet mode, builds launch

4)  ~5 hours later i'll check back in and make sure we're good.  :)

steps 1-4 shouldn't take more than an hour and i really expect things to be
back up and running pretty quickly.  i will send updates as needed.

shane

[1]  this will be for 2.3/2.4 only, and tests against pandas 0.19.2 and
pyarrow 0.8.0.  master tests against pandas 0.23.2 and pyarrow 0.12.1
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [SPARK-25079] moving from python 3.4 to python 3.6.8, impacts all active branches

2019-04-18 Thread shane knapp
alrighty folks, the future is here and we'll be moving to python 3.6 monday!

all three PRs are green!
master PR:  https://github.com/apache/spark/pull/24266
2.4 PR:  https://github.com/apache/spark/pull/24379
2.3 PR:  https://github.com/apache/spark/pull/24380

more detailed email coming out this afternoon about the upgrade.

shane
--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Open PRs RE: Datasets Typed by Arbitrary Avro

2019-04-18 Thread Aleksander Eskilson
There are now a couple of different pull requests, each attempting to address
the need for an enhancement providing typed Dataset support for Avro
objects. These PRs and their respective JIRA tickets are:

   - https://github.com/apache/spark/pull/22878 :
   https://issues.apache.org/jira/browse/SPARK-25789 (originally in
   Databricks/spark-avro, https://github.com/databricks/spark-avro/pull/217
: https://github.com/databricks/spark-avro/issues/169)
   - https://github.com/apache/spark/pull/24299 :
   https://issues.apache.org/jira/browse/SPARK-27388
   - https://github.com/apache/spark/pull/24367 :
   https://issues.apache.org/jira/browse/SPARK-27457

The approaches differ considerably, and their respective coverage may not
be equal. Some analysis of the tradeoffs, and perhaps a deeper analysis of
workarounds, will be necessary.

Full disclosure: I contributed significantly to Spark#22878/Spark-Avro#217,
so I don't think I'll say more about the topics in this thread, but I would
be looking to Spark committers for some more direction either here or in
the PR threads. I'd be happy to respond to questions from the community.

The topic of, and request for, typed Datasets of Avro goes back to
Spark-Avro#169. I saw relatively recently that that project was folded into
Spark proper, but the need for statically typed Dataset support (as opposed
to dynamically typed DataFrame support) continues.

Hoping a resolution can come out of this visibility.

Aleksander Eskilson
https://github.com/bdrillard


Re: Spark 2.4.2

2019-04-18 Thread Felix Cheung
Re shading - same argument I’ve made earlier today in a PR...

(Context: in many cases Spark has light or indirect dependencies, but
bringing them into the process easily breaks users' code)
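For context, "shading" here means relocating a dependency's classes under a Spark-private package so they cannot collide with the user's own copy of that library. A hedged sketch with the maven-shade-plugin follows; the shadedPattern below is purely illustrative and is not what Spark's build actually uses for Jackson.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <!-- Rewrite Jackson's packages so user code can bring its own version -->
      <relocation>
        <pattern>com.fasterxml.jackson</pattern>
        <shadedPattern>org.sparkproject.jackson</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

With a relocation like this in place, upgrading or keeping the shaded copy no longer affects the Jackson version a user's application resolves.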



From: Michael Heuer 
Sent: Thursday, April 18, 2019 6:41 AM
To: Reynold Xin
Cc: Sean Owen; Michael Armbrust; Ryan Blue; Spark Dev List; Wenchen Fan; Xiao Li
Subject: Re: Spark 2.4.2

+100


On Apr 18, 2019, at 1:48 AM, Reynold Xin <r...@databricks.com> wrote:

We should have shaded all Spark’s dependencies :(

On Wed, Apr 17, 2019 at 11:47 PM Sean Owen <sro...@gmail.com> wrote:
For users that would inherit Jackson and use it directly, or whose
dependencies do. Spark itself (with modifications) should be OK with
the change.
It's risky and normally I wouldn't backport, except that I've heard a
few times about concerns about CVEs affecting jackson-databind, so I'm
wondering who else out there might have an opinion. I'm not pushing for
it necessarily.

On Wed, Apr 17, 2019 at 6:18 PM Reynold Xin <r...@databricks.com> wrote:
>
> For Jackson - are you worrying about JSON parsing for users or internal Spark 
> functionality breaking?
>
> On Wed, Apr 17, 2019 at 6:02 PM Sean Owen <sro...@gmail.com> wrote:
>>
>> There's only one other item on my radar, which is considering updating
>> Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come up
>> a few times now that there are a number of CVEs open for 2.6.7. Cons:
>> not clear they affect Spark, and Jackson 2.6->2.9 does change Jackson
>> behavior non-trivially. That said back-porting the update PR to 2.4
>> worked out OK locally. Any strong opinions on this one?
>>
>> On Wed, Apr 17, 2019 at 7:49 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>> >
>> > I volunteer to be the release manager for 2.4.2, as I was also going to 
>> > propose 2.4.2 because of the reverting of SPARK-25250. Are there any other 
>> > ongoing bug fixes we want to include in 2.4.2? If not, I'd like to start the 
>> > release process today (CST).
>> >
>> > Thanks,
>> > Wenchen
>> >
>> > On Thu, Apr 18, 2019 at 3:44 AM Sean Owen <sro...@gmail.com> wrote:
>> >>
>> >> I think the 'only backport bug fixes to branches' principle remains 
>> >> sound. But what's a bug fix? Something that changes behavior to match 
>> >> what is explicitly supposed to happen, or implicitly supposed to happen 
>> >> -- implied by what other similar things do, by reasonable user 
>> >> expectations, or simply how it worked previously.
>> >>
>> >> Is this a bug fix? I guess the criteria that matches is that behavior 
>> >> doesn't match reasonable user expectations? I don't know enough to have a 
>> >> strong opinion. I also don't think there is currently an objection to 
>> >> backporting it, whatever it's called.
>> >>
>> >>
>> >> Is the question whether this needs a new release? There's no harm in 
>> >> another point release, other than needing a volunteer release manager. 
>> >> One could say, wait a bit longer to see what more info comes in about 
>> >> 2.4.1. But given that 2.4.1 took like 2 months, it's reasonable to move 
>> >> towards a release cycle again. I don't see objection to that either (?)
>> >>
>> >>
>> >> The meta question remains: is a 'bug fix' definition even agreed, and 
>> >> being consistently applied? There aren't correct answers, only best 
>> >> guesses from each person's own experience, judgment and priorities. These 
>> >> can differ even when applied in good faith.
>> >>
>> >> Sometimes the variance of opinion comes because people have different 
>> >> info that needs to be surfaced. Here, maybe it's best to share what about 
>> >> that offline conversation was convincing, for example.
>> >>
>> >> I'd say it's also important to separate what one would prefer from what 
>> >> one can't live with(out). Assuming one trusts the intent and experience 
>> >> of the handful of others with an opinion, I'd defer to someone who wants 
>> >> X and will own it, even if I'm moderately against it. Otherwise we'd get 
>> >> little done.
>> >>
>> >> In that light, it seems like both of the PRs at issue here are not 
>> >> _wrong_ to backport. This is a good pair that highlights why, when there 
>> >> isn't a clear reason to do / not do something (e.g. obvious errors, 
>> >> breaking public APIs) we give benefit-of-the-doubt in order to get it 
>> >> later.
>> >>
>> >>
>> >> On Wed, Apr 17, 2019 at 12:09 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>> >>>
>> >>> Sorry, I should be more clear about what I'm trying to say here.
>> >>>
>> >>> In the past, Xiao has taken the opposite stance. A good example is PR 
>> >>> #21060 that was a very similar situation: behavior didn't match what was 
>> >>> expected and there was low risk. There was a long argument and the patch 
>> >>> didn't make it into 2.3 (to my knowledge).
>> >>>
>> >>> What we call these low-risk behavior fixes doesn't matter. I called it a 
>> >>> bug on #21060 but I'm applying Xiao's previous definition here to make a 
>> >>> point. Whatever term we use, we clearly have times when we want to allow 
>> >>> a patch because it is low risk and helps someone. Let's just be clear 
>> >>> that that's perfectly fine.

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-18 Thread Jason Lowe
+1 (non-binding).  Looking forward to seeing better support for processing
columnar data.

Jason

On Tue, Apr 16, 2019 at 10:38 AM Tom Graves 
wrote:

> Hi everyone,
>
> I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
> extended Columnar Processing Support.  The proposal is to extend the
> support to allow for more columnar processing.
>
> You can find the full proposal in the jira at:
> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
> DISCUSS thread in the dev mailing list.
>
> Please vote as early as you can, I will leave the vote open until next
> Monday (the 22nd), 2pm CST to give people plenty of time.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
>
> Thanks!
> Tom Graves
>
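For readers new to the SPIP's motivation, the row-versus-columnar distinction can be sketched with stdlib Python. Here the `array` module stands in for Arrow buffers or Spark's ColumnVector; none of the names below are part of the proposed API.

```python
from array import array

# Row-oriented: one tuple object per record.
rows = [(i, i * 0.5) for i in range(1000)]

# Column-oriented: each field stored contiguously, as an Arrow-style buffer.
col_id = array("q", (r[0] for r in rows))
col_val = array("d", (r[1] for r in rows))

# A columnar kernel scans one tightly packed buffer instead of 1000 tuples;
# this locality is what vectorized/columnar processing exploits.
total = sum(col_val)
print(total)
```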


Re: Spark 2.4.2

2019-04-18 Thread Michael Heuer
+100


> On Apr 18, 2019, at 1:48 AM, Reynold Xin  wrote:
> 
> We should have shaded all Spark’s dependencies :(
> 
> On Wed, Apr 17, 2019 at 11:47 PM Sean Owen wrote:
> For users that would inherit Jackson and use it directly, or whose
> dependencies do. Spark itself (with modifications) should be OK with
> the change.
> It's risky and normally I wouldn't backport, except that I've heard a
> few times about concerns about CVEs affecting jackson-databind, so I'm
> wondering who else out there might have an opinion. I'm not pushing for
> it necessarily.
> 
> On Wed, Apr 17, 2019 at 6:18 PM Reynold Xin wrote:
> >
> > For Jackson - are you worrying about JSON parsing for users or internal 
> > Spark functionality breaking?
> >
> > On Wed, Apr 17, 2019 at 6:02 PM Sean Owen wrote:
> >>
> >> There's only one other item on my radar, which is considering updating
> >> Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come up
> >> a few times now that there are a number of CVEs open for 2.6.7. Cons:
> >> not clear they affect Spark, and Jackson 2.6->2.9 does change Jackson
> >> behavior non-trivially. That said back-porting the update PR to 2.4
> >> worked out OK locally. Any strong opinions on this one?
> >>
> >> On Wed, Apr 17, 2019 at 7:49 PM Wenchen Fan wrote:
> >> >
> >> > I volunteer to be the release manager for 2.4.2, as I was also going to 
> >> > propose 2.4.2 because of the reverting of SPARK-25250. Are there any 
> >> > other ongoing bug fixes we want to include in 2.4.2? If not, I'd like to 
> >> > start the release process today (CST).
> >> >
> >> > Thanks,
> >> > Wenchen
> >> >
> >> > On Thu, Apr 18, 2019 at 3:44 AM Sean Owen wrote:
> >> >>
> >> >> I think the 'only backport bug fixes to branches' principle remains 
> >> >> sound. But what's a bug fix? Something that changes behavior to match 
> >> >> what is explicitly supposed to happen, or implicitly supposed to happen 
> >> >> -- implied by what other similar things do, by reasonable user 
> >> >> expectations, or simply how it worked previously.
> >> >>
> >> >> Is this a bug fix? I guess the criteria that matches is that behavior 
> >> >> doesn't match reasonable user expectations? I don't know enough to have 
> >> >> a strong opinion. I also don't think there is currently an objection to 
> >> >> backporting it, whatever it's called.
> >> >>
> >> >>
> >> >> Is the question whether this needs a new release? There's no harm in 
> >> >> another point release, other than needing a volunteer release manager. 
> >> >> One could say, wait a bit longer to see what more info comes in about 
> >> >> 2.4.1. But given that 2.4.1 took like 2 months, it's reasonable to move 
> >> >> towards a release cycle again. I don't see objection to that either (?)
> >> >>
> >> >>
> >> >> The meta question remains: is a 'bug fix' definition even agreed, and 
> >> >> being consistently applied? There aren't correct answers, only best 
> >> >> guesses from each person's own experience, judgment and priorities. 
> >> >> These can differ even when applied in good faith.
> >> >>
> >> >> Sometimes the variance of opinion comes because people have different 
> >> >> info that needs to be surfaced. Here, maybe it's best to share what 
> >> >> about that offline conversation was convincing, for example.
> >> >>
> >> >> I'd say it's also important to separate what one would prefer from what 
> >> >> one can't live with(out). Assuming one trusts the intent and experience 
> >> >> of the handful of others with an opinion, I'd defer to someone who 
> >> >> wants X and will own it, even if I'm moderately against it. Otherwise 
> >> >> we'd get little done.
> >> >>
> >> >> In that light, it seems like both of the PRs at issue here are not 
> >> >> _wrong_ to backport. This is a good pair that highlights why, when 
> >> >> there isn't a clear reason to do / not do something (e.g. obvious 
> >> >> errors, breaking public APIs) we give benefit-of-the-doubt in order to 
> >> >> get it later.
> >> >>
> >> >>
> >> >> On Wed, Apr 17, 2019 at 12:09 PM Ryan Blue  
> >> >> wrote:
> >> >>>
> >> >>> Sorry, I should be more clear about what I'm trying to say here.
> >> >>>
> >> >>> In the past, Xiao has taken the opposite stance. A good example is PR 
> >> >>> #21060 that was a very similar situation: behavior didn't match what 
> >> >>> was expected and there was low risk. There was a long argument and the 
> >> >>> patch didn't make it into 2.3 (to my knowledge).
> >> >>>
> >> >>> What we call these low-risk behavior fixes doesn't matter. I called it 
> >> >>> a bug on #21060 but I'm applying Xiao's previous definition here to 
> >> >>> make a point. Whatever term we use, we clearly have times when we want 
> >> >>> to allow a patch because it is low risk and helps someone. Let's just 
> >> >>> be clear that that's perfectly fine.
> >> >>>
> >> >

Re: Thoughts on dataframe cogroup?

2019-04-18 Thread Chris Martin
Yes, totally agreed with Li here.

For clarity, I'm happy to do the work to implement this, but it would be
good to get feedback from the community in general and some of the Spark
committers in particular.

thanks,

Chris
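For readers following along, the cogroup semantics under discussion can be sketched in plain Python. This mirrors RDD.cogroup (which, as Li notes below, has existed from very early on); it is not the proposed DataFrame API itself, and the function and field names are illustrative.

```python
from collections import defaultdict

def cogroup(left, right, key):
    """Pair up the rows of two datasets that share the same key value."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key]][0].append(row)
    for row in right:
        groups[row[key]][1].append(row)
    return dict(groups)

trades = [{"id": "a", "px": 10}, {"id": "b", "px": 20}]
quotes = [{"id": "a", "bid": 9}, {"id": "a", "bid": 11}]
grouped = cogroup(trades, quotes, "id")
# Each key maps to (rows-from-left, rows-from-right); a user function
# (e.g. a pandas UDF in the proposal) would then see both groups at once.
print(sorted(grouped))
```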

On Wed, Apr 17, 2019 at 9:17 PM Li Jin  wrote:

> I have left some comments. This looks like a good proposal to me.
>
> As a heavy pyspark user, this is a pattern that we see over and over again,
> and I think it could be pretty high value to other pyspark users as well. The
> fact that Chris and I came to the same ideas sort of verifies my intuition.
> Also, this isn't really something new; RDD has had a cogroup function from
> very early on.
>
> With that being said, I'd like to call out again for community's feedback
> on the proposal.
>
> On Mon, Apr 15, 2019 at 4:57 PM Chris Martin 
> wrote:
>
>> Ah sorry- I've updated the link which should give you access.  Can you
>> try again now?
>>
>> thanks,
>>
>> Chris
>>
>>
>>
>> On Mon, Apr 15, 2019 at 9:49 PM Li Jin  wrote:
>>
>>> Hi Chris,
>>>
>>> Thanks! The permission to the google doc is maybe not set up properly. I
>>> cannot view the doc by default.
>>>
>>> Li
>>>
>>> On Mon, Apr 15, 2019 at 3:58 PM Chris Martin 
>>> wrote:
>>>
 I've updated the jira so that the main body is now inside a google
 doc.  Anyone should be able to comment- if you want/need write access
 please drop me a mail and I can add you.

 Ryan- regarding your specific point about why I'm not proposing to
 add this to the Scala API, I think the main point is that Scala users can
 already use cogroup for Datasets.  For Scala this is probably a better
 solution as (as far as I know) there is no Scala DataFrame library that
 could be used in place of Pandas for manipulating local DataFrames. As a
 result you'd probably be left with dealing with Iterators of Row objects,
 which almost certainly isn't what you'd want. This is similar to the
 existing grouped map Pandas UDFs, for which there is no equivalent Scala
 API.

 I do think there might be a place for allowing a (Scala) Dataset
 cogroup to take some sort of grouping expression as the grouping key (this
 would mean that you wouldn't have to marshal the key into a JVM object and
 could possibly lend itself to some catalyst optimisations) but I don't
 think that this should be done as part of this SPIP.

 thanks,

 Chris

 On Mon, Apr 15, 2019 at 6:27 PM Ryan Blue  wrote:

> I agree, it would be great to have a document to comment on.
>
> The main thing that stands out right now is that this is only for
> PySpark and states that it will not be added to the Scala API. Why not 
> make
> this available since most of the work would be done?
>
> On Mon, Apr 15, 2019 at 7:50 AM Li Jin  wrote:
>
>> Thank you Chris, this looks great.
>>
>> Would you mind sharing a google doc version of the proposal? I believe
>> that's the preferred way of discussing proposals (Other people please
>> correct me if I am wrong).
>>
>> Li
>>
>> On Mon, Apr 15, 2019 at 8:20 AM  wrote:
>>
>>> Hi,
>>>
>>>  As promised I’ve raised SPARK-27463 for this.
>>>
>>> All feedback welcome!
>>>
>>> Chris
>>>
>>> On 9 Apr 2019, at 13:22, Chris Martin  wrote:
>>>
>>> Thanks Bryan and Li, that is much appreciated.  Hopefully should
>>> have the SPIP ready in the next couple of days.
>>>
>>> thanks,
>>>
>>> Chris
>>>
>>>
>>>
>>>
>>> On Mon, Apr 8, 2019 at 7:18 PM Bryan Cutler 
>>> wrote:
>>>
 Chris, an SPIP sounds good to me. I agree with Li that it wouldn't
 be too difficult to extend the currently functionality to transfer 
 multiple
 DataFrames.  For the SPIP, I would keep it more high-level and I don't
 think it's necessary to include details of the Python worker, we can 
 hash
 that out after the SPIP is approved.

 Bryan

 On Mon, Apr 8, 2019 at 10:43 AM Li Jin 
 wrote:

> Thanks Chris, look forward to it.
>
> I think sending multiple dataframes to the python worker requires
> some changes but shouldn't be too difficult. We can probably do sth like:
>
>
> [numberOfDataFrames][FirstDataFrameInArrowFormat][SecondDataFrameInArrowFormat]
>
> In:
> https://github.com/apache/spark/blob/86d469aeaa492c0642db09b27bb0879ead5d7166/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala#L70
>
> And have ArrowPythonRunner take multiple input iterator/schema.
>
> Li
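The framing Li describes above can be sketched with stdlib Python. Real Arrow IPC streams are self-describing, so the explicit per-frame length prefix here is purely illustrative, and plain bytes stand in for Arrow-formatted dataframes.

```python
import io
import struct

def write_frames(payloads):
    """[numberOfDataFrames][len][bytes][len][bytes]... (illustrative framing)."""
    buf = io.BytesIO()
    buf.write(struct.pack(">i", len(payloads)))  # count prefix
    for p in payloads:
        buf.write(struct.pack(">i", len(p)))     # length prefix per frame
        buf.write(p)
    return buf.getvalue()

def read_frames(data):
    buf = io.BytesIO(data)
    (count,) = struct.unpack(">i", buf.read(4))
    frames = []
    for _ in range(count):
        (length,) = struct.unpack(">i", buf.read(4))
        frames.append(buf.read(length))
    return frames

wire = write_frames([b"first-df-arrow-bytes", b"second-df-arrow-bytes"])
print(read_frames(wire) == [b"first-df-arrow-bytes", b"second-df-arrow-bytes"])
```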
>
>
> On Mon, Apr 8, 2019 at 5:55 AM  wrote:
>
>> Hi,
>>
>> Just to say, I really do think this is useful and am currently
>> working on a SPIP to form

FW: JDK vs JRE in Docker Images

2019-04-18 Thread Rob Vesse
Sean

Thanks for the pointers.

Janino specifically says it only requires a JRE - 
https://janino-compiler.github.io/janino/#requirements

As for scalac, I can't find a specific reference anywhere; it appears to be 
self-contained AFAICT

Rob

On 17/04/2019, 18:56, "Sean Owen"  wrote:

I confess I don't know, but I don't think scalac or janino need javac
and related tools, and those are the only things that come to mind. If
the tests pass without a JDK, that's good evidence.

On Wed, Apr 17, 2019 at 8:49 AM Rob Vesse  wrote:
>
> Folks
>
>
>
> For those using the Kubernetes support and building custom images are 
you using a JDK or a JRE in the container images?
>
>
>
> Using a JRE saves a reasonable chunk of image size (about 50MB with 
our preferred Linux distro) but I didn’t want to make this change if there was 
a reason to have a JDK available.  Certainly the official project integration 
tests run just fine with a JRE based image
>
>
>
> Currently the project's official Docker files use openjdk:8-alpine as 
a base, which includes a full JDK, so I didn't know if that was intentional 
or just convenience?
>
>
>
> Thanks,
>
>
>
> Rob
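A hedged sketch of the JRE-based variant under discussion. The image tag is illustrative; openjdk:8-jre-alpine existed on Docker Hub at the time, but verify availability before relying on it.

```dockerfile
# The official files used openjdk:8-alpine (a full JDK as the base).
# A JRE base trims roughly 50MB, per the figures quoted in this thread:
FROM openjdk:8-jre-alpine

# ... rest of the Spark image build (copy jars, entrypoint, etc.) unchanged ...
```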

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
