Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-18 Thread Yinan Li
FYI: SPARK-23200 has been resolved.

On Tue, Sep 18, 2018 at 8:49 AM Felix Cheung 
wrote:

> If we could work on this quickly - it might get on to future RCs.
>
>
>
> --
> *From:* Stavros Kontopoulos 
> *Sent:* Monday, September 17, 2018 2:35 PM
> *To:* Yinan Li
> *Cc:* Xiao Li; eerla...@redhat.com; van...@cloudera.com.invalid; Sean
> Owen; Wenchen Fan; dev
> *Subject:* Re: [VOTE] SPARK 2.4.0 (RC1)
>
> Hi Xiao,
>
> I just tested it, it seems ok. There are some questions about which
> properties we should keep when restoring the config. Otherwise it looks ok
> to me.
> The reason this should go in 2.4 is that streaming on k8s is something
> people want to try day one (or at least it is cool to try) and since 2.4
> comes with k8s support being refactored a lot,
> it would be disappointing not to have it in...IMHO.
>
> Best,
> Stavros
>
> On Mon, Sep 17, 2018 at 11:13 PM, Yinan Li  wrote:
>
>> We can merge the PR and get SPARK-23200 resolved if the whole point is to
>> make streaming on k8s work first. But given that this is not a blocker for
>> 2.4, I think we can take a bit more time here and get it right. With that
>> being said, I would expect it to be resolved soon.
>>
>> On Mon, Sep 17, 2018 at 11:47 AM Xiao Li  wrote:
>>
>>> Hi, Erik and Stavros,
>>>
>>> This bug fix SPARK-23200 is not a blocker of the 2.4 release. It sounds
>>> important for Streaming on K8S. Could the K8S-oriented committers speed
>>> up the reviews?
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>> Erik Erlandson  wrote on Mon, Sep 17, 2018 at 11:04 AM:
>>>

 I have no binding vote but I second Stavros’ recommendation for
 spark-23200

 Per parallel threads on Py2 support I would also like to propose
 deprecating Py2 starting with this 2.4 release

 On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
  wrote:

> You can log in to https://repository.apache.org and see what's wrong.
> Just find that staging repo and look at the messages. In your case it
> seems related to your signature.
>
> failureMessageNo public key: Key with id: () was not able to be
> located on http://gpg-keyserver.de/. Upload your public key and try
> the operation again.
> On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan 
> wrote:
> >
> > I confirmed that
> https://repository.apache.org/content/repositories/orgapachespark-1285
> is not accessible. I did it via ./dev/create-release/do-release-docker.sh
> -d /my/work/dir -s publish , not sure what's going wrong. I didn't see any
> error message during it.
> >
> > Any insights are appreciated! So that I can fix it in the next RC.
> Thanks!
> >
> > On Mon, Sep 17, 2018 at 11:31 AM Sean Owen 
> wrote:
> >>
> >> I think one build is enough, but haven't thought it through. The
> >> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
> >> best advertised as a 'beta'. So maybe publish a no-hadoop build of
> it?
> >> Really, whatever's the easy thing to do.
> >> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
> wrote:
> >> >
> >> > Ah I missed the Scala 2.12 build. Do you mean we should publish a
> Scala 2.12 build this time? Currently for Scala 2.11 we have 3 builds: with
> hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for
> Scala 2.12?
> >> >
> >> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen 
> wrote:
> >> >>
> >> >> A few preliminary notes:
> >> >>
> >> >> Wenchen for some weird reason when I hit your key in gpg
> --import, it
> >> >> asks for a passphrase. When I skip it, it's fine, gpg can still
> verify
> >> >> the signature. No issue there really.
> >> >>
> >> >> The staging repo gives a 404:
> >> >>
> https://repository.apache.org/content/repositories/orgapachespark-1285/
> >> >> 404 - Repository "orgapachespark-1285 (staging: open)"
> >> >> [id=orgapachespark-1285] exists but is not exposed.
> >> >>
> >> >> The (revamped) licenses are OK, though there are some minor
> glitches
> >> >> in the final release tarballs (my fault) : there's an extra
> directory,
> >> >> and the source release has both binary and source licenses. I'll
> fix
> >> >> that. Not strictly necessary to reject the release over those.
> >> >>
> >> >> Last, when I check the staging repo I'll get my answer, but,
> were you
> >> >> able to build 2.12 artifacts as well?
> >> >>
> >> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan 
> wrote:
> >> >> >
> >> >> > Please vote on releasing the following candidate as Apache
> Spark version 2.4.0.
> >> >> >
> >> >> > The vote is open until September 20 PST and passes if a
> majority +1 PMC votes are cast, with
> >> >> > a minimum of 3 +1 votes.
> >> >> >
> >> >> > [ ] +1 Release this package as Apache Spark 2.4.0
> >> >> 

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-18 Thread Dongjoon Hyun
+1.

I tested with `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver`
on OpenJDK(1.8.0_181)/CentOS 7.5.
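
For reference, a minimal sketch of that kind of run (the source tarball name
below is an assumption; check the v2.3.2-rc6-bin listing for the actual file
names):

    # Fetch the RC source, build, and run tests with the same profiles.
    wget https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/spark-2.3.2.tgz
    tar xzf spark-2.3.2.tgz && cd spark-2.3.2
    ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver \
        -DskipTests clean package
    ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver test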

I hit the following test case failure once during testing, but it's not
persistent.

KafkaContinuousSourceSuite
...
subscribing topic by name from earliest offsets (failOnDataLoss: false)
*** FAILED ***

Thank you, Saisai.

Bests,
Dongjoon.

On Mon, Sep 17, 2018 at 6:48 PM Saisai Shao  wrote:

> +1 from my own side.
>
> Thanks
> Saisai
>
> Wenchen Fan  wrote on Tue, Sep 18, 2018 at 9:34 AM:
>
>> +1. All the blocker issues are resolved in 2.3.2 AFAIK.
>>
>> On Tue, Sep 18, 2018 at 9:23 AM Sean Owen  wrote:
>>
>>> +1. Licenses and sigs check out as in previous 2.3.x releases. A
>>> build from source with most profiles passed for me.
>>> On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao 
>>> wrote:
>>> >
>>> > Please vote on releasing the following candidate as Apache Spark
>>> version 2.3.2.
>>> >
>>> > The vote is open until September 21 PST and passes if a majority +1
>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 2.3.2
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>> >
>>> > The tag to be voted on is v2.3.2-rc6 (commit
>>> 02b510728c31b70e6035ad541bfcdc2b59dcd79a):
>>> > https://github.com/apache/spark/tree/v2.3.2-rc6
>>> >
>>> > The release files, including signatures, digests, etc. can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/
>>> >
>>> > Signatures used for Spark RCs can be found in this file:
>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
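
(For reference, a minimal signature-check sketch; the binary tarball name
below is an assumption, substitute whichever artifact you are verifying.)

    curl -sO https://dist.apache.org/repos/dist/dev/spark/KEYS
    gpg --import KEYS
    # Artifact and .asc names are assumptions; use the files under v2.3.2-rc6-bin/.
    curl -sO https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/spark-2.3.2-bin-hadoop2.7.tgz
    curl -sO https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/spark-2.3.2-bin-hadoop2.7.tgz.asc
    # "Good signature" from a key listed in KEYS means the artifact checks out.
    gpg --verify spark-2.3.2-bin-hadoop2.7.tgz.asc spark-2.3.2-bin-hadoop2.7.tgz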
>>> >
>>> > The staging repository for this release can be found at:
>>> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1286/
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/
>>> >
>>> > The list of bug fixes going into 2.3.2 can be found at the following
>>> URL:
>>> > https://issues.apache.org/jira/projects/SPARK/versions/12343289
>>> >
>>> >
>>> > FAQ
>>> >
>>> > =
>>> > How can I help test this release?
>>> > =
>>> >
>>> > If you are a Spark user, you can help us test this release by taking
>>> > an existing Spark workload and running on this release candidate, then
>>> > reporting any regressions.
>>> >
>>> > If you're working in PySpark you can set up a virtual env and install
>>> > the current RC and see if anything important breaks; in Java/Scala
>>> > you can add the staging repository to your project's resolvers and test
>>> > with the RC (make sure to clean up the artifact cache before/after so
>>> > you don't end up building with an out-of-date RC going forward).
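
(A minimal sketch of the PySpark check above; the pyspark tarball name under
the RC bin directory is an assumption.)

    # Throwaway virtualenv with the RC's pyspark package installed.
    python -m venv spark-rc-env && source spark-rc-env/bin/activate
    pip install https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/pyspark-2.3.2.tar.gz
    # Smoke test: import the RC and run an existing workload (or at least a trivial job).
    python -c "import pyspark; print(pyspark.__version__)"

For the Java/Scala side, the equivalent check is pointing the project's
resolvers at the staging repository URL above (orgapachespark-1286) and
rebuilding against the 2.3.2 RC version.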
>>> >
>>> > ===
>>> > What should happen to JIRA tickets still targeting 2.3.2?
>>> > ===
>>> >
>>> > The current list of open tickets targeted at 2.3.2 can be found at:
>>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 2.3.2
>>> >
>>> > Committers should look at those and triage. Extremely important bug
>>> > fixes, documentation, and API tweaks that impact compatibility should
>>> > be worked on immediately. Everything else please retarget to an
>>> > appropriate release.
>>> >
>>> > ==
>>> > But my bug isn't fixed?
>>> > ==
>>> >
>>> > In order to make timely releases, we will typically not hold the
>>> > release unless the bug in question is a regression from the previous
>>> > release. That being said, if there is something which is a regression
>>> > that has not been correctly targeted please ping me or a committer to
>>> > help target the issue.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-18 Thread Felix Cheung
If we could work on this quickly - it might get on to future RCs.




From: Stavros Kontopoulos 
Sent: Monday, September 17, 2018 2:35 PM
To: Yinan Li
Cc: Xiao Li; eerla...@redhat.com; van...@cloudera.com.invalid; Sean Owen; 
Wenchen Fan; dev
Subject: Re: [VOTE] SPARK 2.4.0 (RC1)

Hi Xiao,

I just tested it, it seems ok. There are some questions about which properties 
we should keep when restoring the config. Otherwise it looks ok to me.
The reason this should go in 2.4 is that streaming on k8s is something people 
want to try day one (or at least it is cool to try) and since 2.4 comes with 
k8s support being refactored a lot,
it would be disappointing not to have it in...IMHO.

Best,
Stavros

On Mon, Sep 17, 2018 at 11:13 PM, Yinan Li <liyinan...@gmail.com> wrote:
We can merge the PR and get SPARK-23200 resolved if the whole point is to make 
streaming on k8s work first. But given that this is not a blocker for 2.4, I 
think we can take a bit more time here and get it right. With that being said, 
I would expect it to be resolved soon.

On Mon, Sep 17, 2018 at 11:47 AM Xiao Li <gatorsm...@gmail.com> wrote:
Hi, Erik and Stavros,

This bug fix SPARK-23200 is not a blocker of the 2.4 release. It sounds 
important for Streaming on K8S. Could the K8S-oriented committers speed up 
the reviews?

Thanks,

Xiao

Erik Erlandson <eerla...@redhat.com> wrote on Mon, Sep 17, 2018 at 11:04 AM:

I have no binding vote but I second Stavros’ recommendation for spark-23200

Per parallel threads on Py2 support I would also like to propose deprecating 
Py2 starting with this 2.4 release

On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin  
wrote:
You can log in to https://repository.apache.org and see what's wrong.
Just find that staging repo and look at the messages. In your case it
seems related to your signature.

failureMessageNo public key: Key with id: () was not able to be
located on http://gpg-keyserver.de/. Upload your public key and try
the operation again.
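
A minimal sketch of that fix (the key id is a placeholder; the keyserver is
the one named in the error message):

    gpg --list-keys                                    # find your key id
    gpg --keyserver hkp://gpg-keyserver.de --send-keys <KEYID>
    # Also make sure the public key is in the KEYS file published at
    # https://dist.apache.org/repos/dist/dev/spark/KEYS
    gpg --armor --export <KEYID>
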
On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
> I confirmed that 
> https://repository.apache.org/content/repositories/orgapachespark-1285 is not 
> accessible. I did it via ./dev/create-release/do-release-docker.sh -d 
> /my/work/dir -s publish , not sure what's going wrong. I didn't see any error 
> message during it.
>
> Any insights are appreciated! So that I can fix it in the next RC. Thanks!
>
> On Mon, Sep 17, 2018 at 11:31 AM Sean Owen <sro...@apache.org> wrote:
>>
>> I think one build is enough, but haven't thought it through. The
>> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
>> best advertised as a 'beta'. So maybe publish a no-hadoop build of it?
>> Really, whatever's the easy thing to do.
>> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>> >
>> > Ah I missed the Scala 2.12 build. Do you mean we should publish a Scala 
>> > 2.12 build this time? Currently for Scala 2.11 we have 3 builds: with hadoop 
>> > 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for Scala 
>> > 2.12?
>> >
>> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen <sro...@apache.org> wrote:
>> >>
>> >> A few preliminary notes:
>> >>
>> >> Wenchen for some weird reason when I hit your key in gpg --import, it
>> >> asks for a passphrase. When I skip it, it's fine, gpg can still verify
>> >> the signature. No issue there really.
>> >>
>> >> The staging repo gives a 404:
>> >> https://repository.apache.org/content/repositories/orgapachespark-1285/
>> >> 404 - Repository "orgapachespark-1285 (staging: open)"
>> >> [id=orgapachespark-1285] exists but is not exposed.
>> >>
>> >> The (revamped) licenses are OK, though there are some minor glitches
>> >> in the final release tarballs (my fault) : there's an extra directory,
>> >> and the source release has both binary and source licenses. I'll fix
>> >> that. Not strictly necessary to reject the release over those.
>> >>
>> >> Last, when I check the staging repo I'll get my answer, but, were you
>> >> able to build 2.12 artifacts as well?
>> >>
>> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>> >> >
>> >> > Please vote on releasing the following candidate as Apache Spark 
>> >> > version 2.4.0.
>> >> >
>> >> > The vote is open until September 20 PST and passes if a majority +1 PMC 
>> >> > votes are cast, with
>> >> > a minimum of 3 +1 votes.
>> >> >
>> >> > [ ] +1 Release this package as Apache Spark 2.4.0
>> >> > [ ] -1 Do not release this package because ...
>> >> >
>> >> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >> >
>> >> > The tag to be voted on is v2.4.0-rc1 (commit 
>> >> > 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
>> >> > https://github.com/apache/spark/tree/v2.4.0-rc1
>> >> >
>> >> > The release files, including signatures, digests, etc. can be found at:
>> >> > 

Re: Python friendly API for Spark 3.0

2018-09-18 Thread Erik Erlandson
I like the notion of empowering cross platform bindings.

The trend of computing frameworks seems to be that all APIs gradually
converge on a stable attractor which could be described as "data frames and
SQL." Spark's early API design was RDD-focused, but these days the center
of gravity is all about DataFrame (Python's prevalence, combined with its
lack of a static type system, substantially dilutes the benefits of Dataset
for any library development that aspires to both JVM and Python support).

I can imagine optimizing the developer layers of Spark APIs so that
cross-platform support, and also 3rd-party support for new and existing Spark
bindings, would be maximized for "parallelizable dataframe+SQL." Another of
Spark's strengths is its ability to federate heterogeneous data sources,
and making cross-platform bindings easy for that is desirable.


On Sun, Sep 16, 2018 at 1:02 PM, Mark Hamstra 
wrote:

> It's not splitting hairs, Erik. It's actually very close to something that
> I think deserves some discussion (perhaps on a separate thread.) What I've
> been thinking about also concerns API "friendliness" or style. The original
> RDD API was very intentionally modeled on the Scala parallel collections
> API. That made it quite friendly for some Scala programmers, but not as
> much so for users of the other language APIs when they eventually came
> about. Similarly, the Dataframe API drew a lot from pandas and R, so it is
> relatively friendly for those used to those abstractions. Of course, the
> Spark SQL API is modeled closely on HiveQL and standard SQL. The new
> barrier scheduling draws inspiration from MPI. With all of these models and
> sources of inspiration, as well as multiple language targets, there isn't
> really a strong sense of coherence across Spark -- I mean, even though one
> of the key advantages of Spark is the ability to do within a single
> framework things that would otherwise require multiple frameworks, actually
> doing that requires more programming styles and design abstractions than are
> strictly necessary, even when writing Spark code in just a single language.
>
> For me, that raises questions over whether we want to start designing,
> implementing and supporting APIs that are designed to be more consistent,
> friendly and idiomatic to particular languages and abstractions -- e.g. an
> API covering all of Spark that is designed to look and feel as much like
> "normal" code for a Python programmer, another that looks and feels more
> like "normal" Java code, another for Scala, etc. That's a lot more work and
> support burden than the current approach where sometimes it feels like you
> are writing "normal" code for your prefered programming environment, and
> sometimes it feels like you are trying to interface with something foreign,
> but underneath it hopefully isn't too hard for those writing the
> implementation code below the APIs, and it is not too hard to maintain
> multiple language bindings that are each fairly lightweight.
>
> It's a cost-benefit judgement, of course, whether APIs that are heavier
> (in terms of implementing and maintaining) and friendlier (for end users)
> are worth doing, and maybe some of these "friendlier" APIs can be done
> outside of Spark itself (imo, Frameless is doing a very nice job for the
> parts of Spark that it is currently covering -- https://github.com/typelevel/frameless);
> but what we have currently is a bit too ad hoc and
> fragmentary for my taste.
>
> On Sat, Sep 15, 2018 at 10:33 AM Erik Erlandson 
> wrote:
>
>> I am probably splitting hairs too finely, but I was considering the
>> difference between improvements to the jvm-side (py4j and the scala/java
>> code) that would make it easier to write the python layer ("python-friendly
>> api"), and actual improvements to the python layers ("friendly python api").
>>
>> They're not mutually exclusive of course, and both worth working on. But
>> it's *possible* to improve either without the other.
>>
>> Stub files look like a great solution for type annotations, maybe even if
>> only python 3 is supported.
>>
>> I definitely agree that any decision to drop python 2 should not be taken
>> lightly. Anecdotally, I'm seeing an increase in python developers
>> announcing that they are dropping support for python 2 (and loving it). As
>> people have already pointed out, if we don't drop python 2 for spark 3.0,
>> we're stuck with it until 4.0, which would place spark in a
>> possibly-awkward position of supporting python 2 for some time after it
>> goes EOL.
>>
>> Under the current release cadence, spark 3.0 will land some time in early
>> 2019, which at that point will be mere months until EOL for py2.
>>
>> On Fri, Sep 14, 2018 at 5:01 PM, Holden Karau 
>> wrote:
>>
>>>
>>>
>>> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson 
>>> wrote:
>>>
 To be clear, is this about "python-friendly API" or "friendly python
 API" ?

>>> Well what would you 

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-18 Thread Wenchen Fan
Thanks Marcelo for pointing out my gpg key issue! I've re-generated it and
uploaded it to the ASF Spark repo. Let's see if it works in the next RC.

Thanks Saisai for pointing out the Python doc issue; I'll fix it in the next RC.

This RC fails because:
1. it doesn't include a Scala 2.12 build
2. the gpg key issue
3. the Python doc issue
4. some other potential blocker issues.

I'll start RC2 once these blocker issues are either resolved or we decide
to mark them as non-blocker.

Thanks,
Wenchen

On Tue, Sep 18, 2018 at 9:48 PM Marco Gaido  wrote:

> Sorry but I am -1 because of what was reported here:
> https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104
> .
> It is a regression unfortunately. Although the impact is not huge and there
> are workarounds, I think we should include the fix in 2.4.0. I created
> SPARK-25454 and submitted a PR for it.
> Sorry for the trouble.
>
> On Tue, Sep 18, 2018 at 05:23 Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> Deprecating Py 2 in the 2.4 release probably doesn't belong in the RC
>> vote thread. Personally I think we might be a little too late in the game
>> to deprecate it in 2.4, but I think calling it out as "soon to be
>> deprecated" in the release docs would be sensible to give folks extra time
>> to prepare.
>>
>> On Mon, Sep 17, 2018 at 2:04 PM Erik Erlandson 
>> wrote:
>>
>>>
>>> I have no binding vote but I second Stavros’ recommendation for
>>> spark-23200
>>>
>>> Per parallel threads on Py2 support I would also like to propose
>>> deprecating Py2 starting with this 2.4 release
>>>
>>> On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
>>>  wrote:
>>>
 You can log in to https://repository.apache.org and see what's wrong.
 Just find that staging repo and look at the messages. In your case it
 seems related to your signature.

 failureMessageNo public key: Key with id: () was not able to be
 located on http://gpg-keyserver.de/. Upload your public key and try
 the operation again.
 On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan 
 wrote:
 >
 > I confirmed that
 https://repository.apache.org/content/repositories/orgapachespark-1285
 is not accessible. I did it via ./dev/create-release/do-release-docker.sh
 -d /my/work/dir -s publish , not sure what's going wrong. I didn't see any
 error message during it.
 >
 > Any insights are appreciated! So that I can fix it in the next RC.
 Thanks!
 >
 > On Mon, Sep 17, 2018 at 11:31 AM Sean Owen  wrote:
 >>
 >> I think one build is enough, but haven't thought it through. The
 >> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
 >> best advertised as a 'beta'. So maybe publish a no-hadoop build of
 it?
 >> Really, whatever's the easy thing to do.
 >> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
 wrote:
 >> >
 >> > Ah I missed the Scala 2.12 build. Do you mean we should publish a
 Scala 2.12 build this time? Currently for Scala 2.11 we have 3 builds: with
 hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for
 Scala 2.12?
 >> >
 >> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen 
 wrote:
 >> >>
 >> >> A few preliminary notes:
 >> >>
 >> >> Wenchen for some weird reason when I hit your key in gpg
 --import, it
 >> >> asks for a passphrase. When I skip it, it's fine, gpg can still
 verify
 >> >> the signature. No issue there really.
 >> >>
 >> >> The staging repo gives a 404:
 >> >>
 https://repository.apache.org/content/repositories/orgapachespark-1285/
 >> >> 404 - Repository "orgapachespark-1285 (staging: open)"
 >> >> [id=orgapachespark-1285] exists but is not exposed.
 >> >>
 >> >> The (revamped) licenses are OK, though there are some minor
 glitches
 >> >> in the final release tarballs (my fault) : there's an extra
 directory,
 >> >> and the source release has both binary and source licenses. I'll
 fix
 >> >> that. Not strictly necessary to reject the release over those.
 >> >>
 >> >> Last, when I check the staging repo I'll get my answer, but, were
 you
 >> >> able to build 2.12 artifacts as well?
 >> >>
 >> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan 
 wrote:
 >> >> >
 >> >> > Please vote on releasing the following candidate as Apache
 Spark version 2.4.0.
 >> >> >
 >> >> > The vote is open until September 20 PST and passes if a
 majority +1 PMC votes are cast, with
 >> >> > a minimum of 3 +1 votes.
 >> >> >
 >> >> > [ ] +1 Release this package as Apache Spark 2.4.0
 >> >> > [ ] -1 Do not release this package because ...
 >> >> >
 >> >> > To learn more about Apache Spark, please see
 http://spark.apache.org/
 >> >> >
 >> >> > The tag to be voted on is 

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-18 Thread Thakrar, Jayesh
Totally agree with you, Dale, that there are situations where, for efficiency,
performance, and better control/visibility/manageability, we need to expose
partition management.

So, as described, I suggested two things: the ability to do it in the current
V2 API form via options, and an appropriate implementation in the datasource
reader/writer.

And for the long term, I suggested that partition management could be made part of
metadata/catalog management - SPARK-24252 (DataSourceV2: Add catalog support)?


On 9/17/18, 8:26 PM, "tigerquoll"  wrote:

Hi Jayesh,
I get where you are coming from - partitions are just an implementation
optimisation that we really shouldn’t be bothering the end user with. 
Unfortunately that view is like saying RPC is like a procedure call, and
details of the network transport should be hidden from the end user. CORBA
tried this approach for RPC and failed for the same reason that no major
vendor of DBMS systems that support partitions try to hide them from the end
user.  They have a substantial real world effect that is impossible to hide
from the user (in particular when writing/modifying the data source).  Any
attempt to “take care” of partitions automatically invariably guesses wrong
and ends up frustrating the end user (as “substantial real world effect”
turns to “show stopping performance penalty” if the user attempts to fight
against a partitioning scheme she has no idea exists)

So if we are not hiding them from the user, we need to allow users to
manipulate them. Either by representing them generically in the API,
allowing pass-through commands to manipulate them, or by some other means.

Regards,
Dale.




--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/





Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-18 Thread Marco Gaido
Sorry but I am -1 because of what was reported here:
https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104
.
It is a regression unfortunately. Although the impact is not huge and there
are workarounds, I think we should include the fix in 2.4.0. I created
SPARK-25454 and submitted a PR for it.
Sorry for the trouble.

On Tue, Sep 18, 2018 at 05:23 Holden Karau  wrote:

> Deprecating Py 2 in the 2.4 release probably doesn't belong in the RC vote
> thread. Personally I think we might be a little too late in the game to
> deprecate it in 2.4, but I think calling it out as "soon to be deprecated"
> in the release docs would be sensible to give folks extra time to prepare.
>
> On Mon, Sep 17, 2018 at 2:04 PM Erik Erlandson 
> wrote:
>
>>
>> I have no binding vote but I second Stavros’ recommendation for
>> spark-23200
>>
>> Per parallel threads on Py2 support I would also like to propose
>> deprecating Py2 starting with this 2.4 release
>>
>> On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
>>  wrote:
>>
>>> You can log in to https://repository.apache.org and see what's wrong.
>>> Just find that staging repo and look at the messages. In your case it
>>> seems related to your signature.
>>>
>>> failureMessageNo public key: Key with id: () was not able to be
>>> located on http://gpg-keyserver.de/. Upload your public key and try
>>> the operation again.
>>> On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan 
>>> wrote:
>>> >
>>> > I confirmed that
>>> https://repository.apache.org/content/repositories/orgapachespark-1285
>>> is not accessible. I did it via ./dev/create-release/do-release-docker.sh
>>> -d /my/work/dir -s publish , not sure what's going wrong. I didn't see any
>>> error message during it.
>>> >
>>> > Any insights are appreciated! So that I can fix it in the next RC.
>>> Thanks!
>>> >
>>> > On Mon, Sep 17, 2018 at 11:31 AM Sean Owen  wrote:
>>> >>
>>> >> I think one build is enough, but haven't thought it through. The
>>> >> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
>>> >> best advertised as a 'beta'. So maybe publish a no-hadoop build of it?
>>> >> Really, whatever's the easy thing to do.
>>> >> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
>>> wrote:
>>> >> >
>>> >> > Ah I missed the Scala 2.12 build. Do you mean we should publish a
>>> Scala 2.12 build this time? Currently for Scala 2.11 we have 3 builds: with
>>> hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for
>>> Scala 2.12?
>>> >> >
>>> >> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen 
>>> wrote:
>>> >> >>
>>> >> >> A few preliminary notes:
>>> >> >>
>>> >> >> Wenchen for some weird reason when I hit your key in gpg --import,
>>> it
>>> >> >> asks for a passphrase. When I skip it, it's fine, gpg can still
>>> verify
>>> >> >> the signature. No issue there really.
>>> >> >>
>>> >> >> The staging repo gives a 404:
>>> >> >>
>>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>>> >> >> 404 - Repository "orgapachespark-1285 (staging: open)"
>>> >> >> [id=orgapachespark-1285] exists but is not exposed.
>>> >> >>
>>> >> >> The (revamped) licenses are OK, though there are some minor
>>> glitches
>>> >> >> in the final release tarballs (my fault) : there's an extra
>>> directory,
>>> >> >> and the source release has both binary and source licenses. I'll
>>> fix
>>> >> >> that. Not strictly necessary to reject the release over those.
>>> >> >>
>>> >> >> Last, when I check the staging repo I'll get my answer, but, were
>>> you
>>> >> >> able to build 2.12 artifacts as well?
>>> >> >>
>>> >> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan 
>>> wrote:
>>> >> >> >
>>> >> >> > Please vote on releasing the following candidate as Apache Spark
>>> version 2.4.0.
>>> >> >> >
>>> >> >> > The vote is open until September 20 PST and passes if a majority
>>> +1 PMC votes are cast, with
>>> >> >> > a minimum of 3 +1 votes.
>>> >> >> >
>>> >> >> > [ ] +1 Release this package as Apache Spark 2.4.0
>>> >> >> > [ ] -1 Do not release this package because ...
>>> >> >> >
>>> >> >> > To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>> >> >> >
>>> >> >> > The tag to be voted on is v2.4.0-rc1 (commit
>>> 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
>>> >> >> > https://github.com/apache/spark/tree/v2.4.0-rc1
>>> >> >> >
>>> >> >> > The release files, including signatures, digests, etc. can be
>>> found at:
>>> >> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-bin/
>>> >> >> >
>>> >> >> > Signatures used for Spark RCs can be found in this file:
>>> >> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >> >> >
>>> >> >> > The staging repository for this release can be found at:
>>> >> >> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>>> >> >> >
>>> >> >> > The documentation corresponding to this release can be