Re: PySpark syntax vs Pandas syntax

2019-03-25 Thread Reynold Xin
We have some early stuff there but not quite ready to talk about it in public yet (I hope soon though). Will shoot you a separate email on it. On Mon, Mar 25, 2019 at 11:32 PM Abdeali Kothari wrote: > Thanks for the reply Reynold - Has this shim project started? > I'd love to contribute to it -

Re: PySpark syntax vs Pandas syntax

2019-03-25 Thread Abdeali Kothari
Thanks for the reply Reynold - Has this shim project started? I'd love to contribute to it - as it looks like I have started making a bunch of helper functions to do something similar for my current task and would prefer not to do it in isolation. Was considering making a git repo and pushing stuf

Re: PySpark syntax vs Pandas syntax

2019-03-25 Thread Reynold Xin
We have been thinking about some of these issues. Some of them are harder to do, e.g. Spark DataFrames are fundamentally immutable, and making the logical plan mutable is a significant deviation from the current paradigm that might confuse the hell out of some users. We are considering building a s

PySpark syntax vs Pandas syntax

2019-03-25 Thread Abdeali Kothari
Hi, I was doing some Spark-to-pandas (and vice versa) conversion because some of the pandas code we have doesn't work on huge data, and some Spark code runs very slowly on small data. It was nice to see that pyspark had some similar syntax for the common pandas operations that the python community i

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xingbo Jiang
+1 on the updated SPIP. On Tue, Mar 26, 2019 at 1:32 PM Xingbo Jiang wrote: > Hi all, > > Now that we have had a few discussions over the updated SPIP, we have also updated > the SPIP to address new feedback from some committers. IMO the SPIP is > ready for another round of voting now. > On the updated SPIP, we currently

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xingbo Jiang
Hi all, Now that we have had a few discussions over the updated SPIP, we have also updated the SPIP to address new feedback from some committers. IMO the SPIP is ready for another round of voting now. On the updated SPIP, we currently have two +1s (from Tom and Xiangrui); everyone else, please vote again. Th

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread Xiao Li
Thanks, DB! The Hive UDAF fix https://github.com/apache/spark/commit/0cfefa7e864f443cfd76cff8c50617a8afd080fb was merged this weekend. Xiao On Mon, Mar 25, 2019 at 9:46 PM DB Tsai wrote: > RC9 was just cut. Will send out another thread once the build is finished. > > Sincerely, > > DB Tsai >

Re: [DISCUSS] Spark Columnar Processing

2019-03-25 Thread Wenchen Fan
Do you have some initial perf numbers? It seems fine to me to remain row-based inside Spark with whole-stage-codegen, and convert rows to columnar batches when communicating with external systems. On Mon, Mar 25, 2019 at 1:05 PM Bobby Evans wrote: > This thread is to discuss adding in support fo

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread Reynold Xin
At some point we should celebrate having the largest RC number ever in Spark ... On Mon, Mar 25, 2019 at 9:44 PM, DB Tsai < dbt...@dbtsai.com.invalid > wrote: > RC9 was just cut. Will send out another thread once the build is finished. > > Sincerely, > > DB Tsai > ---

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread DB Tsai
RC9 was just cut. Will send out another thread once the build is finished. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 42E5B25A8F7A82C1 On Mon, Mar 25, 2019 at 5:10 PM Sean Owen wrote: > > That's all merged now. I think y

Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-25 Thread Reynold Xin
+1 on doing this in 3.0. On Mon, Mar 25, 2019 at 9:31 PM, Felix Cheung < felixcheun...@hotmail.com > wrote: > I’m +1 if 3.0 > > *From:* Sean Owen < sro...@gmail.com > > *Sent:* Monday, March 25, 2019 6:48 PM > *To:* Hyukjin Kwon > *Cc:* dev; Bryan Cutler; Tak

Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-25 Thread Felix Cheung
I’m +1 if 3.0 From: Sean Owen Sent: Monday, March 25, 2019 6:48 PM To: Hyukjin Kwon Cc: dev; Bryan Cutler; Takuya UESHIN; shane knapp Subject: Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276] I don't know a lot about Arrow here, but seems reasonable

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xiangrui Meng
On Mon, Mar 25, 2019 at 8:07 PM Mark Hamstra wrote: > Maybe. > > And I expect that we will end up doing something based on spark.task.cpus > in the short term. I'd just rather that this SPIP not make it look like > this is the way things should ideally be done. I'd prefer that we be quite > expli

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
Maybe. And I expect that we will end up doing something based on spark.task.cpus in the short term. I'd just rather that this SPIP not make it look like this is the way things should ideally be done. I'd prefer that we be quite explicit in recognizing that this approach is a significant compromise

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xiangrui Meng
There are certainly use cases where different stages require different numbers of CPUs or GPUs under an optimal setting. I don't think anyone disagrees that ideally users should be able to do it. We are just dealing with typical engineering trade-offs and seeing how we can break the work down into smaller pieces.

Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-25 Thread Sean Owen
I don't know a lot about Arrow here, but seems reasonable. Is this for Spark 3.0 or for 2.x? Certainly, requiring the latest for Spark 3 seems right. On Mon, Mar 25, 2019 at 8:17 PM Hyukjin Kwon wrote: > > Hi all, > > We really need to upgrade the minimal version soon. It's actually slowing > do

Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-25 Thread shane knapp
thanks for the heads up... i'll test deploy this tomorrow and see what gotchas turn up. we may need to upgrade from python 3.4 to 3.5 IIRC. On Mon, Mar 25, 2019 at 6:16 PM Hyukjin Kwon wrote: > Hi all, > > We really need to upgrade the minimal version soon. It's actually slowing > down the PyS

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
I remain unconvinced that a default configuration at the application level makes sense even in that case. There may be some applications where you know a priori that almost all the tasks for all the stages for all the jobs will need some fixed number of gpus; but I think the more common cases will

Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-25 Thread Hyukjin Kwon
Hi all, We really need to upgrade the minimal version soon. It's actually slowing down the PySpark dev, for instance, through the overhead of sometimes needing to test the full matrix of Arrow and Pandas versions. Also, it currently requires adding some weird hacks or ugly code. Some bugs exist

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread Sean Owen
That's all merged now. I think you're clear to start an RC. On Mon, Mar 25, 2019 at 4:06 PM DB Tsai wrote: > > I am going to cut a 2.4.1 rc9 soon tonight. Besides SPARK-26961 > https://github.com/apache/spark/pull/24126 , anything critical that we > have to wait for 2.4.1 release? Thanks! > > Sin

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread Sean Owen
This last test failed again, but, I claim we've actually seen it pass: https://github.com/apache/spark/pull/24126#issuecomment-476410462 Would anybody else endorse merging it into 2.4 to proceed? I'll kick off one more test for good measure. On Mon, Mar 25, 2019 at 4:33 PM Sean Owen wrote: > > Don

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xiangrui Meng
Even if we support per-task resource requests in the future, it would still be inconvenient for users to declare the resource requirements for every single task/stage. So there must be some default values defined somewhere for task resource requirements. "spark.task.cpus" and "spark.task.accelerator

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread Sean Owen
Don't wait on this, but, I was going to slip in a message in the 2.4.1 docs saying that Scala 2.11 support is deprecated, as it will be gone in Spark 3. I'll bang that out right now. Still waiting on a clean test build for that last JIRA, but maybe about to happen. On Mon, Mar 25, 2019 at 4:06 PM

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread DB Tsai
I am going to cut a 2.4.1 rc9 soon tonight. Besides SPARK-26961 https://github.com/apache/spark/pull/24126 , anything critical that we have to wait for 2.4.1 release? Thanks! Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 42E5

[DISCUSS] Spark Columnar Processing

2019-03-25 Thread Bobby Evans
This thread is to discuss adding in support for data frame processing using an in-memory columnar format compatible with Apache Arrow. My main goal in this is to lay the groundwork so we can add in support for GPU accelerated processing of data frames, but this feature has a number of other benefi

Re: Scala 2.11 support removed for Spark 3.0.0

2019-03-25 Thread Darcy Shen
Cool, Scala 2.12 compiles faster than Scala 2.11. But it runs slower than Scala 2.11 by default. We may enable some compiler optimization options. On Mon, 25 Mar 2019 23:53:18 +0800 Sean Owen wrote: I merged https://github.com/apache/spark/pull/23098 .

Scala 2.11 support removed for Spark 3.0.0

2019-03-25 Thread Sean Owen
I merged https://github.com/apache/spark/pull/23098 . "-Pscala-2.11" won't work anymore in master. I think this shouldn't be a surprise or disruptive as 2.12 is already the default. The change isn't big and I think pretty reliable, but keep an eye out for issues. Shane you are welcome to remove t

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
Of course there is an issue of the perfect becoming the enemy of the good, so I can understand the impulse to get something done. I am left wanting, however, at least something more of a roadmap to a task-level future than just a vague "we may choose to do something more in the future." At the risk

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Tom Graves
+1 on the updated SPIP. Tom On Monday, March 18, 2019, 12:56:22 PM CDT, Xingbo Jiang wrote: Hi all, I updated the SPIP doc and stories, I hope it now contains a clear scope of the changes and enough details for the SPIP vote. Please review the updated docs, thanks! Xiangrui Meng on Wed, Mar 6, 2019