Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-08 Thread Hyukjin Kwon
The reason is that it is not 100% clear whether the root cause of the Sphinx bug is Python 2 and whether the workaround is to use Python 3. Xiangrui opened a bug against Sphinx: https://github.com/sphinx-doc/sphinx/issues/5142 Here is my observation: - Sphinx seems to have a bug in that it does not respect

Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-08 Thread Saisai Shao
Thanks @Hyukjin Kwon. Yes, I'm using Python 2 to build the docs; it looks like Python 2 with Sphinx has issues. What is still pending for this PR (https://github.com/apache/spark/pull/21659)? I'm planning to cut RC2 once it is merged. Do you have an ETA for this PR? Hyukjin Kwon on Mon, Jul 9, 2018

Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-08 Thread Saisai Shao
Hi Sean, SPARK-24530 is not included in this RC1 release. Actually, I'm not so familiar with this issue, so I'm still using Python 2 to generate the docs. The JIRA mentions that Python 3 with Sphinx could work around this issue. @Hyukjin Kwon, would you please help to clarify? Thanks Saisai Xiao Li

Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-08 Thread Xiao Li
Three business days might be too short. Shall we keep the vote open until the end of this Friday (July 13th)? Cheers, Xiao 2018-07-08 10:15 GMT-07:00 Sean Owen: > Just checking that the doc issue in https://issues.apache.org/jira/browse/SPARK-24530 is worked around in this release? > > This was

Re: [SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

2018-07-08 Thread Reynold Xin
Yes I would just reuse the same function. On Sun, Jul 8, 2018 at 5:01 AM Li Jin wrote: > Hi Linar, > > This seems useful. But perhaps reusing the same function name is better? > > > http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame > >

Re: [DESIGN] Barrier Execution Mode

2018-07-08 Thread Reynold Xin
Xingbo, Please reference the SPIP and JIRA ticket next time: [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark On Sun, Jul 8, 2018 at 9:45 AM Xingbo Jiang wrote: > Hi All, > > I would like to invite you to review the design document for Barrier > Execution Mode: > >

Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-08 Thread Sean Owen
Just checking that the doc issue in https://issues.apache.org/jira/browse/SPARK-24530 is worked around in this release? This was pointed out as an example of a broken doc: https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression Here it is in

[DESIGN] Barrier Execution Mode

2018-07-08 Thread Xingbo Jiang
Hi All, I would like to invite you to review the design document for Barrier Execution Mode: https://docs.google.com/document/d/1GvcYR6ZFto3dOnjfLjZMtTezX0W5VYN9w1l4-tQXaZk/edit# TL;DR: We announced Project Hydrogen at the recent Spark+AI Summit; a major part of the project involves significant

Re: [SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

2018-07-08 Thread Li Jin
Hi Linar, This seems useful. But perhaps reusing the same function name is better? http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame Currently createDataFrame takes an RDD of any kind of SQL data representation (e.g. row, tuple, int, boolean,

[SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

2018-07-08 Thread Linar Savion
We've created a snippet that creates a Spark DF from an RDD of many pandas DFs in a distributed manner, without requiring the driver to collect the entire dataset. Early tests show a 6x-10x performance improvement over using pandasDF->Rows->sparkDF. I've seen that there are some open pull

[VOTE] SPARK 2.3.2 (RC1)

2018-07-08 Thread Saisai Shao
Please vote on releasing the following candidate as Apache Spark version 2.3.2. The vote is open until July 11th PST and passes if a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.3.2 [ ] -1 Do not release this package because ... To