GitHub user zjffdu opened a pull request:
https://github.com/apache/spark/pull/13598
[SPARK-13587] [PYSPARK] Support virtualenv in pyspark
## What changes were proposed in this pull request?
Support virtualenv in pyspark as described in SPARK-13587
## How was this patch tested?
Manually verified
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zjffdu/spark virtualenv
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13598.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13598
----
commit 3b010bb24293a0732366d17af2b6abc09a1e31c0
Author: Jeff Zhang <[email protected]>
Date: 2016-06-07T12:15:50Z
[SPARK-15803][PYSPARK] Support with statement syntax for SparkSession
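A minimal sketch of the usage this commit enables (the app name is illustrative): a SparkSession can be used in a `with` statement, and the session is stopped when the block exits.
```python
from pyspark.sql import SparkSession

# Entering the block creates/reuses a session; exiting it calls spark.stop().
with SparkSession.builder.appName("with-demo").getOrCreate() as spark:
    spark.range(10).show()
```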
commit 42d586ca38794c5c30e97fdfd955ba4f1f9f3c45
Author: Jeff Zhang <[email protected]>
Date: 2016-01-29T03:43:24Z
temp save
commit 95dc0fa3007d644af75fe1c28586c01212e0aae6
Author: Jeff Zhang <[email protected]>
Date: 2016-02-01T07:54:54Z
change it to Java 7 style
commit 2f416d4e1e7011f0d8c841bb518e16b964b28cc3
Author: Jeff Zhang <[email protected]>
Date: 2016-02-01T08:14:03Z
minor fix
commit 88e84f4e543d7f9f30e487c90697a8d292ea9acd
Author: Jeff Zhang <[email protected]>
Date: 2016-02-02T01:42:53Z
fix shebang line limitation
commit e596565c6eaf35db49a9e289cb7b239414d22c4d
Author: Jeff Zhang <[email protected]>
Date: 2016-02-02T03:55:38Z
minor refactoring
commit 00b6f8286c5d537c0e287a440464efcae03de1f9
Author: Sandeep Singh <[email protected]>
Date: 2016-06-08T13:51:00Z
[MINOR] Fix Java Lint errors introduced by #13286 and #13280
## What changes were proposed in this pull request?
revived #13464
Fix Java Lint errors introduced by #13286 and #13280
Before:
```
Using `mvn` from path: /Users/pichu/Project/spark/build/apache-maven-3.3.9/bin/mvn
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[340,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[341,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[342,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[343,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[61,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.PrimitiveType.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[62,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.Type.
```
## How was this patch tested?
ran `dev/lint-java` locally
Author: Sandeep Singh <[email protected]>
Closes #13559 from techaddict/minor-3.
commit cf161fa7373aa13691dd1186f8180fb931dcc1ba
Author: Eric Liang <[email protected]>
Date: 2016-06-08T23:21:41Z
[SPARK-15735] Allow specifying min time to run in microbenchmarks
## What changes were proposed in this pull request?
This makes microbenchmarks run for at least 2 seconds by default, to allow
some time for JIT compilation to kick in.
## How was this patch tested?
Tested manually with existing microbenchmarks. This change is backwards
compatible in that existing microbenchmarks which specified numIters per-case
will still run exactly that number of iterations. Microbenchmarks which
previously overrode defaultNumIters now override minNumIters.
cc hvanhovell
Author: Eric Liang <[email protected]>
Author: Eric Liang <[email protected]>
Closes #13472 from ericl/spark-15735.
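A plain-Python sketch of the idea (not Spark's internal Benchmark class; the names and defaults here are assumptions): repeat a case until a minimum amount of wall-clock time has elapsed so warm-up effects are amortized, while still honoring an explicit iteration count when one is given.
```python
import time

def run_case(func, min_seconds=2.0, num_iters=None):
    # Run `func` repeatedly: exactly num_iters times if given, otherwise until
    # at least min_seconds of total wall-clock time has been spent.
    iters, elapsed = 0, 0.0
    while True:
        start = time.perf_counter()
        func()
        elapsed += time.perf_counter() - start
        iters += 1
        if num_iters is not None:
            if iters >= num_iters:        # an explicit per-case count still wins
                break
        elif elapsed >= min_seconds:      # otherwise run for at least min_seconds
            break
    return elapsed / iters                # average seconds per iteration

print(run_case(lambda: sum(range(1_000_000))))
```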
commit 1772ef265308ebc449f3bfd3eb9791254815e4f8
Author: prabs <[email protected]>
Date: 2016-06-08T16:22:55Z
[DOCUMENTATION] Fixed target JAR path
## What changes were proposed in this pull request?
The Scala version mentioned in the sbt configuration file is 2.11, so the path
of the target JAR should be `/target/scala-2.11/simple-project_2.11-1.0.jar`
## How was this patch tested?
n/a
Author: prabs <[email protected]>
Author: Prabeesh K <[email protected]>
Closes #13554 from prabeesh/master.
commit 2ab6242820b97d13c2ed205f02ef31024c26701c
Author: Wenchen Fan <[email protected]>
Date: 2016-06-09T05:47:29Z
[SPARK-14670] [SQL] allow updating driver side sql metrics
## What changes were proposed in this pull request?
The Spark UI currently has a SQL tab that displays accumulator values per
operator. However, it only displays metrics updated on the executors, not on
the driver. It is useful to also include driver-side metrics, e.g. broadcast
time.
This is a different version from
https://github.com/apache/spark/pull/12427. This PR sends driver-side
accumulator updates through a new event as soon as the update happens, rather
than at the end of execution.
## How was this patch tested?
new test in `SQLListenerSuite`

Author: Wenchen Fan <[email protected]>
Closes #13189 from cloud-fan/metrics.
commit 87941dd0ff982b475655e3b2c8ccb1073cbd5b6f
Author: Sandeep Singh <[email protected]>
Date: 2016-06-09T06:41:29Z
[MINOR][DOC] In Dataset docs, remove self link to Dataset and add link to
Column
## What changes were proposed in this pull request?
Documentation Fix
## How was this patch tested?
Author: Sandeep Singh <[email protected]>
Closes #13567 from techaddict/minor-4.
commit 55d8b1326454b695c2bb8f6991ef76de27d8af56
Author: Josh Rosen <[email protected]>
Date: 2016-06-09T07:51:24Z
[SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty
.m2 cache
This patch fixes a bug in `./dev/test-dependencies.sh` which caused
spurious failures when the script was run on a machine with an empty `.m2`
cache. The problem was that extra log output from the dependency download was
conflicting with the grep / regex used to identify the classpath in the Maven
output. This patch fixes this issue by adjusting the regex pattern.
Tested manually with the following reproduction of the bug:
```
rm -rf ~/.m2/repository/org/apache/commons/
./dev/test-dependencies.sh
```
Author: Josh Rosen <[email protected]>
Closes #13568 from JoshRosen/SPARK-12712.
commit 3203dff5903f3db4a4f9e37f57997fe1768bfce2
Author: Kevin Yu <[email protected]>
Date: 2016-06-09T16:50:09Z
[SPARK-15804][SQL] Include metadata in the toStructType
## What changes were proposed in this pull request?
The helper function 'toStructType' in the AttributeSeq class doesn't include
the metadata when it builds the StructField, which causes the problem reported in
https://issues.apache.org/jira/browse/SPARK-15804
when Spark writes a DataFrame with metadata to the Parquet data source.
The code path: when Spark writes the DataFrame to the Parquet data source
through InsertIntoHadoopFsRelationCommand, it builds the WriteRelation
container and calls the helper function 'toStructType' to create the StructType
containing the StructFields; the metadata should be included there, otherwise
the user-provided metadata is lost.
## How was this patch tested?
added test case in ParquetQuerySuite.scala
Author: Kevin Yu <[email protected]>
Closes #13555 from kevinyu98/spark-15804.
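A hedged PySpark illustration of the user-visible behaviour this fix preserves (the field name, metadata key, and output path are made up): metadata attached to a StructField should still be present after a round trip through Parquet.
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField("id", IntegerType(), nullable=False,
                metadata={"comment": "row id"})])
spark.createDataFrame([(1,), (2,)], schema) \
    .write.mode("overwrite").parquet("/tmp/with_metadata")

# With the fix, the user-provided metadata survives the write.
print(spark.read.parquet("/tmp/with_metadata").schema["id"].metadata)
```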
commit 7ad10ff4ef5965db4b8b428afc994de80f22c98c
Author: Jeff Zhang <[email protected]>
Date: 2016-06-09T16:54:38Z
[SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf" property
## What changes were proposed in this pull request?
Add an `idf` property to `IDFModel` in PySpark.
## How was this patch tested?
add unit test
Author: Jeff Zhang <[email protected]>
Closes #13540 from zjffdu/SPARK-15788.
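A minimal PySpark sketch of the property this commit adds (column names and the toy term-frequency vectors are illustrative):
```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import IDF
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(Vectors.dense([1.0, 2.0]),),
                            (Vectors.dense([0.0, 1.0]),)], ["tf"])
model = IDF(inputCol="tf", outputCol="tfidf").fit(df)
print(model.idf)  # DenseVector of per-term IDF weights
```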
commit 7a3d4f74467f1a4f15b36b133030e6ede2ff7d25
Author: Adam Roberts <[email protected]>
Date: 2016-06-09T09:34:01Z
[SPARK-15818][BUILD] Upgrade to Hadoop 2.7.2
## What changes were proposed in this pull request?
Update the Hadoop version from 2.7.0 to 2.7.2 when the hadoop-2.7 build
profile is used.
## How was this patch tested?
Existing tests
I'd like us to use Hadoop 2.7.2 owing to the Hadoop release notes stating that
Hadoop 2.7.0 is not ready for production use.
https://hadoop.apache.org/docs/r2.7.0/ states:
"Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building
upon the previous stable release 2.6.0.
This release is not yet ready for production use. Production users should
use 2.7.1 release and beyond."
Hadoop 2.7.1 release notes:
"Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building
upon the previous release 2.7.0. This is the next stable release after Apache
Hadoop 2.6.x."
And then Hadoop 2.7.2 release notes:
"Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building
upon the previous stable release 2.7.1."
I've tested that this is OK on Intel hardware with IBM Java 8, so let's test it
with OpenJDK as well; ideally this will be pushed to branch-2.0 and master.
Author: Adam Roberts <[email protected]>
Closes #13556 from a-roberts/patch-2.
commit 7871794d02103bb701c7e440858dac4002a15e9e
Author: Josh Rosen <[email protected]>
Date: 2016-06-09T18:04:08Z
[SPARK-15827][BUILD] Publish Spark's forked sbt-pom-reader to Maven Central
Spark's SBT build currently uses a fork of the sbt-pom-reader plugin but
depends on that fork via an SBT subproject which is cloned from
https://github.com/scrapcodes/sbt-pom-reader/tree/ignore_artifact_id. This
unnecessarily slows down the initial build on fresh machines and is also risky
because the build would break if that GitHub repository were ever changed
or deleted.
In order to address these issues, I have published a pre-built binary of
our forked sbt-pom-reader plugin to Maven Central under the `org.spark-project`
namespace and have updated Spark's build to use that artifact. This published
artifact was built from
https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark, which contains
the contents of ScrapCodes's branch plus an additional patch to configure the
build for artifact publication.
/cc srowen ScrapCodes for review.
Author: Josh Rosen <[email protected]>
Closes #13564 from JoshRosen/use-published-fork-of-pom-reader.
commit 94939397fbf02a7d466e4a3fb06bbc21d42841e0
Author: Josh Rosen <[email protected]>
Date: 2016-06-09T19:32:29Z
[SPARK-15839] Fix Maven doc-jar generation when JAVA_7_HOME is set
## What changes were proposed in this pull request?
It looks like the nightly Maven snapshots broke after we set `JAVA_7_HOME`
in the build:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/1573/.
It seems that passing `-javabootclasspath` to ScalaDoc using
scala-maven-plugin ends up preventing the Scala library classes from being
added to scalac's internal class path, causing compilation errors while
building doc-jars.
There might be a principled fix to this inside of the scala-maven-plugin
itself, but for now this patch configures the build to omit the
`-javabootclasspath` option during Maven doc-jar generation.
## How was this patch tested?
Tested manually with `build/mvn clean install -DskipTests=true` when
`JAVA_7_HOME` was set. Also manually inspected the effective POM diff to verify
that the final POM changes were scoped correctly:
https://gist.github.com/JoshRosen/f889d1c236fad14fa25ac4be01654653
/cc vanzin and yhuai for review.
Author: Josh Rosen <[email protected]>
Closes #13573 from JoshRosen/SPARK-15839.
commit a496f18f1bdc127a70884aebe21e679755043fe2
Author: Herman van Hovell <[email protected]>
Date: 2016-06-09T23:37:18Z
[SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date
functions
## What changes were proposed in this pull request?
The current implementations of `UnixTime` and `FromUnixTime` do not cache
their parser/formatter as much as they could. This PR resolves that issue.
It is a takeover from https://github.com/apache/spark/pull/13522 and
further optimizes the re-use of the parser/formatter. It also improves error
handling (catching the actual exception instead of `Throwable`). All credit
for this work should go to rajeshbalamohan.
This PR closes https://github.com/apache/spark/pull/13522
## How was this patch tested?
Current tests.
Author: Herman van Hovell <[email protected]>
Author: Rajesh Balamohan <[email protected]>
Closes #13581 from hvanhovell/SPARK-14321.
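To make the caching idea concrete, a plain-Python sketch (not the Scala `UnixTime`/`FromUnixTime` code; the format string and class name are made up): the parser configuration is set up once per expression and reused for every row, and only the specific parse error is caught.
```python
from datetime import datetime, timezone

class UnixTimeLike:
    def __init__(self, fmt="%Y-%m-%d %H:%M:%S"):
        self._fmt = fmt  # configured once, reused across all rows

    def eval(self, text):
        try:
            dt = datetime.strptime(text, self._fmt)
            return int(dt.replace(tzinfo=timezone.utc).timestamp())
        except ValueError:  # catch the specific parse error, not everything
            return None

print(UnixTimeLike().eval("1970-01-02 00:00:00"))  # 86400
```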
commit 8bcc187ffaba7ee8856eac127c1f8a6475107922
Author: jerryshao <[email protected]>
Date: 2016-06-10T00:31:19Z
[SPARK-12447][YARN] Only update the states when executor is successfully
launched
The details are described in
https://issues.apache.org/jira/browse/SPARK-12447.
vanzin Please help to review, thanks a lot.
Author: jerryshao <[email protected]>
Closes #10412 from jerryshao/SPARK-12447.
commit a5ae7865945a75aad29ff737ee09064fae6b4522
Author: Prashant Sharma <[email protected]>
Date: 2016-06-10T00:45:37Z
[SPARK-15841][Tests] REPLSuite has incorrect env set for a couple of tests.
Description from JIRA:
In ReplSuite, a test that can be exercised well in plain local mode should not
have to start a local-cluster. Conversely, a test meant to cover a problem
specific to a distributed run is insufficiently exercised if it only runs in
local mode.
Existing tests.
Author: Prashant Sharma <[email protected]>
Closes #13574 from ScrapCodes/SPARK-15841/repl-suite-fix.
commit cfa6510a8bebb3b5e8e32e9e35aeb2736f90daef
Author: Eric Liang <[email protected]>
Date: 2016-06-10T01:05:16Z
[SPARK-15794] Should truncate toString() of very wide plans
## What changes were proposed in this pull request?
With very wide tables, e.g. thousands of fields, the plan output is
unreadable and often causes OOMs due to inefficient string processing. This
change truncates all struct and operator field lists to a user-configurable
threshold to limit the performance impact.
It would also be nice to optimize string generation to avoid this sort of
O(n^2) slowdown entirely (i.e. use StringBuilder everywhere, including
expressions), but that is probably too large a change for 2.0 at this point,
and truncation has other benefits for usability.
## How was this patch tested?
Added a microbenchmark that covers this case particularly well. I also ran
the microbenchmark while varying the truncation threshold.
```
numFields = 5
wide shallowly nested struct field r/w:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)             2336 / 2558        0.0      23364.4      0.1X

numFields = 25
wide shallowly nested struct field r/w:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)             4237 / 4465        0.0      42367.9      0.1X

numFields = 100
wide shallowly nested struct field r/w:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)           10458 / 11223        0.0     104582.0      0.0X

numFields = Infinity
wide shallowly nested struct field r/w:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------
[info] java.lang.OutOfMemoryError: Java heap space
```
Author: Eric Liang <[email protected]>
Author: Eric Liang <[email protected]>
Closes #13537 from ericl/truncated-string.
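As a rough sketch of the truncation being applied (plain Python; the threshold and the "... N more fields" wording are assumptions, not the exact Spark output):
```python
def truncated_string(fields, max_fields=25, sep=", "):
    # Render at most max_fields entries and summarise the rest,
    # so toString stays bounded for very wide schemas.
    if len(fields) <= max_fields:
        return sep.join(fields)
    shown = fields[:max_fields - 1]
    return sep.join(shown) + sep + "... %d more fields" % (len(fields) - len(shown))

print(truncated_string(["col%d" % i for i in range(2000)], max_fields=5))
```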
commit 654b89f52d18556f9ef1df920b962155e308538d
Author: Shixiong Zhu <[email protected]>
Date: 2016-06-10T01:45:19Z
[SPARK-15853][SQL] HDFSMetadataLog.get should close the input stream
## What changes were proposed in this pull request?
This PR closes the input stream created in `HDFSMetadataLog.get`
## How was this patch tested?
Jenkins unit tests.
Author: Shixiong Zhu <[email protected]>
Closes #13583 from zsxwing/leak.
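A generic Python illustration of the pattern being fixed (not the Scala `HDFSMetadataLog` code itself): the input stream has to be closed even when deserialization fails, which a context manager or try/finally guarantees.
```python
def get(path, deserialize):
    # The stream is closed on success and on failure alike.
    with open(path, "rb") as stream:
        return deserialize(stream.read())
```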
commit 6f5158092cba6b7160a00a9ce10d16ff0e178f02
Author: Reynold Xin <[email protected]>
Date: 2016-06-10T01:58:24Z
[SPARK-15850][SQL] Remove function grouping in SparkSession
## What changes were proposed in this pull request?
SparkSession does not have that many functions due to better namespacing,
and as a result we probably don't need the function grouping. This patch
removes the grouping and also adds missing scaladocs for createDataset
functions in SQLContext.
Closes #13577.
## How was this patch tested?
N/A - this is a documentation change.
Author: Reynold Xin <[email protected]>
Closes #13582 from rxin/SPARK-15850.
commit 5cf4ae4bee638e2aed8e6722057a8d32909b056e
Author: Eric Liang <[email protected]>
Date: 2016-06-10T05:28:31Z
[SPARK-15791] Fix NPE in ScalarSubquery
## What changes were proposed in this pull request?
The fix is pretty simple, just don't make the executedPlan transient in
`ScalarSubquery` since it is referenced at execution time.
## How was this patch tested?
I verified the fix manually in non-local mode. It's not clear to me why the
problem did not manifest in local mode; any suggestions?
cc davies
Author: Eric Liang <[email protected]>
Closes #13569 from ericl/fix-scalar-npe.
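For readers unfamiliar with the failure mode, a loose Python analogue (pickle standing in for Java serialization; class and field names are made up): a field excluded from serialization comes back empty on the remote side, so dereferencing it at execution time fails.
```python
import pickle

class SubqueryLike:
    def __init__(self, executed_plan):
        self.executed_plan = executed_plan

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("executed_plan")     # the "@transient" field is not shipped
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.executed_plan = None      # missing on the deserialized copy

remote = pickle.loads(pickle.dumps(SubqueryLike("Scan t")))
print(remote.executed_plan)  # None -> using it at execution time blows up
```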
commit 95b8af7a15ce6a9f2070b7874d72957316cbae02
Author: Dongjoon Hyun <[email protected]>
Date: 2016-06-10T05:46:51Z
[SPARK-15696][SQL] Improve `crosstab` to have a consistent column order
## What changes were proposed in this pull request?
Currently, `crosstab` returns a DataFrame whose columns are in a
**random order**, obtained by just `distinct`. Also, the documentation of
`crosstab` shows the result in sorted order, which differs from the current
implementation. This PR explicitly constructs the columns in sorted order to
improve the user experience, and it makes the implementation match the
documentation's output.
**Before**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value| 3| 2| 1|
+---------+---+---+---+
| 2| 1| 0| 2|
| 1| 0| 1| 1|
| 3| 1| 1| 0|
+---------+---+---+---+
scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value| c| a| b|
+---------+---+---+---+
| 2| 1| 2| 0|
| 1| 0| 1| 1|
| 3| 1| 0| 1|
+---------+---+---+---+
```
**After**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value| 1| 2| 3|
+---------+---+---+---+
| 2| 2| 0| 1|
| 1| 1| 1| 0|
| 3| 0| 1| 1|
+---------+---+---+---+
scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value| a| b| c|
+---------+---+---+---+
| 2| 2| 0| 1|
| 1| 1| 1| 0|
| 3| 0| 1| 1|
+---------+---+---+---+
```
## How was this patch tested?
Passes the Jenkins tests with updated test cases.
Author: Dongjoon Hyun <[email protected]>
Closes #13436 from dongjoon-hyun/SPARK-15696.
commit 1cd4dde81fd95d4733e902908e7957a592f335a2
Author: Jeff Zhang <[email protected]>
Date: 2016-02-03T07:22:26Z
fix cache_dir issue
----