GitHub user tdas opened a pull request:
https://github.com/apache/spark/pull/15908
[SPARK-18459][SPARK-18460][StructuredStreaming] Rename triggerId to batchId
and add triggerDetails to json in StreamingQueryStatus (for branch-2.0)
This is the branch-2.0 fix for the earlier PR #15895
## What changes were proposed in this pull request?
SPARK-18459: triggerId seems like a number that should increase with each trigger, whether or not there is data in it. However, in reality, triggerId increases only when there is a batch of data in a trigger. So it's better to rename it to batchId.
SPARK-18460: triggerDetails was missing from the JSON representation. Fixed it.
## How was this patch tested?
Updated tests
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tdas/spark SPARK-18459-2.0
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15908.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15908
----
commit f3cfce09274741cc04bf2f00e873b3b64976b6d5
Author: Shixiong Zhu <[email protected]>
Date: 2016-09-06T23:49:06Z
[SPARK-17316][CORE] Fix the 'ask' type parameter in 'removeExecutor'
## What changes were proposed in this pull request?
Fix the 'ask' type parameter in 'removeExecutor' to eliminate a flood of
`Cannot cast java.lang.Boolean to scala.runtime.Nothing$` error logs.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <[email protected]>
Closes #14983 from zsxwing/SPARK-17316-3.
(cherry picked from commit 175b4344112b376cbbbd05265125ed0e1b87d507)
Signed-off-by: Shixiong Zhu <[email protected]>
commit a23d4065c5705b805c69e569ea177167d44b5244
Author: Wenchen Fan <[email protected]>
Date: 2016-09-06T02:36:00Z
[SPARK-17279][SQL] better error message for exceptions during ScalaUDF
execution
## What changes were proposed in this pull request?
If `ScalaUDF` throws an exception while executing user code, it is sometimes
hard for users to figure out what went wrong, especially when they use the
Spark shell. An example:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 325.0 failed 4 times, most recent failure: Lost task 12.3 in stage 325.0 (TID 35622, 10.0.207.202): java.lang.NullPointerException
    at line8414e872fb8b42aba390efc153d1611a12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:40)
    at line8414e872fb8b42aba390efc153d1611a12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:40)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    ...
```
We should catch these exceptions and rethrow them with a better error
message that says the exception happened in the Scala UDF.
This PR also does some cleanup of `ScalaUDF` and adds a unit test suite
for it.
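A minimal sketch of the wrapping pattern (illustrative only; `wrapWithContext` and the message text are assumptions, not the actual `ScalaUDF` internals):
```
// Hypothetical sketch: wrap a user-supplied function so any exception it
// throws is rethrown with a message pointing at the UDF, keeping the cause.
def wrapWithContext[A, B](udfName: String)(f: A => B): A => B = { input =>
  try {
    f(input)
  } catch {
    case e: Exception =>
      throw new RuntimeException(
        s"Failed to execute user defined function ($udfName): ${e.getClass.getName}", e)
  }
}

// Usage: the original NullPointerException stays attached as the cause.
val safeStrlen = wrapWithContext[String, Int]("strlen")(s => s.length)
```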
## How was this patch tested?
the new test suite
Author: Wenchen Fan <[email protected]>
Closes #14850 from cloud-fan/npe.
(cherry picked from commit 8d08f43d09157b98e559c0be6ce6fd571a35e0d1)
Signed-off-by: Wenchen Fan <[email protected]>
commit 796577b43d3df94f5d3a8e4baeb0aa03fbbb3f21
Author: Tathagata Das <[email protected]>
Date: 2016-09-07T02:34:11Z
[SPARK-17372][SQL][STREAMING] Avoid serialization issues by using Arrays to
save file names in FileStreamSource
## What changes were proposed in this pull request?
When we create a file stream on a directory that has partitioned subdirs
(i.e. dir/x=y/), ListingFileCatalog.allFiles returns the files in the dir as a
Seq[String], which internally is a Stream[String]. This is because of this
[line](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L93),
where LinkedHashSet.values.toSeq returns a Stream. Then when the
[FileStreamSource](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L79)
filters this Stream[String] to remove the seen files, it creates a new
Stream[String] whose filter function has a $outer reference to the
FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String]
causes a NotSerializableException. This happens even if there is just one
file in the dir.
It's important to note that this behavior is different in Scala 2.11. There
is no $outer reference to FileStreamSource, so it does not throw
NotSerializableException. However, with a large sequence of files (tested with
10000 files), it throws StackOverflowError. This is because of how the Stream
class is implemented: it is basically a linked list, and attempting to
serialize a long Stream requires *recursively* walking through the list, thus
resulting in StackOverflowError.
In short, across both Scala 2.10 and 2.11, serialization fails when both of
the following conditions are true:
- the file stream is defined on a partitioned directory
- the directory has 10k+ files

The right solution is to convert the seq to an array before writing to the
log. This PR implements the fix in two ways (a sketch of the idea follows the list):
- Changed all uses of HDFSMetadataLog to ensure an Array is used instead of
a Seq
- Added a `require` in HDFSMetadataLog so that it is never used with type
Seq
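A minimal sketch of the underlying idea, under illustrative names (the real change is in HDFSMetadataLog and FileStreamSource): materialize the lazy Seq into an Array before handing it to Java serialization.
```
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// A lazily-built Seq such as ListingFileCatalog's Stream[String] of file paths.
val files: Seq[String] = Stream.tabulate(10000)(i => s"dir/x=y/part-$i.parquet")

// Serializing the Stream itself can drag in closures ($outer in Scala 2.10) or
// recurse once per element (StackOverflowError in 2.11); an Array is flat.
val asArray: Array[String] = files.toArray

val bytes = new ByteArrayOutputStream()
val out = new ObjectOutputStream(bytes)
out.writeObject(asArray)   // safe regardless of how many files there are
out.close()
println(s"serialized ${asArray.length} paths into ${bytes.size()} bytes")
```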
## How was this patch tested?
Added a unit test that ensures the file stream source can handle 10000
files. This test fails in both Scala 2.10 and 2.11 with the different
failures indicated above.
Author: Tathagata Das <[email protected]>
Closes #14987 from tdas/SPARK-17372.
(cherry picked from commit eb1ab88a86ce35f3d6ba03b3a798099fbcf6b3fc)
Signed-off-by: Tathagata Das <[email protected]>
commit ee6301a88e3b109398cec9bc470b5a88f72654dd
Author: Clark Fitzgerald <[email protected]>
Date: 2016-09-07T06:40:37Z
[SPARK-16785] R dapply doesn't return array or raw columns
Fixed bug in `dapplyCollect` by changing the `compute` function of
`worker.R` to explicitly handle raw (binary) vectors.
cc shivaram
Unit tests
Author: Clark Fitzgerald <[email protected]>
Closes #14783 from clarkfitzg/SPARK-16785.
(cherry picked from commit 9fccde4ff80fb0fd65a9e90eb3337965e4349de4)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit c8811adaa6b2fb6c5ca31520908d148326ebaf18
Author: Herman van Hovell <[email protected]>
Date: 2016-09-07T08:38:56Z
[SPARK-17296][SQL] Simplify parser join processing [BACKPORT 2.0]
## What changes were proposed in this pull request?
This PR backports https://github.com/apache/spark/pull/14867 to branch-2.0.
It fixes a number of join ordering bugs.
## How was this patch tested?
Added tests to `PlanParserSuite`.
Author: Herman van Hovell <[email protected]>
Closes #14984 from hvanhovell/SPARK-17296-branch-2.0.
commit e6caceb5e141a1665b21d04079a86baca041e453
Author: Srinivasa Reddy Vundela <[email protected]>
Date: 2016-09-07T11:41:03Z
[MINOR][SQL] Fixing the typo in unit test
## What changes were proposed in this pull request?
Fixing the typo in the unit test of CodeGenerationSuite.scala
## How was this patch tested?
Ran the unit test after fixing the typo and it passes
Author: Srinivasa Reddy Vundela <[email protected]>
Closes #14989 from vundela/typo_fix.
(cherry picked from commit 76ad89e9241fb2dece95dd445661dd95ee4ef699)
Signed-off-by: Sean Owen <[email protected]>
commit 078ac0e6321aeb72c670a65ec90b9c20ef0a7788
Author: Eric Liang <[email protected]>
Date: 2016-09-07T19:33:50Z
[SPARK-17370] Shuffle service files not invalidated when a slave is lost
## What changes were proposed in this pull request?
DAGScheduler invalidates shuffle files when an executor loss event occurs,
but not when the external shuffle service is enabled. This is because when
shuffle service is on, the shuffle file lifetime can exceed the executor
lifetime.
However, it also doesn't invalidate shuffle files when the shuffle service
itself is lost (due to whole slave loss). This can cause long hangs when slaves
are lost since the file loss is not detected until a subsequent stage attempts
to read the shuffle files.
The proposed fix is to also invalidate shuffle files when an executor is
lost due to a `SlaveLost` event.
## How was this patch tested?
Unit tests, also verified on an actual cluster that slave loss invalidates
shuffle files immediately as expected.
cc mateiz
Author: Eric Liang <[email protected]>
Closes #14931 from ericl/sc-4439.
(cherry picked from commit 649fa4bf1d6fc9271ae56b6891bc93ebf57858d1)
Signed-off-by: Josh Rosen <[email protected]>
commit 067752ce08dc035ee807d20be2202c385f88f01c
Author: Marcelo Vanzin <[email protected]>
Date: 2016-09-07T23:43:05Z
[SPARK-16533][CORE] - backport driver deadlock fix to 2.0
## What changes were proposed in this pull request?
Backport changes from #14710 and #14925 to 2.0
Author: Marcelo Vanzin <[email protected]>
Author: Angus Gerry <[email protected]>
Closes #14933 from angolon/SPARK-16533-2.0.
commit 28377da380d3859e0a837aae1c39529228c515f5
Author: hyukjinkwon <[email protected]>
Date: 2016-09-08T04:22:32Z
[SPARK-17339][CORE][BRANCH-2.0] Do not use path to get a filesystem in
hadoopFile and newHadoopFile APIs
## What changes were proposed in this pull request?
This PR backports https://github.com/apache/spark/pull/14960
## How was this patch tested?
AppVeyor -
https://ci.appveyor.com/project/HyukjinKwon/spark/build/86-backport-SPARK-17339-r
Author: hyukjinkwon <[email protected]>
Closes #15008 from HyukjinKwon/backport-SPARK-17339.
commit e169085cd33ff498ecd5aab180af036ca644c9e0
Author: Thomas Graves <[email protected]>
Date: 2016-09-08T13:16:19Z
[SPARK-16711] YarnShuffleService doesn't re-init properly on YARN rolling
upgrade
branch-2.0 version of this patch. The differences are in the
YarnShuffleService logic for finding the location to put the DB; branch-2.0
does not use the YARN NM recovery path like master does.
Tested manually on an 8-node YARN cluster and ran unit tests. Manual tests
verified that the DB is created properly and found again if it already exists.
Verified that during a rolling upgrade, credentials were reloaded and the
running application was not affected.
Author: Thomas Graves <[email protected]>
Closes #14997 from tgravescs/SPARK-16711-branch2.0.
commit c6e0dd1d46f40cd0451155ee9730f429fe212a27
Author: Felix Cheung <[email protected]>
Date: 2016-09-08T15:22:58Z
[SPARK-17442][SPARKR] Additional arguments in write.df are not passed to
data source
## What changes were proposed in this pull request?
Additional options were not passed down in write.df.
## How was this patch tested?
unit tests
falaki shivaram
Author: Felix Cheung <[email protected]>
Closes #15010 from felixcheung/testreadoptions.
(cherry picked from commit f0d21b7f90cdcce353ab6fc279b9cc376e46e536)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit a7f1c18988a76b2cb970c5e0af64cc070c9ac67c
Author: Joseph K. Bradley <[email protected]>
Date: 2016-09-09T12:35:10Z
[SPARK-17456][CORE] Utility for parsing Spark versions
## What changes were proposed in this pull request?
This patch adds methods for extracting major and minor versions as Int
types in Scala from a Spark version string.
Motivation: There are many hacks within Spark's codebase to identify and
compare Spark versions. We should add a simple utility to standardize these
code paths, especially since there have been mistakes made in the past. This
will let us add unit tests as well. Currently, I want this functionality to
check Spark versions to provide backwards compatibility for ML model
persistence.
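For illustration, a minimal sketch of such a utility (the object name, error message, and exact behavior are assumptions, not necessarily the API added by this patch):
```
// Hypothetical helper: pull the major and minor components out of a Spark
// version string like "2.0.1" or "2.0.1-SNAPSHOT".
object VersionParsing {
  private val VersionPattern = """^(\d+)\.(\d+)(\..*)?$""".r

  def majorMinorVersion(sparkVersion: String): (Int, Int) = sparkVersion match {
    case VersionPattern(major, minor, _) => (major.toInt, minor.toInt)
    case _ =>
      throw new IllegalArgumentException(
        s"Could not find major and minor version numbers in '$sparkVersion'.")
  }
}

// e.g. VersionParsing.majorMinorVersion("2.0.1-SNAPSHOT") == (2, 0)
```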
## How was this patch tested?
Unit tests
Author: Joseph K. Bradley <[email protected]>
Closes #15017 from jkbradley/version-parsing.
(cherry picked from commit 65b814bf50e92e2e9b622d1602f18bacd217181c)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit 6f02f40b4b4145d0cd7997b413a13ebcc12ec510
Author: hyukjinkwon <[email protected]>
Date: 2016-09-09T21:23:05Z
[SPARK-17354] [SQL] Partitioning by dates/timestamps should work with
Parquet vectorized reader
This PR fixes `ColumnVectorUtils.populate` so that the Parquet vectorized
reader can read partitioned tables with dates/timestamps. This works fine with
the normal Parquet reader.
This is only called within
[VectorizedParquetRecordReader.java#L185](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L185).
When partition column types are explicitly given as `DateType` or
`TimestampType` (rather than inferred), this fails with the exception below:
```
16/09/01 10:30:07 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 6)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date
    at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:89)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:185)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:204)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)
    ...
```
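A hedged reproduction sketch (the path, session setup, and data are illustrative): write a date-partitioned Parquet table and read it back with the partition column type given explicitly, which is the case the vectorized reader previously mishandled.
```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DateType, StringType, StructType}

val spark = SparkSession.builder().appName("date-partition-repro").master("local[*]").getOrCreate()
import spark.implicits._

val path = "/tmp/date_partitioned_table"   // illustrative path
Seq(("a", java.sql.Date.valueOf("2016-09-01")))
  .toDF("value", "dt")
  .write.partitionBy("dt").mode("overwrite").parquet(path)

// Give the partition column an explicit DateType instead of letting Spark infer it.
val schema = new StructType()
  .add("value", StringType)
  .add("dt", DateType)

spark.read.schema(schema).parquet(path).show()
```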
Unit tests in `SQLQuerySuite`.
Author: hyukjinkwon <[email protected]>
Closes #14919 from HyukjinKwon/SPARK-17354.
(cherry picked from commit f7d2143705c8c1baeed0bc62940f9dba636e705b)
Signed-off-by: Davies Liu <[email protected]>
commit c2378a6821e74cce52b853e7c556044b922625b1
Author: Ryan Blue <[email protected]>
Date: 2016-09-10T09:18:53Z
[SPARK-17396][CORE] Share the task support between UnionRDD instances.
## What changes were proposed in this pull request?
Share the ForkJoinTaskSupport between UnionRDD instances to avoid creating
a huge number of threads if lots of RDDs are created at the same time.
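A minimal sketch of the general technique (illustrative; not the UnionRDD code itself), written against the Scala 2.10/2.11 parallel-collections API that branch-2.0 builds with:
```
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool  // java.util.concurrent.ForkJoinPool on Scala 2.12+

object SharedParallelism {
  // One pool shared by every parallel collection, instead of a fresh pool per instance.
  val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
}

def sumLengths(groups: Seq[Seq[String]]): Int = {
  val par = groups.par
  par.tasksupport = SharedParallelism.taskSupport  // reuse the shared pool
  par.map(_.length).sum
}
```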
## How was this patch tested?
This uses existing UnionRDD tests.
Author: Ryan Blue <[email protected]>
Closes #14985 from rdblue/SPARK-17396-use-shared-pool.
(cherry picked from commit 6ea5055fa734d435b5f148cf52d3385a57926b60)
Signed-off-by: Sean Owen <[email protected]>
commit bde54526845a315de5b5c8ee3eae9f3d14debac5
Author: Timothy Hunter <[email protected]>
Date: 2016-09-11T07:03:45Z
[SPARK-17439][SQL] Fixing compression issues with approximate quantiles and
adding more tests
This PR builds on #14976 and fixes a correctness bug that would cause the
wrong quantile to be returned for small target errors.
This PR adds 8 unit tests that were failing without the fix.
Author: Timothy Hunter <[email protected]>
Author: Sean Owen <[email protected]>
Closes #15002 from thunterdb/ml-1783.
(cherry picked from commit 180796ecb3a00facde2d98affdb5aa38dd258875)
Signed-off-by: Sean Owen <[email protected]>
commit d293062a4cd04acf48697b1fd6a70fef97b338da
Author: Bryan Cutler <[email protected]>
Date: 2016-09-11T09:19:39Z
[SPARK-17336][PYSPARK] Fix appending multiple times to PYTHONPATH from
spark-config.sh
## What changes were proposed in this pull request?
During startup of Spark standalone, the script file spark-config.sh appends
to PYTHONPATH and can be sourced many times, causing duplicates in the
path. This change adds an env flag that is set when PYTHONPATH is appended,
so the append happens only once.
## How was this patch tested?
Manually started standalone master/worker and verified PYTHONPATH has no
duplicate entries.
Author: Bryan Cutler <[email protected]>
Closes #15028 from BryanCutler/fix-duplicate-pythonpath-SPARK-17336.
(cherry picked from commit c76baff0cc4775c2191d075cc9a8176e4915fec8)
Signed-off-by: Sean Owen <[email protected]>
commit 30521522dbe65504876f0302030ef84945ad98b5
Author: Josh Rosen <[email protected]>
Date: 2016-09-12T04:51:22Z
[SPARK-17486] Remove unused TaskMetricsUIData.updatedBlockStatuses field
The `TaskMetricsUIData.updatedBlockStatuses` field is assigned to but never
read, increasing the memory consumption of the web UI. We should remove this
field.
Author: Josh Rosen <[email protected]>
Closes #15038 from
JoshRosen/remove-updated-block-statuses-from-TaskMetricsUIData.
(cherry picked from commit 72eec70bdbf6fb67c977463db5d8d95dd3040ae8)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 0a36e360cd4bb2c66687caf017fbeeece41a7ccd
Author: Sean Zhong <[email protected]>
Date: 2016-09-12T18:30:06Z
[SPARK-17503][CORE] Fix memory leak in Memory store when unable to cache
the whole RDD in memory
## What changes were proposed in this pull request?
MemoryStore may throw OutOfMemoryError when trying to cache a very large
RDD that cannot fit in memory.
```
scala> sc.parallelize(1 to 1000000000, 100).map(x => new Array[Long](1000)).cache().count()
java.lang.OutOfMemoryError: Java heap space
    at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:24)
    at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:23)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
    at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```
Spark MemoryStore uses SizeTrackingVector as a temporary unrolling buffer
to store all input values it has read so far before transferring them to the
storage memory cache. The problem is that when the input RDD is too big to
cache in memory, the temporary unrolling buffer (SizeTrackingVector) is not
garbage collected in time. Since the SizeTrackingVector can occupy all
available storage memory, it may cause the executor JVM to run out of memory
quickly.
More info can be found at https://issues.apache.org/jira/browse/SPARK-17503
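A minimal sketch of the idea behind the fix (illustrative; not the real PartiallyUnrolledIterator): drop the reference to the unrolling buffer as soon as it has been fully consumed, so it becomes eligible for garbage collection while the rest of the partition is still being iterated.
```
// Chains the partially unrolled values with the not-yet-read rest of the
// partition, releasing the buffer reference once it is exhausted.
class ReleasingIterator[T](unrolledBuffer: Vector[T], rest: Iterator[T]) extends Iterator[T] {
  private var unrolled: Iterator[T] = unrolledBuffer.iterator

  private def releaseIfDone(): Unit = {
    if (unrolled != null && !unrolled.hasNext) {
      unrolled = null  // the only live reference to the buffer goes away here
    }
  }

  override def hasNext: Boolean = {
    releaseIfDone()
    unrolled != null || rest.hasNext
  }

  override def next(): T = {
    releaseIfDone()
    if (unrolled != null) unrolled.next() else rest.next()
  }
}
```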
## How was this patch tested?
Unit test and manual test.
### Before change
Heap memory consumption
<img width="702" alt="screen shot 2016-09-12 at 4 16 15 pm"
src="https://cloud.githubusercontent.com/assets/2595532/18429524/60d73a26-7906-11e6-9768-6f286f5c58c8.png">
Heap dump
<img width="1402" alt="screen shot 2016-09-12 at 4 34 19 pm"
src="https://cloud.githubusercontent.com/assets/2595532/18429577/cbc1ef20-7906-11e6-847b-b5903f450b3b.png">
### After change
Heap memory consumption
<img width="706" alt="screen shot 2016-09-12 at 4 29 10 pm"
src="https://cloud.githubusercontent.com/assets/2595532/18429503/4abe9342-7906-11e6-844a-b2f815072624.png">
Author: Sean Zhong <[email protected]>
Closes #15056 from clockfly/memory_store_leak.
(cherry picked from commit 1742c3ab86d75ce3d352f7cddff65e62fb7c8dd4)
Signed-off-by: Josh Rosen <[email protected]>
commit 37f45bf0d95f463c38e5636690545472c0399222
Author: Josh Rosen <[email protected]>
Date: 2016-09-12T22:24:33Z
[SPARK-14818] Post-2.0 MiMa exclusion and build changes
This patch makes a handful of post-Spark-2.0 MiMa exclusion and build
updates. It should be merged to master and a subset of it should be picked into
branch-2.0 in order to test Spark 2.0.1-SNAPSHOT.
- Remove `sketch`, `mllibLocal`, and `streamingKafka010` from the list
of excluded subprojects so that MiMa checks them.
- Remove now-unnecessary special-case handling of the Kafka 0.8 artifact in
`mimaSettings`.
- Move the exclusion added in SPARK-14743 from `v20excludes` to
`v21excludes`, since that patch was only merged into master and not branch-2.0.
- Add exclusions for an API change introduced by SPARK-17096 / #14675.
- Add missing exclusions for the `o.a.spark.internal` and
`o.a.spark.sql.internal` packages.
Author: Josh Rosen <[email protected]>
Closes #15061 from JoshRosen/post-2.0-mima-changes.
(cherry picked from commit 7c51b99a428a965ff7d136e1cdda20305d260453)
Signed-off-by: Josh Rosen <[email protected]>
commit a3fc5762b896e6531a66802dbfe583c98eccc42b
Author: Josh Rosen <[email protected]>
Date: 2016-09-12T22:43:57Z
[SPARK-17485] Prevent failed remote reads of cached blocks from failing
entire job
## What changes were proposed in this pull request?
In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached
RDD block, then a remote copy, and only fall back to recomputing the block if
no cached copy (local or remote) can be read. This logic works correctly in the
case where no remote copies of the block exist, but if there _are_ remote
copies and reads of those copies fail (due to network issues or internal Spark
bugs) then the BlockManager will throw a `BlockFetchException` that will fail
the task (and which could possibly fail the whole job if the read failures keep
occurring).
In the cases of TorrentBroadcast and task result fetching we really do want
to fail the entire job in case no remote blocks can be fetched, but this logic
is inappropriate for reads of cached RDD blocks because those can/should be
recomputed in case cached blocks are unavailable.
Therefore, I think that the `BlockManager.getRemoteBytes()` method should
never throw on remote fetch errors and, instead, should handle failures by
returning `None`.
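A minimal sketch of the control flow this enables (illustrative names and stubs, not the BlockManager API):
```
import java.io.IOException

// Stubs standing in for a local lookup and a remote fetch that may fail.
def readLocal(blockId: String): Option[Array[Byte]] = None
def fetchRemote(blockId: String): Array[Byte] =
  throw new IOException(s"simulated remote read failure for $blockId")

// Report remote-read failures as None instead of throwing...
def getRemoteBytes(blockId: String): Option[Array[Byte]] =
  try Some(fetchRemote(blockId)) catch { case _: IOException => None }

// ...so the caller can fall back to recomputing the cached block.
def getOrCompute(blockId: String)(recompute: => Array[Byte]): Array[Byte] =
  readLocal(blockId).orElse(getRemoteBytes(blockId)).getOrElse(recompute)

// getOrCompute("rdd_2_3")(Array[Byte](1, 2, 3)) recomputes instead of failing the task.
```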
## How was this patch tested?
Block manager changes should be covered by modified tests in
`BlockManagerSuite`: the old tests expected exceptions to be thrown on failed
remote reads, while the modified tests now expect `None` to be returned from
the `getRemote*` method.
I also manually inspected all usages of `BlockManager.getRemoteValues()`,
`getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on
the result and handle `None`. Note that these `None` branches are already
exercised because the old `getRemoteBytes` returned `None` when no remote
locations for the block could be found (which could occur if an executor died
and its block manager de-registered with the master).
Author: Josh Rosen <[email protected]>
Closes #15037 from JoshRosen/SPARK-17485.
(cherry picked from commit f9c580f11098d95f098936a0b90fa21d71021205)
Signed-off-by: Josh Rosen <[email protected]>
commit 1f72e774bbfb1f57d79ef798c957cdcf1278409a
Author: Davies Liu <[email protected]>
Date: 2016-09-12T23:35:42Z
[SPARK-17474] [SQL] fix python udf in TakeOrderedAndProjectExec
## What changes were proposed in this pull request?
When there is a Python UDF in the Project between Sort and Limit, it is
collected into TakeOrderedAndProjectExec, and ExtractPythonUDFs fails to pull
the Python UDFs out because QueryPlan.expressions does not include the
expressions inside Option[Seq[Expression]].
Ideally, we should fix `QueryPlan.expressions`, but I tried that with no
luck (it always ran into an infinite loop). In this PR, I changed
TakeOrderedAndProjectExec to not use Option[Seq[Expression]] as a workaround.
cc JoshRosen
## How was this patch tested?
Added regression test.
Author: Davies Liu <[email protected]>
Closes #15030 from davies/all_expr.
(cherry picked from commit a91ab705e8c124aa116c3e5b1f3ba88ce832dcde)
Signed-off-by: Davies Liu <[email protected]>
commit b17f10ced34cbff8716610df370d19b130f93827
Author: Josh Rosen <[email protected]>
Date: 2016-09-13T10:54:03Z
[SPARK-17515] CollectLimit.execute() should perform per-partition limits
CollectLimit.execute() incorrectly omits per-partition limits, leading to
performance regressions when this code path is hit (which should not happen in
normal operation, but can occur in some cases; see #15068 for one example).
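A minimal sketch of what a per-partition limit buys (illustrative; not CollectLimitExec itself): cap each partition at n rows before collecting, then apply the global limit on the driver.
```
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def collectLimit[T: ClassTag](rdd: RDD[T], n: Int): Array[T] =
  rdd.mapPartitions(iter => iter.take(n))  // no partition ships more than n rows
     .collect()
     .take(n)                              // global limit applied on the driver
```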
Regression test in SQLQuerySuite that asserts the number of records scanned
from the input RDD.
Author: Josh Rosen <[email protected]>
Closes #15070 from JoshRosen/SPARK-17515.
(cherry picked from commit 3f6a2bb3f7beac4ce928eb660ee36258b5b9e8c8)
Signed-off-by: Herman van Hovell <[email protected]>
commit c1426452bb69f7eb2209d70449a5306fb29d875f
Author: Burak Yavuz <[email protected]>
Date: 2016-09-13T22:11:55Z
[SPARK-17531] Don't initialize Hive Listeners for the Execution Client
## What changes were proposed in this pull request?
If a user provides listeners inside the Hive Conf, the configuration for
these listeners is passed to the Hive Execution Client as well. This may cause
issues for two reasons:
1. The Execution Client will actually generate garbage
2. The listener class needs to be both in the Spark Classpath and Hive
Classpath
This PR empties the listener configurations in
`HiveUtils.newTemporaryConfiguration` so that the execution client will not
contain the listener confs, but the metadata client will.
## How was this patch tested?
Unit tests
Author: Burak Yavuz <[email protected]>
Closes #15086 from brkyvz/null-listeners.
(cherry picked from commit 72edc7e958271cedb01932880550cfc2c0631204)
Signed-off-by: Yin Huai <[email protected]>
commit 12ebfbeddf057efb666a7b6365c948c3fe479f2c
Author: Sami Jaktholm <[email protected]>
Date: 2016-09-14T08:38:30Z
[SPARK-17525][PYTHON] Remove SparkContext.clearFiles() from the PySpark API
as it was removed from the Scala API prior to Spark 2.0.0
## What changes were proposed in this pull request?
This pull request removes the SparkContext.clearFiles() method from the
PySpark API as the method was removed from the Scala API in
8ce645d4eeda203cf5e100c4bdba2d71edd44e6a. Using that method in PySpark leads to
an exception as PySpark tries to call the non-existent method on the JVM side.
## How was this patch tested?
Existing tests (though none of them tested this particular method).
Author: Sami Jaktholm <[email protected]>
Closes #15081 from sjakthol/pyspark-sc-clearfiles.
(cherry picked from commit b5bfcddbfbc2e79d3d0fbd43942716946e6c4ba3)
Signed-off-by: Sean Owen <[email protected]>
commit c6ea748a7e0baec222cbb4bd130673233adc5e0c
Author: Ergin Seyfe <[email protected]>
Date: 2016-09-14T08:51:14Z
[SPARK-17480][SQL] Improve performance by removing or caching List.length
which is O(n)
## What changes were proposed in this pull request?
Scala's List.length method is O(N) and it makes the
gatherCompressibilityStats function O(N^2). Eliminate the List.length calls by
writing it in Scala way.
https://github.com/scala/scala/blob/2.10.x/src/library/scala/collection/LinearSeqOptimized.scala#L36
As suggested. Extended the fix to HiveInspectors and AggregationIterator
classes as well.
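A minimal sketch of the kind of rewrite involved (illustrative; not the actual gatherCompressibilityStats code): indexing a List while re-reading its length on every iteration is quadratic, while a single traversal is linear.
```
// Quadratic on List: length is O(n) and apply(i) is O(i).
def sumSlow(xs: List[Int]): Long = {
  var total = 0L
  var i = 0
  while (i < xs.length) {   // O(n) length call per iteration
    total += xs(i)          // O(i) positional access
    i += 1
  }
  total
}

// Linear: one pass, no length checks or positional indexing.
def sumFast(xs: List[Int]): Long = {
  var total = 0L
  xs.foreach(total += _)
  total
}
```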
## How was this patch tested?
Profiled a Spark job and found that CompressibleColumnBuilder was using 39%
of the CPU; of that, CompressibleColumnBuilder->gatherCompressibilityStats was
using 23%, and 6.24% of the CPU was spent on List.length, which is called
inside gatherCompressibilityStats.
After this change we save that 6.24% of the CPU.
Author: Ergin Seyfe <[email protected]>
Closes #15032 from seyfe/gatherCompressibilityStats.
(cherry picked from commit 4cea9da2ae88b40a5503111f8f37051e2372163e)
Signed-off-by: Sean Owen <[email protected]>
commit 5493107d99977964cca1c15a2b0e084899e96dac
Author: Sean Owen <[email protected]>
Date: 2016-09-14T09:10:16Z
[SPARK-17445][DOCS] Reference an ASF page as the main place to find
third-party packages
## What changes were proposed in this pull request?
Point references to spark-packages.org to
https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
This will be accompanied by a parallel change to the spark-website repo,
and additional changes to this wiki.
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <[email protected]>
Closes #15075 from srowen/SPARK-17445.
(cherry picked from commit dc0a4c916151c795dc41b5714e9d23b4937f4636)
Signed-off-by: Sean Owen <[email protected]>
commit 6fe5972e649e171e994c30bff3da0c408a3d7f3a
Author: Josh Rosen <[email protected]>
Date: 2016-09-14T17:10:01Z
[SPARK-17514] df.take(1) and df.limit(1).collect() should perform the same
in Python
## What changes were proposed in this pull request?
In PySpark, `df.take(1)` runs a single-stage job which computes only one
partition of the DataFrame, while `df.limit(1).collect()` computes all
partitions and runs a two-stage job. This difference in performance is
confusing.
The reason why `limit(1).collect()` is so much slower is that `collect()`
internally maps to `df.rdd.<some-pyspark-conversions>.toLocalIterator`, which
causes Spark SQL to build a query where a global limit appears in the middle of
the plan; this, in turn, ends up being executed inefficiently because limits in
the middle of plans are now implemented by repartitioning to a single task
rather than by running a `take()` job on the driver (this was done in #7334, a
patch which was a prerequisite to allowing partition-local limits to be pushed
beneath unions, etc.).
In order to fix this performance problem I think that we should generalize
the fix from SPARK-10731 / #8876 so that `DataFrame.collect()` also delegates
to the Scala implementation and shares the same performance properties. This
patch modifies `DataFrame.collect()` to first collect all results to the driver
and then pass them to Python, allowing this query to be planned using Spark's
`CollectLimit` optimizations.
## How was this patch tested?
Added a regression test in `sql/tests.py` which asserts that the expected
number of jobs, stages, and tasks are run for both queries.
Author: Josh Rosen <[email protected]>
Closes #15068 from JoshRosen/pyspark-collect-limit.
(cherry picked from commit 6d06ff6f7e2dd72ba8fe96cd875e83eda6ebb2a9)
Signed-off-by: Davies Liu <[email protected]>
commit fab77dadf70d011cec8976acfe8c851816f82426
Author: Kishor Patil <[email protected]>
Date: 2016-09-14T19:19:35Z
[SPARK-17511] Yarn Dynamic Allocation: Avoid marking released container as
Failed
Due to race conditions, the `assert(numExecutorsRunning <=
targetNumExecutors)` can fail, causing an `AssertionError`. So the assertion
was removed and the conditional check was instead moved before launching a new
container:
```
java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:156)
    at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1.org$apache$spark$deploy$yarn$YarnAllocator$$anonfun$$updateInternalState$1(YarnAllocator.scala:489)
    at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:519)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```
This was manually tested using a large ForkAndJoin job with Dynamic
Allocation enabled to validate the failing job succeeds, without any such
exception.
Author: Kishor Patil <[email protected]>
Closes #15069 from kishorvpatil/SPARK-17511.
(cherry picked from commit ff6e4cbdc80e2ad84c5d70ee07f323fad9374e3e)
Signed-off-by: Tom Graves <[email protected]>
commit fffcec90b65047c3031c2b96679401f8fbef6337
Author: Shixiong Zhu <[email protected]>
Date: 2016-09-14T20:33:51Z
[SPARK-17463][CORE] Make CollectionAccumulator and SetAccumulator's value
can be read thread-safely
## What changes were proposed in this pull request?
Make the values of CollectionAccumulator and SetAccumulator readable in a
thread-safe way, to fix the ConcurrentModificationException reported in
[JIRA](https://issues.apache.org/jira/browse/SPARK-17463).
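A minimal sketch of the thread-safety idea (illustrative; not the actual accumulator sources): hand readers a copy made under the collection's lock, so value can be read while tasks are still adding.
```
import java.util.{ArrayList => JArrayList, Collections, List => JList}

class SafeCollectionAccumulator[T] {
  private val buffer: JList[T] = Collections.synchronizedList(new JArrayList[T]())

  def add(v: T): Unit = buffer.add(v)

  // Copy while holding the list's monitor, so the snapshot never observes a
  // concurrent add mid-iteration (no ConcurrentModificationException).
  def value: JList[T] = buffer.synchronized {
    new JArrayList[T](buffer)
  }
}
```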
## How was this patch tested?
Existing tests.
Author: Shixiong Zhu <[email protected]>
Closes #15063 from zsxwing/SPARK-17463.
(cherry picked from commit e33bfaed3b160fbc617c878067af17477a0044f5)
Signed-off-by: Josh Rosen <[email protected]>
commit bb2bdb44032d2e71832b3e0e771590fb2225e4f3
Author: Xing SHI <[email protected]>
Date: 2016-09-14T20:46:46Z
[SPARK-17465][SPARK CORE] Inappropriate memory management in
`org.apache.spark.storage.MemoryStore` may lead to memory leak
The expression `if (memoryMap(taskAttemptId) == 0)
memoryMap.remove(taskAttemptId)` in the methods `releaseUnrollMemoryForThisTask`
and `releasePendingUnrollMemoryForThisTask` should be evaluated after the
release-memory operation, whether `memoryToRelease` is > 0 or not.
Otherwise, if a task's memory has already been set to 0 when
`releaseUnrollMemoryForThisTask` or `releasePendingUnrollMemoryForThisTask`
is called, the key corresponding to that task is never removed from the
memory map.
See the details in
[SPARK-17465](https://issues.apache.org/jira/browse/SPARK-17465).
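A minimal sketch of the corrected ordering (illustrative names; not the MemoryStore source): subtract whatever can be released first, then drop the task's entry whenever its balance is zero, even if nothing was released on this call.
```
import scala.collection.mutable

val memoryMap = mutable.HashMap[Long, Long]().withDefaultValue(0L)

def releaseUnrollMemory(taskAttemptId: Long, requested: Long): Unit = {
  val memoryToRelease = math.min(requested, memoryMap(taskAttemptId))
  if (memoryToRelease > 0) {
    memoryMap(taskAttemptId) -= memoryToRelease
  }
  // Cleanup is unconditional, so a task whose balance is already 0 still has
  // its key removed and the map cannot grow without bound.
  if (memoryMap(taskAttemptId) == 0) {
    memoryMap.remove(taskAttemptId)
  }
}
```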
Author: Xing SHI <[email protected]>
Closes #15022 from saturday-shi/SPARK-17465.
----