GitHub user tdas opened a pull request:
https://github.com/apache/spark/pull/15908
[SPARK-18459][SPARK-18460][StructuredStreaming] Rename triggerId to batchId
and add triggerDetails to json in StreamingQueryStatus (for branch-2.0)
This is the branch-2.0 fix for the earlier PR #15895
## What changes were proposed in this pull request?
SPARK-18459: triggerId seems like a number that should increase with each trigger, whether or not there is data in it. However, in reality, triggerId increases only when there is a batch of data in a trigger. So it's better to rename it to batchId.
SPARK-18460: triggerDetails was missing from the JSON representation. Fixed it.
## How was this patch tested?
Updated tests
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tdas/spark SPARK-18459-2.0
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15908.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15908
----
commit f3cfce09274741cc04bf2f00e873b3b64976b6d5
Author: Shixiong Zhu <[email protected]>
Date: 2016-09-06T23:49:06Z
[SPARK-17316][CORE] Fix the 'ask' type parameter in 'removeExecutor'
## What changes were proposed in this pull request?
Fix the 'ask' type parameter in 'removeExecutor' to eliminate a flood of
`Cannot cast java.lang.Boolean to scala.runtime.Nothing$` error logs.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <[email protected]>
Closes #14983 from zsxwing/SPARK-17316-3.
(cherry picked from commit 175b4344112b376cbbbd05265125ed0e1b87d507)
Signed-off-by: Shixiong Zhu <[email protected]>
commit a23d4065c5705b805c69e569ea177167d44b5244
Author: Wenchen Fan <[email protected]>
Date: 2016-09-06T02:36:00Z
[SPARK-17279][SQL] better error message for exceptions during ScalaUDF
execution
## What changes were proposed in this pull request?
If `ScalaUDF` throws an exception while executing user code, it is sometimes
hard for users to figure out what went wrong, especially when they use the
Spark shell. An example:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 325.0 failed 4 times, most recent failure: Lost task 12.3 in stage 325.0 (TID 35622, 10.0.207.202): java.lang.NullPointerException
    at line8414e872fb8b42aba390efc153d1611a12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:40)
    at line8414e872fb8b42aba390efc153d1611a12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:40)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    ...
```
We should catch these exceptions and rethrow them with a better error
message that says the exception happened in the Scala UDF.
This PR also does some cleanup of `ScalaUDF` and adds a unit test suite
for it.
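A minimal sketch of the wrapping pattern (illustrative only; `wrapWithContext` and the message text are assumptions, not the actual `ScalaUDF` internals):
```
// Hypothetical sketch: wrap a user-supplied function so any exception it
// throws is rethrown with a message pointing at the UDF, keeping the cause.
def wrapWithContext[A, B](udfName: String)(f: A => B): A => B = { input =>
  try {
    f(input)
  } catch {
    case e: Exception =>
      throw new RuntimeException(
        s"Failed to execute user defined function ($udfName): ${e.getClass.getName}", e)
  }
}

// Usage: the original NullPointerException stays attached as the cause.
val safeStrlen = wrapWithContext[String, Int]("strlen")(s => s.length)
```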
## How was this patch tested?
the new test suite
Author: Wenchen Fan <[email protected]>
Closes #14850 from cloud-fan/npe.
(cherry picked from commit 8d08f43d09157b98e559c0be6ce6fd571a35e0d1)
Signed-off-by: Wenchen Fan <[email protected]>
commit 796577b43d3df94f5d3a8e4baeb0aa03fbbb3f21
Author: Tathagata Das <[email protected]>
Date: 2016-09-07T02:34:11Z
[SPARK-17372][SQL][STREAMING] Avoid serialization issues by using Arrays to
save file names in FileStreamSource
## What changes were proposed in this pull request?
When we create a file stream on a directory that has partitioned subdirs
(i.e. dir/x=y/), ListingFileCatalog.allFiles returns the files in the dir as a
Seq[String], which internally is a Stream[String]. This is because of this
[line](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L93),
where LinkedHashSet.values.toSeq returns a Stream. Then when the
[FileStreamSource](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L79)
filters this Stream[String] to remove the seen files, it creates a new
Stream[String] whose filter function has a $outer reference to the
FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String]
causes a NotSerializableException. This happens even if there is just one
file in the dir.
It's important to note that this behavior is different in Scala 2.11. There
is no $outer reference to FileStreamSource, so it does not throw
NotSerializableException. However, with a large sequence of files (tested with
10000 files), it throws StackOverflowError. This is because of how the Stream
class is implemented: it is basically a linked list, and attempting to
serialize a long Stream requires *recursively* walking through the list, thus
resulting in StackOverflowError.
In short, across both Scala 2.10 and 2.11, serialization fails when both of
the following conditions are true:
- the file stream is defined on a partitioned directory
- the directory has 10k+ files

The right solution is to convert the seq to an array before writing to the
log. This PR implements the fix in two ways (a sketch of the idea follows the list):
- Changed all uses of HDFSMetadataLog to ensure an Array is used instead of
a Seq
- Added a `require` in HDFSMetadataLog so that it is never used with type
Seq
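A minimal sketch of the underlying idea, under illustrative names (the real change is in HDFSMetadataLog and FileStreamSource): materialize the lazy Seq into an Array before handing it to Java serialization.
```
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// A lazily-built Seq such as ListingFileCatalog's Stream[String] of file paths.
val files: Seq[String] = Stream.tabulate(10000)(i => s"dir/x=y/part-$i.parquet")

// Serializing the Stream itself can drag in closures ($outer in Scala 2.10) or
// recurse once per element (StackOverflowError in 2.11); an Array is flat.
val asArray: Array[String] = files.toArray

val bytes = new ByteArrayOutputStream()
val out = new ObjectOutputStream(bytes)
out.writeObject(asArray)   // safe regardless of how many files there are
out.close()
println(s"serialized ${asArray.length} paths into ${bytes.size()} bytes")
```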
## How was this patch tested?
Added a unit test that ensures the file stream source can handle 10000
files. This test fails in both Scala 2.10 and 2.11 with the different
failures indicated above.
Author: Tathagata Das <[email protected]>
Closes #14987 from tdas/SPARK-17372.
(cherry picked from commit eb1ab88a86ce35f3d6ba03b3a798099fbcf6b3fc)
Signed-off-by: Tathagata Das <[email protected]>
commit ee6301a88e3b109398cec9bc470b5a88f72654dd
Author: Clark Fitzgerald <[email protected]>
Date: 2016-09-07T06:40:37Z
[SPARK-16785] R dapply doesn't return array or raw columns
Fixed bug in `dapplyCollect` by changing the `compute` function of
`worker.R` to explicitly handle raw (binary) vectors.
cc shivaram
Unit tests
Author: Clark Fitzgerald <[email protected]>
Closes #14783 from clarkfitzg/SPARK-16785.
(cherry picked from commit 9fccde4ff80fb0fd65a9e90eb3337965e4349de4)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit c8811adaa6b2fb6c5ca31520908d148326ebaf18
Author: Herman van Hovell <[email protected]>
Date: 2016-09-07T08:38:56Z
[SPARK-17296][SQL] Simplify parser join processing [BACKPORT 2.0]
## What changes were proposed in this pull request?
This PR backports https://github.com/apache/spark/pull/14867 to branch-2.0.
It fixes a number of join ordering bugs.
## How was this patch tested?
Added tests to `PlanParserSuite`.
Author: Herman van Hovell <[email protected]>
Closes #14984 from hvanhovell/SPARK-17296-branch-2.0.
commit e6caceb5e141a1665b21d04079a86baca041e453
Author: Srinivasa Reddy Vundela <[email protected]>
Date: 2016-09-07T11:41:03Z
[MINOR][SQL] Fixing the typo in unit test
## What changes were proposed in this pull request?
Fixing the typo in the unit test of CodeGenerationSuite.scala
## How was this patch tested?
Ran the unit test after fixing the typo and it passes
Author: Srinivasa Reddy Vundela <[email protected]>
Closes #14989 from vundela/typo_fix.
(cherry picked from commit 76ad89e9241fb2dece95dd445661dd95ee4ef699)
Signed-off-by: Sean Owen <[email protected]>
commit 078ac0e6321aeb72c670a65ec90b9c20ef0a7788
Author: Eric Liang <[email protected]>
Date: 2016-09-07T19:33:50Z
[SPARK-17370] Shuffle service files not invalidated when a slave is lost
## What changes were proposed in this pull request?
DAGScheduler invalidates shuffle files when an executor loss event occurs,
but not when the external shuffle service is enabled. This is because when
shuffle service is on, the shuffle file lifetime can exceed the executor
lifetime.
However, it also doesn't invalidate shuffle files when the shuffle service
itself is lost (due to whole slave loss). This can cause long hangs when slaves
are lost since the file loss is not detected until a subsequent stage attempts
to read the shuffle files.
The proposed fix is to also invalidate shuffle files when an executor is
lost due to a `SlaveLost` event.
## How was this patch tested?
Unit tests, also verified on an actual cluster that slave loss invalidates
shuffle files immediately as expected.
cc mateiz
Author: Eric Liang <[email protected]>
Closes #14931 from ericl/sc-4439.
(cherry picked from commit 649fa4bf1d6fc9271ae56b6891bc93ebf57858d1)
Signed-off-by: Josh Rosen <[email protected]>
commit 067752ce08dc035ee807d20be2202c385f88f01c
Author: Marcelo Vanzin <[email protected]>
Date: 2016-09-07T23:43:05Z
[SPARK-16533][CORE] - backport driver deadlock fix to 2.0
## What changes were proposed in this pull request?
Backport changes from #14710 and #14925 to 2.0
Author: Marcelo Vanzin <[email protected]>
Author: Angus Gerry <[email protected]>
Closes #14933 from angolon/SPARK-16533-2.0.
commit 28377da380d3859e0a837aae1c39529228c515f5
Author: hyukjinkwon <[email protected]>
Date: 2016-09-08T04:22:32Z
[SPARK-17339][CORE][BRANCH-2.0] Do not use path to get a filesystem in
hadoopFile and newHadoopFile APIs
## What changes were proposed in this pull request?
This PR backports https://github.com/apache/spark/pull/14960
## How was this patch tested?
AppVeyor -
https://ci.appveyor.com/project/HyukjinKwon/spark/build/86-backport-SPARK-17339-r
Author: hyukjinkwon <[email protected]>
Closes #15008 from HyukjinKwon/backport-SPARK-17339.
commit e169085cd33ff498ecd5aab180af036ca644c9e0
Author: Thomas Graves <[email protected]>
Date: 2016-09-08T13:16:19Z
[SPARK-16711] YarnShuffleService doesn't re-init properly on YARN rolling
upgrade
branch-2.0 version of this patch. The differences are in the
YarnShuffleService logic for finding the location to put the DB; branch-2.0
does not use the YARN NM recovery path like master does.
Tested manually on an 8-node YARN cluster and ran unit tests. Manual tests
verified that the DB is created properly and found again if it already exists.
Verified that during a rolling upgrade, credentials were reloaded and the
running application was not affected.
Author: Thomas Graves <[email protected]>
Closes #14997 from tgravescs/SPARK-16711-branch2.0.
commit c6e0dd1d46f40cd0451155ee9730f429fe212a27
Author: Felix Cheung <[email protected]>
Date: 2016-09-08T15:22:58Z
[SPARK-17442][SPARKR] Additional arguments in write.df are not passed to
data source
## What changes were proposed in this pull request?
Additional options were not passed down in write.df.
## How was this patch tested?
unit tests
falaki shivaram
Author: Felix Cheung <[email protected]>
Closes #15010 from felixcheung/testreadoptions.
(cherry picked from commit f0d21b7f90cdcce353ab6fc279b9cc376e46e536)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit a7f1c18988a76b2cb970c5e0af64cc070c9ac67c
Author: Joseph K. Bradley <[email protected]>
Date: 2016-09-09T12:35:10Z
[SPARK-17456][CORE] Utility for parsing Spark versions
## What changes were proposed in this pull request?
This patch adds methods for extracting major and minor versions as Int
types in Scala from a Spark version string.
Motivation: There are many hacks within Spark's codebase to identify and
compare Spark versions. We should add a simple utility to standardize these
code paths, especially since there have been mistakes made in the past. This
will let us add unit tests as well. Currently, I want this functionality to
check Spark versions to provide backwards compatibility for ML model
persistence.
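For illustration, a minimal sketch of such a utility (the object name, error message, and exact behavior are assumptions, not necessarily the API added by this patch):
```
// Hypothetical helper: pull the major and minor components out of a Spark
// version string like "2.0.1" or "2.0.1-SNAPSHOT".
object VersionParsing {
  private val VersionPattern = """^(\d+)\.(\d+)(\..*)?$""".r

  def majorMinorVersion(sparkVersion: String): (Int, Int) = sparkVersion match {
    case VersionPattern(major, minor, _) => (major.toInt, minor.toInt)
    case _ =>
      throw new IllegalArgumentException(
        s"Could not find major and minor version numbers in '$sparkVersion'.")
  }
}

// e.g. VersionParsing.majorMinorVersion("2.0.1-SNAPSHOT") == (2, 0)
```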
## How was this patch tested?
Unit tests
Author: Joseph K. Bradley <[email protected]>
Closes #15017 from jkbradley/version-parsing.
(cherry picked from commit 65b814bf50e92e2e9b622d1602f18bacd217181c)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit 6f02f40b4b4145d0cd7997b413a13ebcc12ec510
Author: hyukjinkwon <[email protected]>
Date: 2016-09-09T21:23:05Z
[SPARK-17354] [SQL] Partitioning by dates/timestamps should work with
Parquet vectorized reader
This PR fixes `ColumnVectorUtils.populate` so that the Parquet vectorized
reader can read partitioned tables with dates/timestamps. This works fine with
the normal Parquet reader.
This is only called within
[VectorizedParquetRecordReader.java#L185](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L185).
When partition column types are explicitly given as `DateType` or
`TimestampType` (rather than inferred), this fails with the exception below:
```
16/09/01 10:30:07 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 6)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date
    at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:89)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:185)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:204)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)
    ...
```
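A hedged reproduction sketch (the path, session setup, and data are illustrative): write a date-partitioned Parquet table and read it back with the partition column type given explicitly, which is the case the vectorized reader previously mishandled.
```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DateType, StringType, StructType}

val spark = SparkSession.builder().appName("date-partition-repro").master("local[*]").getOrCreate()
import spark.implicits._

val path = "/tmp/date_partitioned_table"   // illustrative path
Seq(("a", java.sql.Date.valueOf("2016-09-01")))
  .toDF("value", "dt")
  .write.partitionBy("dt").mode("overwrite").parquet(path)

// Give the partition column an explicit DateType instead of letting Spark infer it.
val schema = new StructType()
  .add("value", StringType)
  .add("dt", DateType)

spark.read.schema(schema).parquet(path).show()
```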
Unit tests in `SQLQuerySuite`.
Author: hyukjinkwon <[email protected]>
Closes #14919 from HyukjinKwon/SPARK-17354.
(cherry picked from commit f7d2143705c8c1baeed0bc62940f9dba636e705b)
Signed-off-by: Davies Liu <[email protected]>
commit c2378a6821e74cce52b853e7c556044b922625b1
Author: Ryan Blue <[email protected]>
Date: 2016-09-10T09:18:53Z
[SPARK-17396][CORE] Share the task support between UnionRDD instances.
## What changes were proposed in this pull request?
Share the ForkJoinTaskSupport between UnionRDD instances to avoid creating
a huge number of threads if lots of RDDs are created at the same time.
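A minimal sketch of the general technique (illustrative; not the UnionRDD code itself), written against the Scala 2.10/2.11 parallel-collections API that branch-2.0 builds with:
```
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool  // java.util.concurrent.ForkJoinPool on Scala 2.12+

object SharedParallelism {
  // One pool shared by every parallel collection, instead of a fresh pool per instance.
  val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
}

def sumLengths(groups: Seq[Seq[String]]): Int = {
  val par = groups.par
  par.tasksupport = SharedParallelism.taskSupport  // reuse the shared pool
  par.map(_.length).sum
}
```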
## How was this patch tested?
This uses existing UnionRDD tests.
Author: Ryan Blue <[email protected]>
Closes #14985 from rdblue/SPARK-17396-use-shared-pool.
(cherry picked from commit 6ea5055fa734d435b5f148cf52d3385a57926b60)
Signed-off-by: Sean Owen <[email protected]>
commit bde54526845a315de5b5c8ee3eae9f3d14debac5
Author: Timothy Hunter <[email protected]>
Date: 2016-09-11T07:03:45Z
[SPARK-17439][SQL] Fixing compression issues with approximate quantiles and
adding more tests
This PR builds on #14976 and fixes a correctness bug that would cause the
wrong quantile to be returned for small target errors.
This PR adds 8 unit tests that were failing without the fix.
Author: Timothy Hunter <[email protected]>
Author: Sean Owen <[email protected]>
Closes #15002 from thunterdb/ml-1783.
(cherry picked from commit 180796ecb3a00facde2d98affdb5aa38dd258875)
Signed-off-by: Sean Owen <[email protected]>
commit d293062a4cd04acf48697b1fd6a70fef97b338da
Author: Bryan Cutler <[email protected]>
Date: 2016-09-11T09:19:39Z
[SPARK-17336][PYSPARK] Fix appending multiple times to PYTHONPATH from
spark-config.sh
## What changes were proposed in this pull request?
During startup of Spark standalone, the script file spark-config.sh appends
to PYTHONPATH and can be sourced many times, causing duplicates in the
path. This change adds an env flag that is set when PYTHONPATH is appended,
so the append happens only once.
## How was this patch tested?
Manually started standalone master/worker and verified PYTHONPATH has no
duplicate entries.
Author: Bryan Cutler <[email protected]>
Closes #15028 from BryanCutler/fix-duplicate-pythonpath-SPARK-17336.
(cherry picked from commit c76baff0cc4775c2191d075cc9a8176e4915fec8)
Signed-off-by: Sean Owen <[email protected]>
commit 30521522dbe65504876f0302030ef84945ad98b5
Author: Josh Rosen <[email protected]>
Date: 2016-09-12T04:51:22Z
[SPARK-17486] Remove unused TaskMetricsUIData.updatedBlockStatuses field
The `TaskMetricsUIData.updatedBlockStatuses` field is assigned to but never
read, increasing the memory consumption of the web UI. We should remove this
field.
Author: Josh Rosen <[email protected]>
Closes #15038 from
JoshRosen/remove-updated-block-statuses-from-TaskMetricsUIData.
(cherry picked from commit 72eec70bdbf6fb67c977463db5d8d95dd3040ae8)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 0a36e360cd4bb2c66687caf017fbeeece41a7ccd
Author: Sean Zhong <[email protected]>
Date: 2016-09-12T18:30:06Z
[SPARK-17503][CORE] Fix memory leak in Memory store when unable to cache
the whole RDD in memory
## What changes were proposed in this pull request?
MemoryStore may throw OutOfMemoryError when trying to cache a very large
RDD that cannot fit in memory.
```
scala> sc.parallelize(1 to 1000000000, 100).map(x => new Array[Long](1000)).cache().count()
java.lang.OutOfMemoryError: Java heap space
    at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:24)
    at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:23)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
    at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```
Spark MemoryStore uses SizeTrackingVector as a temporary unrolling buffer
to store all input values it has read so far before transferring them to the
storage memory cache. The problem is that when the input RDD is too big to
cache in memory, the temporary unrolling buffer (SizeTrackingVector) is not
garbage collected in time. Since the SizeTrackingVector can occupy all
available storage memory, it may cause the executor JVM to run out of memory
quickly.
More info can be found at https://issues.apache.org/jira/browse/SPARK-17503
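A minimal sketch of the idea behind the fix (illustrative; not the real PartiallyUnrolledIterator): drop the reference to the unrolling buffer as soon as it has been fully consumed, so it becomes eligible for garbage collection while the rest of the partition is still being iterated.
```
// Chains the partially unrolled values with the not-yet-read rest of the
// partition, releasing the buffer reference once it is exhausted.
class ReleasingIterator[T](unrolledBuffer: Vector[T], rest: Iterator[T]) extends Iterator[T] {
  private var unrolled: Iterator[T] = unrolledBuffer.iterator

  private def releaseIfDone(): Unit = {
    if (unrolled != null && !unrolled.hasNext) {
      unrolled = null  // the only live reference to the buffer goes away here
    }
  }

  override def hasNext: Boolean = {
    releaseIfDone()
    unrolled != null || rest.hasNext
  }

  override def next(): T = {
    releaseIfDone()
    if (unrolled != null) unrolled.next() else rest.next()
  }
}
```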
## How was this patch tested?
Unit test and manual test.
### Before change
Heap memory consumption
<img width="702" alt="screen shot 2016-09-12 at 4 16 15 pm"
src="https://cloud.githubusercontent.com/assets/2595532/18429524/60d73a26-7906-11e6-9768-6f286f5c58c8.png">
Heap dump
<img width="1402" alt="screen shot 2016-09-12 at 4 34 19 pm"
src="https://cloud.githubusercontent.com/assets/2595532/18429577/cbc1ef20-7906-11e6-847b-b5903f450b3b.png">
### After change
Heap memory consumption
<img width="706" alt="screen shot 2016-09-12 at 4 29 10 pm"
src="https://cloud.githubusercontent.com/assets/2595532/18429503/4abe9342-7906-11e6-844a-b2f815072624.png">
Author: Sean Zhong <[email protected]>
Closes #15056 from clockfly/memory_store_leak.
(cherry picked from commit 1742c3ab86d75ce3d352f7cddff65e62fb7c8dd4)
Signed-off-by: Josh Rosen <[email protected]>
commit 37f45bf0d95f463c38e5636690545472c0399222
Author: Josh Rosen <[email protected]>
Date: 2016-09-12T22:24:33Z
[SPARK-14818] Post-2.0 MiMa exclusion and build changes
This patch makes a handful of post-Spark-2.0 MiMa exclusion and build
updates. It should be merged to master and a subset of it should be picked into
branch-2.0 in order to test Spark 2.0.1-SNAPSHOT.
- Remove `sketch`, `mllibLocal`, and `streamingKafka010` from the list
of excluded subprojects so that MiMa checks them.
- Remove now-unnecessary special-case handling of the Kafka 0.8 artifact in
`mimaSettings`.
- Move the exclusion added in SPARK-14743 from `v20excludes` to
`v21excludes`, since that patch was only merged into master and not branch-2.0.
- Add exclusions for an API change introduced by SPARK-17096 / #14675.
- Add missing exclusions for the `o.a.spark.internal` and
`o.a.spark.sql.internal` packages.
Author: Josh Rosen <[email protected]>
Closes #15061 from JoshRosen/post-2.0-mima-changes.
(cherry picked from commit 7c51b99a428a965ff7d136e1cdda20305d260453)
Signed-off-by: Josh Rosen <[email protected]>
commit a3fc5762b896e6531a66802dbfe583c98eccc42b
Author: Josh Rosen <[email protected]>
Date: 2016-09-12T22:43:57Z
[SPARK-17485] Prevent failed remote reads of cached blocks from failing
entire job
## What changes were proposed in this pull request?
In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached
RDD block, then a remote copy, and only fall back to recomputing the block if
no cached copy (local or remote) can be read. This logic works correctly in the
case where no remote copies of the block exist, but if there _are_ remote
copies and reads of those copies fail (due to network issues or internal Spark
bugs) then the BlockManager will throw a `BlockFetchException` that will fail
the task (and which could possibly fail the whole job if the read failures keep
occurring).
In the cases of TorrentBroadcast and task result fetching we really do want
to fail the entire job in case no remote blocks can be fetched, but this logic
is inappropriate for reads of cached RDD blocks because those can/should be
recomputed in case cached blocks are unavailable.
Therefore, I think that the `BlockManager.getRemoteBytes()` method should
never throw on remote fetch errors and, instead, should handle failures by
returning `None`.
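A minimal sketch of the control flow this enables (illustrative names and stubs, not the BlockManager API):
```
import java.io.IOException

// Stubs standing in for a local lookup and a remote fetch that may fail.
def readLocal(blockId: String): Option[Array[Byte]] = None
def fetchRemote(blockId: String): Array[Byte] =
  throw new IOException(s"simulated remote read failure for $blockId")

// Report remote-read failures as None instead of throwing...
def getRemoteBytes(blockId: String): Option[Array[Byte]] =
  try Some(fetchRemote(blockId)) catch { case _: IOException => None }

// ...so the caller can fall back to recomputing the cached block.
def getOrCompute(blockId: String)(recompute: => Array[Byte]): Array[Byte] =
  readLocal(blockId).orElse(getRemoteBytes(blockId)).getOrElse(recompute)

// getOrCompute("rdd_2_3")(Array[Byte](1, 2, 3)) recomputes instead of failing the task.
```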
## How was this patch tested?
Block manager changes should be covered by modified tests in
`BlockManagerSuite`: the old tests expected exceptions to be thrown on failed
remote reads, while the modified tests now expect `None` to be returned from
the `getRemote*` method.
I also manually inspected all usages of `BlockManager.getRemoteValues()`,
`getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on
the result and handle `None`. Note that these `None` branches are already
exercised because the old `getRemoteBytes` returned `None` when no remote
locations for the block could be found (which could occur if an executor died
and its block manager de-registered with the master).
Author: Josh Rosen <[email protected]>
Closes #15037 from JoshRosen/SPARK-17485.
(cherry picked from commit f9c580f11098d95f098936a0b90fa21d71021205)
Signed-off-by: Josh Rosen <[email protected]>
commit 1f72e774bbfb1f57d79ef798c957cdcf1278409a
Author: Davies Liu <[email protected]>
Date: 2016-09-12T23:35:42Z
[SPARK-17474] [SQL] fix python udf in TakeOrderedAndProjectExec
## What changes were proposed in this pull request?
When there is a Python UDF in the Project between Sort and Limit, it is
collected into TakeOrderedAndProjectExec, and ExtractPythonUDFs fails to pull
the Python UDFs out because QueryPlan.expressions does not include the
expressions inside Option[Seq[Expression]].
Ideally, we should fix `QueryPlan.expressions`, but I tried that with no
luck (it always ran into an infinite loop). In this PR, I changed
TakeOrderedAndProjectExec to not use Option[Seq[Expression]] as a workaround.
cc JoshRosen
## How was this patch tested?
Added regression test.
Author: Davies Liu <[email protected]>
Closes #15030 from davies/all_expr.
(cherry picked from commit a91ab705e8c124aa116c3e5b1f3ba88ce832dcde)
Signed-off-by: Davies Liu <[email protected]>
commit b17f10ced34cbff8716610df370d19b130f93827
Author: Josh Rosen <[email protected]>
Date: 2016-09-13T10:54:03Z
[SPARK-17515] CollectLimit.execute() should perform per-partition limits
CollectLimit.execute() incorrectly omits per-partition limits, leading to
performance regressions when this code path is hit (which should not happen in
normal operation, but can occur in some cases; see #15068 for one example).
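A minimal sketch of what a per-partition limit buys (illustrative; not CollectLimitExec itself): cap each partition at n rows before collecting, then apply the global limit on the driver.
```
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def collectLimit[T: ClassTag](rdd: RDD[T], n: Int): Array[T] =
  rdd.mapPartitions(iter => iter.take(n))  // no partition ships more than n rows
     .collect()
     .take(n)                              // global limit applied on the driver
```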
Regression test in SQLQuerySuite that asserts the number of records scanned
from the input RDD.
Author: Josh Rosen <[email protected]>
Closes #15070 from JoshRosen/SPARK-17515.
(cherry picked from commit 3f6a2bb3f7beac4ce928eb660ee36258b5b9e8c8)
Signed-off-by: Herman van Hovell <[email protected]>
commit c1426452bb69f7eb2209d70449a5306fb29d875f
Author: Burak Yavuz <[email protected]>
Date: 2016-09-13T22:11:55Z
[SPARK-17531] Don't initialize Hive Listeners for the Execution Client
## What changes were proposed in this pull request?
If a user provides listeners inside the Hive Conf, the configuration for
these listeners is passed to the Hive Execution Client as well. This may cause
issues for two reasons:
1. The Execution Client will actually generate garbage
2. The listener class needs to be both in the Spark Classpath and Hive
Classpath
This PR empties the listener configurations in
`HiveUtils.newTemporaryConfiguration` so that the execution client will not
contain the listener confs, but the metadata client will.
## How was this patch tested?
Unit tests
Author: Burak Yavuz <[email protected]>
Closes #15086 from brkyvz/null-listeners.
(cherry picked from commit 72edc7e958271cedb01932880550cfc2c0631204)
Signed-off-by: Yin Huai <[email protected]>
commit 12ebfbeddf057efb666a7b6365c948c3fe479f2c
Author: Sami Jaktholm <[email protected]>
Date: 2016-09-14T08:38:30Z
[SPARK-17525][PYTHON] Remove SparkContext.clearFiles() from the PySpark API
as it was removed from the Scala API prior to Spark 2.0.0
## What changes were proposed in this pull request?
This pull request removes the SparkContext.clearFiles() method from the
PySpark API as the method was removed from the Scala API in
8ce645d4eeda203cf5e100c4bdba2d71edd44e6a. Using that method in PySpark leads to
an exception as PySpark tries to call the non-existent method on the JVM side.
## How was this patch tested?
Existing tests (though none of them tested this particular method).
Author: Sami Jaktholm <[email protected]>
Closes #15081 from sjakthol/pyspark-sc-clearfiles.
(cherry picked from commit b5bfcddbfbc2e79d3d0fbd43942716946e6c4ba3)
Signed-off-by: Sean Owen <[email protected]>
commit c6ea748a7e0baec222cbb4bd130673233adc5e0c
Author: Ergin Seyfe <[email protected]>
Date: 2016-09-14T08:51:14Z
[SPARK-17480][SQL] Improve performance by removing or caching List.length
which is O(n)
## What changes were proposed in this pull request?
Scala's List.length method is O(N) and it makes the
gatherCompressibilityStats function O(N^2). Eliminate the List.length calls by
writing it in Scala way.
https://github.com/scala/scala/blob/2.10.x/src/library/scala/collection/LinearSeqOptimized.scala#L36
As suggested. Extended the fix to HiveInspectors and AggregationIterator
classes as well.
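A minimal sketch of the kind of rewrite involved (illustrative; not the actual gatherCompressibilityStats code): indexing a List while re-reading its length on every iteration is quadratic, while a single traversal is linear.
```
// Quadratic on List: length is O(n) and apply(i) is O(i).
def sumSlow(xs: List[Int]): Long = {
  var total = 0L
  var i = 0
  while (i < xs.length) {   // O(n) length call per iteration
    total += xs(i)          // O(i) positional access
    i += 1
  }
  total
}

// Linear: one pass, no length checks or positional indexing.
def sumFast(xs: List[Int]): Long = {
  var total = 0L
  xs.foreach(total += _)
  total
}
```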
## How was this patch tested?
Profiled a Spark job and found that CompressibleColumnBuilder was using 39%
of the CPU; of that, CompressibleColumnBuilder->gatherCompressibilityStats was
using 23%, and 6.24% of the CPU was spent on List.length, which is called
inside gatherCompressibilityStats.
After this change we save that 6.24% of the CPU.
Author: Ergin Seyfe <[email protected]>
Closes #15032 from seyfe/gatherCompressibilityStats.
(cherry picked from commit 4cea9da2ae88b40a5503111f8f37051e2372163e)
Signed-off-by: Sean Owen <[email protected]>
commit 5493107d99977964cca1c15a2b0e084899e96dac
Author: Sean Owen <[email protected]>
Date: 2016-09-14T09:10:16Z
[SPARK-17445][DOCS] Reference an ASF page as the main place to find
third-party packages
## What changes were proposed in this pull request?
Point references to spark-packages.org to
https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
This will be accompanied by a parallel change to the spark-website repo,
and additional changes to this wiki.
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <[email protected]>
Closes #15075 from srowen/SPARK-17445.
(cherry picked from commit dc0a4c916151c795dc41b5714e9d23b4937f4636)
Signed-off-by: Sean Owen <[email protected]>
commit 6fe5972e649e171e994c30bff3da0c408a3d7f3a
Author: Josh Rosen <[email protected]>
Date: 2016-09-14T17:10:01Z
[SPARK-17514] df.take(1) and df.limit(1).collect() should perform the same
in Python
## What changes were proposed in this pull request?
In PySpark, `df.take(1)` runs a single-stage job which computes only one
partition of the DataFrame, while `df.limit(1).collect()` computes all
partitions and runs a two-stage job. This difference in performance is
confusing.
The reason why `limit(1).collect()` is so much slower is that `collect()`
internally maps to `df.rdd.<some-pyspark-conversions>.toLocalIterator`, which
causes Spark SQL to build a query where a global limit appears in the middle of
the plan; this, in turn, ends up being executed inefficiently because limits in
the middle of plans are now implemented by repartitioning to a single task
rather than by running a `take()` job on the driver (this was done in #7334, a
patch which was a prerequisite to allowing partition-local limits to be pushed
beneath unions, etc.).
In order to fix this performance problem I think that we should generalize
the fix from SPARK-10731 / #8876 so that `DataFrame.collect()` also delegates
to the Scala implementation and shares the same performance properties. This
patch modifies `DataFrame.collect()` to first collect all results to the driver
and then pass them to Python, allowing this query to be planned using Spark's
`CollectLimit` optimizations.
## How was this patch tested?
Added a regression test in `sql/tests.py` which asserts that the expected
number of jobs, stages, and tasks are run for both queries.
Author: Josh Rosen <[email protected]>
Closes #15068 from JoshRosen/pyspark-collect-limit.
(cherry picked from commit 6d06ff6f7e2dd72ba8fe96cd875e83eda6ebb2a9)
Signed-off-by: Davies Liu <[email protected]>
commit fab77dadf70d011cec8976acfe8c851816f82426
Author: Kishor Patil <[email protected]>
Date: 2016-09-14T19:19:35Z
[SPARK-17511] Yarn Dynamic Allocation: Avoid marking released container as
Failed
Due to race conditions, the `assert(numExecutorsRunning <=
targetNumExecutors)` can fail, causing an `AssertionError`. So the assertion
was removed and the conditional check was instead moved before launching a new
container:
```
java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:156)
    at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1.org$apache$spark$deploy$yarn$YarnAllocator$$anonfun$$updateInternalState$1(YarnAllocator.scala:489)
    at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:519)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```
This was manually tested using a large ForkAndJoin job with Dynamic
Allocation enabled to validate the failing job succeeds, without any such
exception.
Author: Kishor Patil <[email protected]>
Closes #15069 from kishorvpatil/SPARK-17511.
(cherry picked from commit ff6e4cbdc80e2ad84c5d70ee07f323fad9374e3e)
Signed-off-by: Tom Graves <[email protected]>
commit fffcec90b65047c3031c2b96679401f8fbef6337
Author: Shixiong Zhu <[email protected]>
Date: 2016-09-14T20:33:51Z
[SPARK-17463][CORE] Make CollectionAccumulator and SetAccumulator's value
can be read thread-safely
## What changes were proposed in this pull request?
Make the values of CollectionAccumulator and SetAccumulator readable in a
thread-safe way, to fix the ConcurrentModificationException reported in
[JIRA](https://issues.apache.org/jira/browse/SPARK-17463).
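A minimal sketch of the thread-safety idea (illustrative; not the actual accumulator sources): hand readers a copy made under the collection's lock, so value can be read while tasks are still adding.
```
import java.util.{ArrayList => JArrayList, Collections, List => JList}

class SafeCollectionAccumulator[T] {
  private val buffer: JList[T] = Collections.synchronizedList(new JArrayList[T]())

  def add(v: T): Unit = buffer.add(v)

  // Copy while holding the list's monitor, so the snapshot never observes a
  // concurrent add mid-iteration (no ConcurrentModificationException).
  def value: JList[T] = buffer.synchronized {
    new JArrayList[T](buffer)
  }
}
```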
## How was this patch tested?
Existing tests.
Author: Shixiong Zhu <[email protected]>
Closes #15063 from zsxwing/SPARK-17463.
(cherry picked from commit e33bfaed3b160fbc617c878067af17477a0044f5)
Signed-off-by: Josh Rosen <[email protected]>
commit bb2bdb44032d2e71832b3e0e771590fb2225e4f3
Author: Xing SHI <[email protected]>
Date: 2016-09-14T20:46:46Z
[SPARK-17465][SPARK CORE] Inappropriate memory management in
`org.apache.spark.storage.MemoryStore` may lead to memory leak
The expression `if (memoryMap(taskAttemptId) == 0)
memoryMap.remove(taskAttemptId)` in the methods `releaseUnrollMemoryForThisTask`
and `releasePendingUnrollMemoryForThisTask` should be evaluated after the
release-memory operation, whether `memoryToRelease` is > 0 or not.
Otherwise, if a task's memory has already been set to 0 when
`releaseUnrollMemoryForThisTask` or `releasePendingUnrollMemoryForThisTask`
is called, the key corresponding to that task is never removed from the
memory map.
See the details in
[SPARK-17465](https://issues.apache.org/jira/browse/SPARK-17465).
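A minimal sketch of the corrected ordering (illustrative names; not the MemoryStore source): subtract whatever can be released first, then drop the task's entry whenever its balance is zero, even if nothing was released on this call.
```
import scala.collection.mutable

val memoryMap = mutable.HashMap[Long, Long]().withDefaultValue(0L)

def releaseUnrollMemory(taskAttemptId: Long, requested: Long): Unit = {
  val memoryToRelease = math.min(requested, memoryMap(taskAttemptId))
  if (memoryToRelease > 0) {
    memoryMap(taskAttemptId) -= memoryToRelease
  }
  // Cleanup is unconditional, so a task whose balance is already 0 still has
  // its key removed and the map cannot grow without bound.
  if (memoryMap(taskAttemptId) == 0) {
    memoryMap.remove(taskAttemptId)
  }
}
```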
Author: Xing SHI <[email protected]>
Closes #15022 from saturday-shi/SPARK-17465.
----