GitHub user curriedegg opened a pull request:
https://github.com/apache/spark/pull/10879
Branch 1.6
Will try to improve Parquet performance on S3 with DirectOutputCommitter and
Append Mode
Current state: the driver copies objects from the staging directory, then
deletes them.
Desired: distribute the object paths and issue the copies from the executors.
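The desired flow could be sketched roughly as below. Everything here is illustrative: `destKey`, `s3ClientFor`, and the bucket/prefix names are assumptions for the sketch, not the committer's real API; the Spark-side distribution is shown in comments.

```scala
// Hypothetical helper: map a staged object key to its final key.
// `stagingPrefix`/`finalPrefix` are assumed layout conventions, not Spark's.
def destKey(stagingKey: String, stagingPrefix: String, finalPrefix: String): String =
  finalPrefix + stagingKey.stripPrefix(stagingPrefix)

// Sketch of the desired distribution (assumes an active SparkContext `sc`
// and a hypothetical per-executor S3 client factory `s3ClientFor`):
//
//   sc.parallelize(stagingKeys, numSlices = 64).foreachPartition { keys =>
//     val s3 = s3ClientFor()
//     keys.foreach { key =>
//       s3.copyObject(bucket, key, bucket, destKey(key, "staging/", "final/"))
//       s3.deleteObject(bucket, key)
//     }
//   }
```

This moves the S3 copy/delete round-trips off the single driver and spreads them across the cluster.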
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-1.6
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10879.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10879
----
commit 843a31afbdeea66449750f0ba8f676ef31d00726
Author: Cheng Lian <[email protected]>
Date: 2015-12-01T18:21:31Z
[SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues
This PR backports PR #10039 to master
Author: Cheng Lian <[email protected]>
Closes #10063 from liancheng/spark-12046.doc-fix.master.
(cherry picked from commit 69dbe6b40df35d488d4ee343098ac70d00bbdafb)
Signed-off-by: Yin Huai <[email protected]>
commit 99dc1335e2f635a067f9fa1e83a35bf9593bfc24
Author: woj-i <[email protected]>
Date: 2015-12-01T19:05:45Z
[SPARK-11821] Propagate Kerberos keytab for all environments
andrewor14: the same PR as in branch 1.5. cc harishreedharan
Author: woj-i <[email protected]>
Closes #9859 from woj-i/master.
(cherry picked from commit 6a8cf80cc8ef435ec46138fa57325bda5d68f3ce)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit ab2a124c8eca6823ee016c9ecfbdbf4918fbcdd6
Author: Josh Rosen <[email protected]>
Date: 2015-12-01T19:49:20Z
[SPARK-12065] Upgrade Tachyon from 0.8.1 to 0.8.2
This commit upgrades the Tachyon dependency from 0.8.1 to 0.8.2.
Author: Josh Rosen <[email protected]>
Closes #10054 from JoshRosen/upgrade-to-tachyon-0.8.2.
(cherry picked from commit 34e7093c1131162b3aa05b65a19a633a0b5b633e)
Signed-off-by: Josh Rosen <[email protected]>
commit 1cf9d3858c8a3a5796b64a9fbea22509f02d778a
Author: Nong Li <[email protected]>
Date: 2015-12-01T20:59:53Z
[SPARK-12030] Fix Platform.copyMemory to handle overlapping regions.
This bug was exposed as memory corruption in Timsort, which uses copyMemory to
copy large regions that can overlap. The prior implementation did not handle
the overlapping case and always copied forward, so about half the time (when
the destination started inside the source region) the data ended up corrupt.
Author: Nong Li <[email protected]>
Closes #10068 from nongli/spark-12030.
(cherry picked from commit 2cef1cdfbb5393270ae83179b6a4e50c3cbf9e93)
Signed-off-by: Yin Huai <[email protected]>
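The core of the fix can be illustrated on plain byte arrays (an analogue of the idea, not the actual `Platform.copyMemory` code, which operates on `sun.misc.Unsafe` addresses): when the destination starts inside the source region, copy backward so source bytes are not clobbered before they are read.

```scala
def overlapSafeCopy(src: Array[Byte], srcOff: Int,
                    dst: Array[Byte], dstOff: Int, len: Int): Unit = {
  if ((src ne dst) || dstOff <= srcOff || dstOff >= srcOff + len) {
    // No dangerous overlap: a forward copy is fine.
    var i = 0
    while (i < len) { dst(dstOff + i) = src(srcOff + i); i += 1 }
  } else {
    // dst starts inside [srcOff, srcOff + len): copy backward so each source
    // byte is read before the copy overwrites it.
    var i = len - 1
    while (i >= 0) { dst(dstOff + i) = src(srcOff + i); i -= 1 }
  }
}
```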
commit 81db8d086bbfe72caa0c45a395ebcaed80b5c237
Author: Tathagata Das <[email protected]>
Date: 2015-12-01T22:08:36Z
[SPARK-12004] Preserve the RDD partitioner through RDD checkpointing
The solution is to save the RDD partitioner in a separate file in the RDD
checkpoint directory, i.e. `<checkpoint dir>/_partitioner`. In most cases,
whether the RDD partitioner was recovered or not does not affect correctness,
only performance. So this solution makes a best-effort attempt to save and
recover the partitioner; if either step fails, the checkpointing is not
affected. This makes the patch safe and backward compatible.
Author: Tathagata Das <[email protected]>
Closes #9983 from tdas/SPARK-12004.
(cherry picked from commit 60b541ee1b97c9e5e84aa2af2ce856f316ad22b3)
Signed-off-by: Andrew Or <[email protected]>
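The best-effort behavior described above reduces to a small wrapper (an illustrative sketch, not the patch's actual helper): attempt the partitioner write or read, and swallow any non-fatal failure so the checkpoint itself is unaffected.

```scala
import scala.util.control.NonFatal

// Run an action best-effort: Some(result) on success, None on any non-fatal
// failure, never propagating the error to the caller.
def bestEffort[T](action: => T): Option[T] =
  try Some(action) catch { case NonFatal(_) => None }

// e.g. bestEffort(writePartitionerFile(dir + "/_partitioner", partitioner)),
// where writePartitionerFile is an assumed helper; None simply means the
// partitioner will not be recovered, which only costs performance.
```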
commit 21909b8ac0068658cc833f324c0f1f418c200d61
Author: Shixiong Zhu <[email protected]>
Date: 2015-12-01T23:16:07Z
Revert "[SPARK-12060][CORE] Avoid memory copy in
JavaSerializerInstance.serialize"
This reverts commit 9b99b2b46c452ba396e922db5fc7eec02c45b158.
commit 5647774b07593514f4ed4c29a038cfb1b69c9ba1
Author: Xusen Yin <[email protected]>
Date: 2015-12-01T23:21:53Z
[SPARK-11961][DOC] Add docs of ChiSqSelector
https://issues.apache.org/jira/browse/SPARK-11961
Author: Xusen Yin <[email protected]>
Closes #9965 from yinxusen/SPARK-11961.
(cherry picked from commit e76431f886ae8061545b3216e8e2fb38c4db1f43)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit 012de2ce5de01bc57197fa26334fc175c8f20233
Author: jerryshao <[email protected]>
Date: 2015-12-01T23:26:10Z
[SPARK-12002][STREAMING][PYSPARK] Fix python direct stream checkpoint
recovery issue
Fixed a minor race condition in #10017
Closes #10017
Author: jerryshao <[email protected]>
Author: Shixiong Zhu <[email protected]>
Closes #10074 from zsxwing/review-pr10017.
(cherry picked from commit f292018f8e57779debc04998456ec875f628133b)
Signed-off-by: Shixiong Zhu <[email protected]>
commit d77bf0bd922835b6a63bb1eeedf91e2a92d92ca9
Author: Josh Rosen <[email protected]>
Date: 2015-12-01T23:29:45Z
[SPARK-12075][SQL] Speed up HiveComparisionTest by avoiding / speeding up
TestHive.reset()
When profiling HiveCompatibilitySuite, I noticed that most of the time
seems to be spent in expensive `TestHive.reset()` calls. This patch speeds up
suites based on HiveComparisionTest, such as HiveCompatibilitySuite, with the
following changes:
- Avoid `TestHive.reset()` whenever possible:
- Use a simple set of heuristics to guess whether we need to call
`reset()` in between tests.
- As a safety-net, automatically re-run failed tests by calling `reset()`
before the re-attempt.
- Speed up the expensive parts of `TestHive.reset()`: loading the `src` and
`srcpart` tables took roughly 600ms per test, so we now avoid this with a
simple heuristic that only loads those tables for tests that reference them.
This is based on simple string matching over the test queries and errs on the
side of loading tables in more situations than strictly necessary.
After these changes, HiveCompatibilitySuite seems to run in about 10
minutes.
This PR is a revival of #6663, an earlier experimental PR from June, where
I played around with several possible speedups for this suite.
Author: Josh Rosen <[email protected]>
Closes #10055 from JoshRosen/speculative-testhive-reset.
(cherry picked from commit ef6790fdc3b70b9d6184ec2b3d926e4b0e4b15f6)
Signed-off-by: Reynold Xin <[email protected]>
commit f1122dd2bdc4c522a902b37bd34b46f785c21ecf
Author: Nong Li <[email protected]>
Date: 2015-12-01T23:30:21Z
[SPARK-11328][SQL] Improve error message when hitting this issue
The issue is that the output committer is not idempotent, so retry attempts
fail because the output file already exists. It is not safe to clean up the
file, as this output committer is by design not retryable. Currently, the job
fails with a confusing file-exists error. This patch is a stop-gap to tell the
user to look at the top of the error log for the proper message.
This is difficult to test locally, as Spark is hardcoded not to retry.
Manually verified by upping the retry attempts.
Author: Nong Li <[email protected]>
Author: Nong Li <[email protected]>
Closes #10080 from nongli/spark-11328.
(cherry picked from commit 47a0abc343550c855e679de12983f43e6fcc0171)
Signed-off-by: Yin Huai <[email protected]>
commit 1135430a00dbe6516097dd3bc868ae865e8e644d
Author: Huaxin Gao <[email protected]>
Date: 2015-12-01T23:32:57Z
[SPARK-11788][SQL] surround timestamp/date value with quotes in JDBC data
source
When querying a Timestamp or Date column like the following
val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg &&
$"TIMESTAMP_COLUMN" < end)
the generated SQL query is "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0".
It should have quotes around the Timestamp/Date value, such as
"TIMESTAMP_COLUMN >= '2015-01-01 00:00:00.0'".
Author: Huaxin Gao <[email protected]>
Closes #9872 from huaxingao/spark-11788.
(cherry picked from commit 5a8b5fdd6ffa58f015cdadf3f2c6df78e0a388ad)
Signed-off-by: Yin Huai <[email protected]>
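The fix can be sketched as a value-compilation step (loosely modeled on the JDBC data source's filter compilation; `compileValue` here is a simplified stand-in, not the actual Spark helper): string and temporal values get single quotes, everything else is rendered as-is.

```scala
import java.sql.{Date, Timestamp}

def compileValue(value: Any): String = value match {
  case s: String    => s"'${s.replace("'", "''")}'" // escape embedded quotes
  case t: Timestamp => s"'$t'"                      // e.g. '2015-01-01 00:00:00.0'
  case d: Date      => s"'$d'"                      // e.g. '2015-01-01'
  case other        => other.toString               // numbers etc. stay bare
}
```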
commit 14eadf921132219f5597d689ac1ffd6e938a939a
Author: Yin Huai <[email protected]>
Date: 2015-12-02T00:24:04Z
[SPARK-11352][SQL] Escape */ in the generated comments.
https://issues.apache.org/jira/browse/SPARK-11352
Author: Yin Huai <[email protected]>
Closes #10072 from yhuai/SPARK-11352.
(cherry picked from commit 5872a9d89fe2720c2bcb1fc7494136947a72581c)
Signed-off-by: Yin Huai <[email protected]>
commit 1b3db967e05a628897b7162aa605b2e4650a0d58
Author: Yin Huai <[email protected]>
Date: 2015-12-02T01:18:45Z
[SPARK-11596][SQL] In TreeNode's argString, if a TreeNode is not a child of
the current TreeNode, we should only return the simpleString.
In TreeNode's argString, if a TreeNode is not a child of the current
TreeNode, we will only return the simpleString.
I tested the [following case provided by
Cristian](https://issues.apache.org/jira/browse/SPARK-11596?focusedCommentId=15019241&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15019241).
```
val c = (1 to 20).foldLeft[Option[DataFrame]] (None) { (curr, idx) =>
println(s"PROCESSING >>>>>>>>>>> $idx")
val df = sqlContext.sparkContext.parallelize((0 to
10).zipWithIndex).toDF("A", "B")
val union = curr.map(_.unionAll(df)).getOrElse(df)
union.cache()
Some(union)
}
c.get.explain(true)
```
Without the change, `c.get.explain(true)` took 100s. With the change,
`c.get.explain(true)` took 26ms.
https://issues.apache.org/jira/browse/SPARK-11596
Author: Yin Huai <[email protected]>
Closes #10079 from yhuai/SPARK-11596.
(cherry picked from commit e96a70d5ab2e2b43a2df17a550fa9ed2ee0001c4)
Signed-off-by: Michael Armbrust <[email protected]>
commit 72da2a21f0940b97757ace5975535e559d627688
Author: Andrew Or <[email protected]>
Date: 2015-12-02T03:36:34Z
[SPARK-8414] Ensure context cleaner periodic cleanups
Garbage collection triggers cleanups. If the driver JVM is huge and there
is little memory pressure, we may never clean up shuffle files on executors.
This is a problem for long-running applications (e.g. streaming).
Author: Andrew Or <[email protected]>
Closes #10070 from andrewor14/periodic-gc.
(cherry picked from commit 1ce4adf55b535518c2e63917a827fac1f2df4e8e)
Signed-off-by: Josh Rosen <[email protected]>
commit 84c44b500b5c90dffbe1a6b0aa86f01699b09b96
Author: Andrew Or <[email protected]>
Date: 2015-12-02T03:51:12Z
[SPARK-12081] Make unified memory manager work with small heaps
The existing `spark.memory.fraction` (default 0.75) gives the system 25% of
the space to work with. For small heaps, this is not enough: e.g. default 1GB
leaves only 250MB system memory. This is especially a problem in local mode,
where the driver and executor are crammed in the same JVM. Members of the
community have reported driver OOMs in such cases.
**New proposal.** We now reserve 300MB before taking the 75%. For 1GB JVMs,
this leaves `(1024 - 300) * 0.75 = 543MB` for execution and storage. This is
proposal (1) listed in the
[JIRA](https://issues.apache.org/jira/browse/SPARK-12081).
Author: Andrew Or <[email protected]>
Closes #10081 from andrewor14/unified-memory-small-heaps.
(cherry picked from commit d96f8c997b9bb5c3d61f513d2c71d67ccf8e85d6)
Signed-off-by: Andrew Or <[email protected]>
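The proposal's arithmetic, written out (the constants mirror the values quoted above; the function name is just for illustration):

```scala
val reservedMb = 300        // reserved before the fraction is applied
val memoryFraction = 0.75   // spark.memory.fraction default

// Usable execution + storage memory for a given heap, per proposal (1).
def usableMb(heapMb: Int): Int = ((heapMb - reservedMb) * memoryFraction).toInt

// usableMb(1024) gives (1024 - 300) * 0.75 = 543 MB
```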
commit a5743affcf73f7bf71517171583cbddc44cc9368
Author: Davies Liu <[email protected]>
Date: 2015-12-02T04:17:12Z
[SPARK-12077][SQL] change the default plan for single distinct
This tries to match the behavior of single distinct aggregation in Spark 1.5,
but that approach is not scalable. We should be robust by default, with a flag
to address the performance regression for low-cardinality aggregation.
cc yhuai nongli
Author: Davies Liu <[email protected]>
Closes #10075 from davies/agg_15.
(cherry picked from commit 96691feae0229fd693c29475620be2c4059dd080)
Signed-off-by: Yin Huai <[email protected]>
commit 1f42295b5df69a6039ed2ba8ea67a8e57d77644d
Author: Tathagata Das <[email protected]>
Date: 2015-12-02T05:04:52Z
[SPARK-12087][STREAMING] Create new JobConf for every batch in
saveAsHadoopFiles
The JobConf object created in `DStream.saveAsHadoopFiles` is used
concurrently in multiple places:
* The JobConf is updated by `RDD.saveAsHadoopFile()` before the job is
launched
* The JobConf is serialized as part of the DStream checkpoints.
These concurrent accesses (updating in one thread while another thread is
serializing it) can lead to a ConcurrentModificationException in the
underlying Java HashMap used in the internal Hadoop Configuration object.
The solution is to create a new JobConf in every batch, that is updated by
`RDD.saveAsHadoopFile()`, while the checkpointing serializes the original
JobConf.
Tests to be added in #9988 will fail reliably without this patch. Keeping
this patch really small to make sure that it can be added to previous branches.
Author: Tathagata Das <[email protected]>
Closes #10088 from tdas/SPARK-12087.
(cherry picked from commit 8a75a3049539eeef04c0db51736e97070c162b46)
Signed-off-by: Shixiong Zhu <[email protected]>
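The defensive pattern in miniature (an illustrative stand-in, since `JobConf` itself needs Hadoop on the classpath): each batch gets its own fresh copy of the configuration, so the checkpoint serializer and the save job never share one mutable object.

```scala
import scala.collection.mutable

// A toy stand-in for JobConf: a mutable key/value configuration.
final class Conf(init: Map[String, String]) {
  val entries: mutable.Map[String, String] = mutable.Map(init.toSeq: _*)
  // A fresh, independent instance, as the patch creates once per batch.
  def freshCopy: Conf = new Conf(entries.toMap)
}
```

Mutating the per-batch copy leaves the original (the one being checkpointed) untouched, which is exactly the race the patch removes.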
commit 3c4938e26185dc0637f3af624830dbff11589997
Author: Liang-Chi Hsieh <[email protected]>
Date: 2015-12-02T05:51:33Z
[SPARK-11949][SQL] Check bitmasks to set nullable property
Following up #10038.
We can use bitmasks to determine which grouping expressions need to be set
as nullable.
cc yhuai
Author: Liang-Chi Hsieh <[email protected]>
Closes #10067 from viirya/fix-cube-following.
(cherry picked from commit 0f37d1d7ed7f6e34f98f2a3c274918de29e7a1d7)
Signed-off-by: Yin Huai <[email protected]>
commit c47a7373a88f49c77b8d65e887cac2ef1ae22eae
Author: Davies Liu <[email protected]>
Date: 2015-12-02T06:41:48Z
[SPARK-12090] [PYSPARK] consider shuffle in coalesce()
Author: Davies Liu <[email protected]>
Closes #10090 from davies/fix_coalesce.
(cherry picked from commit 4375eb3f48fc7ae90caf6c21a0d3ab0b66bf4efa)
Signed-off-by: Davies Liu <[email protected]>
commit d79dd971d01b69f8065b802fb5a78023ca905c7c
Author: Jeroen Schot <[email protected]>
Date: 2015-12-02T09:40:07Z
[SPARK-3580][CORE] Add Consistent Method To Get Number of RDD Partitions
Across Different Languages
I have tried to address all the comments in pull request
https://github.com/apache/spark/pull/2447.
Note that the second commit (using the new method in all internal code of
all components) is quite intrusive and could be omitted.
Author: Jeroen Schot <[email protected]>
Closes #9767 from schot/master.
(cherry picked from commit 128c29035b4e7383cc3a9a6c7a9ab6136205ac6c)
Signed-off-by: Sean Owen <[email protected]>
commit f449a407f6f152c676524d4348bbe34d4d3fbfca
Author: Cheng Lian <[email protected]>
Date: 2015-12-02T17:36:12Z
[SPARK-12094][SQL] Prettier tree string for TreeNode
When examining plans of complex queries with multiple joins, a pain point of
mine is that it's hard to immediately see the sibling node of a specific query
plan node. This PR adds tree lines to the tree string of a `TreeNode`, so that
the result is visually more intuitive.
Author: Cheng Lian <[email protected]>
Closes #10099 from liancheng/prettier-tree-string.
(cherry picked from commit a1542ce2f33ad365ff437d2d3014b9de2f6670e5)
Signed-off-by: Yin Huai <[email protected]>
commit bf525845cef159d2d4c9f4d64e158f037179b5c4
Author: Patrick Wendell <[email protected]>
Date: 2015-12-02T17:54:10Z
Preparing Spark release v1.6.0-rc1
commit 5d915fed300b47a51b7614d28bd8ea7795b4e841
Author: Patrick Wendell <[email protected]>
Date: 2015-12-02T17:54:15Z
Preparing development version 1.6.0-SNAPSHOT
commit 911259e9af6f9a81e775b1aa6d82fa44956bf993
Author: Yu ISHIKAWA <[email protected]>
Date: 2015-12-02T22:15:54Z
[SPARK-10266][DOCUMENTATION, ML] Fixed @Since annotation for ml.tuning
cc mengxr noel-smith
I worked on this issue based on https://github.com/apache/spark/pull/8729.
ehsanmok thank you for your contribution!
Author: Yu ISHIKAWA <[email protected]>
Author: Ehsan M.Kermani <[email protected]>
Closes #9338 from yu-iskw/JIRA-10266.
(cherry picked from commit de07d06abecf3516c95d099b6c01a86e0c8cfd8c)
Signed-off-by: Xiangrui Meng <[email protected]>
commit cb142fd1e6d98b140de3813775c5a58ea624b1d4
Author: Yadong Qi <[email protected]>
Date: 2015-12-03T00:48:49Z
[SPARK-12093][SQL] Fix the error of comment in DDLParser
Author: Yadong Qi <[email protected]>
Closes #10096 from watermen/patch-1.
(cherry picked from commit d0d7ec533062151269b300ed455cf150a69098c0)
Signed-off-by: Reynold Xin <[email protected]>
commit 656d44e2021d2f637d724c1d71ecdca1f447a4be
Author: Xiangrui Meng <[email protected]>
Date: 2015-12-03T01:19:31Z
[SPARK-12000] do not specify arg types when reference a method in ScalaDoc
This fixes SPARK-12000, verified on my local machine with JDK 7. It seems that
`scaladoc` tries to match method names and gets confused by the annotations.
cc: JoshRosen jkbradley
Author: Xiangrui Meng <[email protected]>
Closes #10114 from mengxr/SPARK-12000.2.
(cherry picked from commit 9bb695b7a82d837e2c7a724514ea6b203efb5364)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 6914ee9f0a063b880a0329365f465dcbe96e1adb
Author: Josh Rosen <[email protected]>
Date: 2015-12-03T03:12:02Z
[SPARK-12082][FLAKY-TEST] Increase timeouts in
NettyBlockTransferSecuritySuite
We should try increasing a timeout in NettyBlockTransferSecuritySuite in
order to reduce that suite's flakiness in Jenkins.
Author: Josh Rosen <[email protected]>
Closes #10113 from JoshRosen/SPARK-12082.
(cherry picked from commit ae402533738be06ac802914ed3e48f0d5fa54cbe)
Signed-off-by: Reynold Xin <[email protected]>
commit 6674fd8aa9b04966bd7d19650754805cd241e399
Author: Yin Huai <[email protected]>
Date: 2015-12-03T03:21:24Z
[SPARK-12109][SQL] Expressions's simpleString should delegate to its
toString.
https://issues.apache.org/jira/browse/SPARK-12109
The change of https://issues.apache.org/jira/browse/SPARK-11596 exposed the
problem.
In the SQL plan viz, the filter shows the problem (screenshot omitted).
After the changes in this PR, the viz is back to normal (screenshot omitted).
Author: Yin Huai <[email protected]>
Closes #10111 from yhuai/SPARK-12109.
(cherry picked from commit ec2b6c26c9b6bd59d29b5d7af2742aca7e6e0b07)
Signed-off-by: Reynold Xin <[email protected]>
commit 5826096ac1377c8fad4c2cabefee2f340008e828
Author: Huaxin Gao <[email protected]>
Date: 2015-12-03T08:42:21Z
[SPARK-12088][SQL] check connection.isClosed before calling connection…
In the Java spec, java.sql.Connection has
boolean getAutoCommit() throws SQLException
Throws:
SQLException - if a database access error occurs or this method is called on a
closed connection
So if conn.getAutoCommit is called on a closed connection, a SQLException will
be thrown. Even though the code catches the SQLException and the program can
continue, I think we should check conn.isClosed before calling
conn.getAutoCommit to avoid the unnecessary SQLException.
Author: Huaxin Gao <[email protected]>
Closes #10095 from huaxingao/spark-12088.
(cherry picked from commit 5349851f368a1b5dab8a99c0d51c9638ce7aec56)
Signed-off-by: Sean Owen <[email protected]>
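The guard can be sketched as below. `ConnLike` is a stand-in trait for the two `java.sql.Connection` methods involved (the real code calls the JDBC interface directly); since `&&` short-circuits, `getAutoCommit` is never reached on a closed connection.

```scala
// Minimal stand-in for the slice of java.sql.Connection we need.
trait ConnLike {
  def isClosed: Boolean
  def getAutoCommit: Boolean
}

// Only consult autoCommit on a live connection; a closed one yields false
// instead of the SQLException documented for getAutoCommit.
def safeAutoCommit(conn: ConnLike): Boolean =
  !conn.isClosed && conn.getAutoCommit
```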
commit 93b69ec45124611107a930af4c9b6413e7b3da62
Author: Jeff Zhang <[email protected]>
Date: 2015-12-03T15:36:28Z
[DOCUMENTATION][MLLIB] typo in mllib doc
cc mengxr
Author: Jeff Zhang <[email protected]>
Closes #10093 from zjffdu/mllib_typo.
(cherry picked from commit 7470d9edbb0a45e714c96b5d55eff30724c0653a)
Signed-off-by: Sean Owen <[email protected]>
----