GitHub user 272029252 opened a pull request:
https://github.com/apache/spark/pull/8738
[STREAMING] There is a dependency conflict
When I use Spark Streaming, there is a dependency conflict on curator-client:
```
[INFO] \- org.apache.spark:spark-core_2.10:jar:1.5.0:compile
[INFO]    +- org.apache.curator:curator-recipes:jar:2.4.0:compile
[INFO]    |  \- org.apache.curator:curator-framework:jar:2.4.0:compile
[INFO]    |     \- (org.apache.curator:curator-client:jar:2.4.0:compile - omitted for conflict with 2.1.0-incubating)
[INFO]    \- org.tachyonproject:tachyon-client:jar:0.7.1:compile
[INFO]       \- org.apache.curator:curator-client:jar:2.1.0-incubating:compile
```
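For anyone hitting this locally before a fix lands, one way to force a single curator-client version is a dependency override; below is a minimal sketch for an sbt build (Maven users would add an `<exclusion>` instead). Pinning 2.4.0 is an assumption here, not the PR's change:
```scala
// build.sbt -- sketch only; the pinned version (2.4.0) is an assumption
dependencyOverrides += "org.apache.curator" % "curator-client" % "2.4.0"
```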
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-1.5
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/8738.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #8738
----
commit 5bbb2d327d3fbcdb0d631ad3b8d960dfada64f9a
Author: Wenchen Fan <[email protected]>
Date: 2015-08-14T21:09:46Z
[SPARK-8670] [SQL] Nested columns can't be referenced in pyspark
This bug is caused by a wrong column-existence check in `__getitem__` of the
PySpark DataFrame. `DataFrame.apply` accepts not only top-level column names
but also nested column names like `a.b`, so we should remove that check from
`__getitem__`.
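For illustration, the same nested-column access in the Scala API, which already worked and which the fix mirrors for `df["a.b"]` in PySpark (the JSON file and its struct column `a` are hypothetical):
```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)            // assumes an existing SparkContext `sc`
val df = sqlContext.read.json("people.json")   // hypothetical file with a struct column `a`
df.select(df("a.b")).show()                    // DataFrame.apply resolves nested names too
```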
Author: Wenchen Fan <[email protected]>
Closes #8202 from cloud-fan/nested.
commit 612b4609bdd38763725ae07d77c2176aa6756e64
Author: Tathagata Das <[email protected]>
Date: 2015-08-14T22:10:01Z
[SPARK-9966] [STREAMING] Handle couple of corner cases in PIDRateEstimator
1. The rate estimator should not estimate any rate when there are no
records in the batch, as there is no data to estimate the rate from. In the
current state, it estimates and sets the rate to zero, which is incorrect.
2. The rate estimator should never set the rate to zero under any
circumstances. Otherwise the system will stop receiving data and stop
generating useful estimates (see reason 1). So the fix is to define a
parameter that sets a lower bound on the estimated rate, so that the system
always receives some data.
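A simplified sketch of the two guards described above (this is not the actual PIDRateEstimator code; `minRate` stands in for the new lower-bound parameter):
```scala
// returns None when there is nothing to estimate, and clamps otherwise
def estimate(numElements: Long, newRate: Double, minRate: Double): Option[Double] =
  if (numElements == 0) None                // fix 1: no records, no estimate
  else Some(math.max(newRate, minRate))     // fix 2: never drop to zero
```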
Author: Tathagata Das <[email protected]>
Closes #8199 from tdas/SPARK-9966 and squashes the following commits:
829f793 [Tathagata Das] Fixed unit test and added comments
3a994db [Tathagata Das] Added min rate and updated tests in PIDRateEstimator
(cherry picked from commit f3bfb711c1742d0915e43bda8230b4d1d22b4190)
Signed-off-by: Tathagata Das <[email protected]>
commit 8d26247903a1b594df6e202f0834ed165f47bbdc
Author: Tathagata Das <[email protected]>
Date: 2015-08-14T22:54:14Z
[SPARK-9968] [STREAMING] Reduced time spent within synchronized block to
prevent lock starvation
When the rate limiter is actually limiting the rate at which data is
inserted into the buffer, the synchronized block of BlockGenerator.addData
stays blocked for a long time. This causes the thread that switches the buffer
and generates blocks (synchronized with addData) to starve and not generate
blocks for seconds. The correct solution is to not block on the rate limiter
within the synchronized block while adding data to the buffer.
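The general shape of the fix, as a sketch (not the actual BlockGenerator code; the sleep-based limiter below is a stand-in for Spark's RateLimiter):
```scala
import scala.collection.mutable.ArrayBuffer

class ThrottledBuffer(permitsPerSecond: Int) {
  private val buffer = new ArrayBuffer[Any]()
  // stand-in for the real rate limiter
  private def waitToPush(): Unit = Thread.sleep(1000L / permitsPerSecond)

  def addData(data: Any): Unit = {
    waitToPush()                    // throttle OUTSIDE the lock...
    synchronized { buffer += data } // ...so the lock is held only for the cheap append
  }
}
```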
Author: Tathagata Das <[email protected]>
Closes #8204 from tdas/SPARK-9968 and squashes the following commits:
8cbcc1b [Tathagata Das] Removed unused val
a73b645 [Tathagata Das] Reduced time spent within synchronized block
(cherry picked from commit 18a761ef7a01a4dfa1dd91abe78cd68f2f8fdb67)
Signed-off-by: Tathagata Das <[email protected]>
commit 6be945cef041c36aeda20c72b25b5659adea9c5c
Author: Yin Huai <[email protected]>
Date: 2015-08-15T00:35:17Z
[SPARK-9949] [SQL] Fix TakeOrderedAndProject's output.
https://issues.apache.org/jira/browse/SPARK-9949
Author: Yin Huai <[email protected]>
Closes #8179 from yhuai/SPARK-9949.
(cherry picked from commit 932b24fd144232fb08184f0bd0a46369ecba164e)
Signed-off-by: Reynold Xin <[email protected]>
commit d84291713c48a0129619e3642e87337a03f14e07
Author: Reynold Xin <[email protected]>
Date: 2015-08-15T03:55:32Z
[SPARK-9934] Deprecate NIO ConnectionManager.
Deprecate NIO ConnectionManager in Spark 1.5.0, before removing it in Spark
1.6.0.
Author: Reynold Xin <[email protected]>
Closes #8162 from rxin/SPARK-9934.
(cherry picked from commit e5fd60415fbfea2c5c02602f7ddbc999dd058064)
Signed-off-by: Reynold Xin <[email protected]>
commit 3cdeeaf5eed4fb9d532d4ba41472590cac1cfd6d
Author: Davies Liu <[email protected]>
Date: 2015-08-15T03:56:55Z
[HOTFIX] fix duplicated braces
Author: Davies Liu <[email protected]>
Closes #8219 from davies/fix_typo.
(cherry picked from commit 37586e5449ff8f892d41f0b6b8fa1de83dd3849e)
Signed-off-by: Reynold Xin <[email protected]>
commit 83cbf60a2415178faae6ef7867bca0ceda347006
Author: Wenchen Fan <[email protected]>
Date: 2015-08-15T03:59:54Z
[SPARK-9634] [SPARK-9323] [SQL] cleanup unnecessary Aliases in LogicalPlan
at the end of analysis
Also alias the ExtractValue instead of wrapping it with UnresolvedAlias
when resolving attributes in LogicalPlan, as this alias will be trimmed if it's
unnecessary.
Based on #7957 without the changes to mllib, but instead maintaining
earlier behavior when using `withColumn` on expressions that already have
metadata.
Author: Wenchen Fan <[email protected]>
Author: Michael Armbrust <[email protected]>
Closes #8215 from marmbrus/pr/7957.
(cherry picked from commit ec29f2034a3306cc0afdc4c160b42c2eefa0897c)
Signed-off-by: Reynold Xin <[email protected]>
commit 33015009f5514fe510cdf5b486d2b84136a4522e
Author: zc he <[email protected]>
Date: 2015-08-15T04:28:50Z
[SPARK-9960] [GRAPHX] sendMessage type fix in LabelPropagation.scala
Author: zc he <[email protected]>
Closes #8188 from farseer90718/farseer-patch-1.
(cherry picked from commit 71a3af8a94f900a26ac7094f22ec1216cab62e15)
Signed-off-by: Reynold Xin <[email protected]>
commit d97af68af3e910eb7247e9832615758385d642b9
Author: Davies Liu <[email protected]>
Date: 2015-08-15T05:30:35Z
[SPARK-9725] [SQL] fix serialization of UTF8String across different JVM
The BYTE_ARRAY_OFFSET can differ between JVMs with different
configurations (for example, different heap sizes: 24 if the heap is larger
than 32 GB, otherwise 16), so the offset of a UTF8String is not portable; we
should handle that during serialization.
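A sketch of offset-independent serialization (not Spark's actual UTF8String code): write only the length and bytes, and let the reading JVM rebuild the string with its own base offset rather than trusting the writer's.
```scala
import java.io.{ObjectInputStream, ObjectOutputStream}

def writeBytes(out: ObjectOutputStream, bytes: Array[Byte]): Unit = {
  out.writeInt(bytes.length)   // no JVM-specific offset is written
  out.write(bytes)
}

def readBytes(in: ObjectInputStream): Array[Byte] = {
  val bytes = new Array[Byte](in.readInt())
  in.readFully(bytes)          // the reader applies its own BYTE_ARRAY_OFFSET
  bytes
}
```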
Author: Davies Liu <[email protected]>
Closes #8210 from davies/serialize_utf8string.
(cherry picked from commit 7c1e56825b716a7d703dff38254b4739755ac0c4)
Signed-off-by: Davies Liu <[email protected]>
commit 1a6f0af9f28519c4edf55225efcca772c0ae4803
Author: Herman van Hovell <[email protected]>
Date: 2015-08-15T09:46:04Z
[SPARK-9980] [BUILD] Fix SBT publishLocal error due to invalid characters
in doc
Tiny modification to a few comments to make `sbt publishLocal` work again.
Author: Herman van Hovell <[email protected]>
Closes #8209 from hvanhovell/SPARK-9980.
(cherry picked from commit a85fb6c07fdda5c74d53d6373910dcf5db3ff111)
Signed-off-by: Sean Owen <[email protected]>
commit 2fda1d8426b66e1d6f5f92317be453981e3770f9
Author: Wenchen Fan <[email protected]>
Date: 2015-08-15T21:13:12Z
[SPARK-9955] [SQL] correct error message for aggregate
We should skip unresolved `LogicalPlan`s in `PullOutNondeterministic`, as
calling `output` on an unresolved `LogicalPlan` will produce a confusing error
message.
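The guard pattern this describes, sketched on a toy plan type (the real rule is Catalyst's `PullOutNondeterministic`):
```scala
trait Plan { def resolved: Boolean }

def pullOutNondeterministic(plan: Plan): Plan = plan match {
  case p if !p.resolved => p   // skip: touching `output` here would raise a confusing error
  case p                => p   // the actual rewrite of nondeterministic expressions goes here
}
```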
Author: Wenchen Fan <[email protected]>
Closes #8203 from cloud-fan/error-msg and squashes the following commits:
1c67ca7 [Wenchen Fan] move test
7593080 [Wenchen Fan] correct error message for aggregate
(cherry picked from commit 570567258b5839c1e0e28b5182f4c29b119ed4c4)
Signed-off-by: Michael Armbrust <[email protected]>
commit 881baf100fa9d8135b16cd390c344e3a5995805e
Author: Joseph K. Bradley <[email protected]>
Date: 2015-08-16T01:48:20Z
[SPARK-9805] [MLLIB] [PYTHON] [STREAMING] Added _eventually for ml
streaming pyspark tests
Recently, PySpark ML streaming tests have been flaky, most likely because
the batches are not being processed in time. Proposal: replace the use of
`_ssc_wait` (which waits for a fixed amount of time) with a method that waits
for up to a fixed amount of time but can terminate early based on a termination
condition method. With this, we can extend the waiting period (to make tests
less flaky) but also stop early when possible (making tests faster on average,
which I verified locally).
CC: mengxr tdas freeman-lab
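A generic sketch of that idea, i.e. a bounded wait that can exit early (the real helper is PySpark's `_eventually`; this version is illustrative only):
```scala
def eventually(timeoutMs: Long, intervalMs: Long = 100)(condition: () => Boolean): Boolean = {
  val deadline = System.currentTimeMillis() + timeoutMs
  while (System.currentTimeMillis() < deadline) {
    if (condition()) return true   // stop early once the condition holds
    Thread.sleep(intervalMs)
  }
  condition()                      // one final check at the deadline
}
```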
Author: Joseph K. Bradley <[email protected]>
Closes #8087 from jkbradley/streaming-ml-tests.
(cherry picked from commit 1db7179fae672fcec7b8de12c374dd384ce51c67)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit 4f75ce2e193c813f4e3ad067749b6e7b4f0ee135
Author: Sun Rui <[email protected]>
Date: 2015-08-16T07:30:02Z
[SPARK-8844] [SPARKR] head/collect is broken in SparkR.
This is a WIP patch for SPARK-8844, for collecting reviews.
This bug is about reading an empty DataFrame. In `readCol()`,
`lapply(1:numRows, function(x) {...})`
does not take into account the case where numRows = 0 (in R, `1:0`
evaluates to `c(1, 0)` rather than an empty sequence).
Will add a unit test case.
Author: Sun Rui <[email protected]>
Closes #7419 from sun-rui/SPARK-8844.
(cherry picked from commit 5f9ce738fe6bab3f0caffad0df1d3876178cf469)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit fa55c27427bec0291847d254f4424b754dd211c9
Author: Matei Zaharia <[email protected]>
Date: 2015-08-16T07:34:58Z
[SPARK-10008] Ensure shuffle locality doesn't take precedence over narrow
deps
The shuffle locality patch made the DAGScheduler aware of shuffle data,
but for RDDs that have both narrow and shuffle dependencies, it can
cause them to place tasks based on the shuffle dependency instead of the
narrow one. This case is common in iterative join-based algorithms like
PageRank and ALS, where one RDD is hash-partitioned and one isn't.
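A simplified sketch of the intended preference (not the DAGScheduler's actual code; the two location lists are assumed inputs):
```scala
// narrow-dependency locality wins; shuffle locality is only a fallback
def preferredLocations(narrowLocs: Seq[String], shuffleLocs: Seq[String]): Seq[String] =
  if (narrowLocs.nonEmpty) narrowLocs else shuffleLocs
```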
Author: Matei Zaharia <[email protected]>
Closes #8220 from mateiz/shuffle-loc-fix.
(cherry picked from commit cf016075a006034c24c5b758edb279f3e151d25d)
Signed-off-by: Matei Zaharia <[email protected]>
commit e2c6ef81030aaf472771d98ec86d1c17119f2c4e
Author: Kun Xu <[email protected]>
Date: 2015-08-16T06:44:23Z
[SPARK-9973] [SQL] Correct in-memory columnar buffer size
The `initialSize` argument of `ColumnBuilder.initialize()` should be the
number of rows rather than bytes. However `InMemoryColumnarTableScan`
passes in a byte size, which makes Spark SQL allocate more memory than
necessary when building in-memory columnar buffers.
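The unit mix-up, sketched with made-up names (not the actual call sites):
```scala
val batchSize = 10000        // rows per in-memory batch
val bytesPerValue = 8        // e.g. a LONG column
// before: builder.initialize(batchSize * bytesPerValue)  // passes bytes -- over-allocates
// after:  builder.initialize(batchSize)                  // passes rows  -- the expected unit
```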
Author: Kun Xu <[email protected]>
Closes #8189 from viper-kun/errorSize.
(cherry picked from commit 182f9b7a6d3a3ee7ec7de6abc24e296aa794e4e8)
Signed-off-by: Cheng Lian <[email protected]>
commit 90245f65c94a40d3210207abaf6f136f5ce2861f
Author: Cheng Lian <[email protected]>
Date: 2015-08-16T17:17:58Z
[SPARK-10005] [SQL] Fixes schema merging for nested structs
When merging schemas, we only handled first-level fields when
converting Parquet groups to `InternalRow`s; nested struct fields were not
properly handled.
For example, the schema of a Parquet file to be read can be:
```
message individual {
required group f1 {
optional binary f11 (utf8);
}
}
```
while the global schema is:
```
message global {
required group f1 {
optional binary f11 (utf8);
optional int32 f12;
}
}
```
This PR fixes this issue by padding missing fields when creating actual
converters.
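For reference, this is the kind of read that exercises the merged schema (Spark 1.5-era API; the path and an existing `sqlContext` are assumptions):
```scala
val df = sqlContext.read.option("mergeSchema", "true").parquet("/path/to/individuals")
df.printSchema()   // f1 now carries both f11 and f12; missing f12 values come back as null
```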
Author: Cheng Lian <[email protected]>
Closes #8228 from liancheng/spark-10005/nested-schema-merging.
(cherry picked from commit ae2370e72f93db8a28b262e8252c55fe1fc9873c)
Signed-off-by: Yin Huai <[email protected]>
commit 78275c48035d65359f4749b2da3faa3cc95bd607
Author: Yu ISHIKAWA <[email protected]>
Date: 2015-08-17T06:33:20Z
[SPARK-9871] [SPARKR] Add expression functions into SparkR which have a
variable parameter
### Summary
- Add `lit` function
- Add `concat`, `greatest`, `least` functions
I think we need to improve the `collect` function in order to implement
the `struct` function, since `collect` doesn't work with arguments that include
a nested `list` variable. It seems that a list produced by `struct` still has
`jobj` classes, so it would be better to solve this problem in another issue.
### JIRA
[[SPARK-9871] Add expression functions into SparkR which have a variable
parameter - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9871)
Author: Yu ISHIKAWA <[email protected]>
Closes #8194 from yu-iskw/SPARK-9856.
(cherry picked from commit 26e760581fdf7ca913da93fa80e73b7ddabcedf6)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit d554bf48cfd6b4e0ba33764cf14a27b1e6cc636d
Author: Feynman Liang <[email protected]>
Date: 2015-08-17T16:58:34Z
[SPARK-9959] [MLLIB] Association Rules Java Compatibility
mengxr
Author: Feynman Liang <[email protected]>
Closes #8206 from feynmanliang/SPARK-9959-arules-java.
(cherry picked from commit f7efda3975d46a8ce4fd720b3730127ea482560b)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 727944564d968dbab8352958f44e2209f9d172c3
Author: Cheng Lian <[email protected]>
Date: 2015-08-17T16:59:05Z
[SPARK-7837] [SQL] Avoids double closing output writers when commitTask()
fails
When inserting data into a `HadoopFsRelation`, if `commitTask()` of the
writer container fails, `abortTask()` will be invoked. However, both
`commitTask()` and `abortTask()` try to close the output writer(s). The problem
is that closing the underlying writers may not be an idempotent operation. E.g.,
`ParquetRecordWriter.close()` throws an NPE when called twice.
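One way to make a close idempotent, as a sketch (not necessarily the PR's exact change):
```scala
import java.io.Closeable

class SafeCloser(underlying: Closeable) extends Closeable {
  private var closed = false
  override def close(): Unit = synchronized {
    if (!closed) {           // a second close() becomes a no-op instead of an NPE
      closed = true
      underlying.close()
    }
  }
}
```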
Author: Cheng Lian <[email protected]>
Closes #8236 from liancheng/spark-7837/double-closing.
(cherry picked from commit 76c155dd4483d58499e5cb66e5e9373bb771dbeb)
Signed-off-by: Cheng Lian <[email protected]>
commit 76390ec00a659b5e3dead0792bfe51cbb59b883b
Author: Wenchen Fan <[email protected]>
Date: 2015-08-17T18:36:18Z
[SPARK-9950] [SQL] Wrong Analysis Error for grouping/aggregating on struct
fields
This issue has been fixed by https://github.com/apache/spark/pull/8215,
this PR added regression test for it.
Author: Wenchen Fan <[email protected]>
Closes #8222 from cloud-fan/minor and squashes the following commits:
0bbfb1c [Wenchen Fan] fix style...
7e2d8d9 [Wenchen Fan] add test
(cherry picked from commit a4acdabb103f6d04603163c9555c1ddc413c3b80)
Signed-off-by: Michael Armbrust <[email protected]>
commit 4daf79f154dfe24718cf39df100d9c7d3f4f4c98
Author: zsxwing <[email protected]>
Date: 2015-08-17T18:53:33Z
[SPARK-10036] [SQL] Load JDBC driver in DataFrameReader.jdbc and
DataFrameWriter.jdbc
This PR uses `JDBCRDD.getConnector` to load the JDBC driver before creating
a connection in `DataFrameReader.jdbc` and `DataFrameWriter.jdbc`.
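The essence of the fix, sketched with plain JDBC (the driver class and URL below are examples only; Spark routes this through `JDBCRDD.getConnector`):
```scala
import java.sql.DriverManager

// register the driver on the executing JVM *before* asking DriverManager
// for a connection, so the lookup cannot fail with "No suitable driver"
Class.forName("org.postgresql.Driver")
val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")
```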
Author: zsxwing <[email protected]>
Closes #8232 from zsxwing/SPARK-10036 and squashes the following commits:
adf75de [zsxwing] Add extraOptions to the connection properties
57f59d4 [zsxwing] Load JDBC driver in DataFrameReader.jdbc and
DataFrameWriter.jdbc
(cherry picked from commit f10660fe7b809be2059da4f9781a5743f117a35a)
Signed-off-by: Michael Armbrust <[email protected]>
commit 24765cc9b4841dcdef62df63d8f5b55965947e15
Author: Yijie Shen <[email protected]>
Date: 2015-08-17T21:10:19Z
[SPARK-9526] [SQL] Utilize randomized tests to reveal potential bugs in sql
expressions
JIRA: https://issues.apache.org/jira/browse/SPARK-9526
This PR is a follow-up of #7830, aiming to use randomized tests to
reveal more potential bugs in SQL expressions.
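A generic sketch of randomized expression testing (this is not the PR's harness; `evalSqlExpr` is a hypothetical helper that evaluates a SQL expression string):
```scala
import scala.util.Random

val rng = new Random(42)                    // fixed seed keeps failures reproducible
for (_ <- 1 to 1000) {
  val (x, y) = (rng.nextInt(1000), rng.nextInt(1000))
  // evalSqlExpr is hypothetical: compare the engine against a trusted reference
  assert(evalSqlExpr(s"$x + $y") == x + y)
}
```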
Author: Yijie Shen <[email protected]>
Closes #7855 from yjshen/property_check.
(cherry picked from commit b265e282b62954548740a5767e97ab1678c65194)
Signed-off-by: Josh Rosen <[email protected]>
commit f77eaaf348fd95364e2ef7d50a80c50d17894431
Author: Yin Huai <[email protected]>
Date: 2015-08-17T22:30:50Z
[SPARK-9592] [SQL] Fix Last function implemented based on
AggregateExpression1.
https://issues.apache.org/jira/browse/SPARK-9592
#8113 has the fundamental fix. But, if we want to minimize the number of
changed lines, we can go with this one. Then, in 1.6, we merge #8113.
Author: Yin Huai <[email protected]>
Closes #8172 from yhuai/lastFix and squashes the following commits:
b28c42a [Yin Huai] Regression test.
af87086 [Yin Huai] Fix last.
(cherry picked from commit 772e7c18fb1a79c0f080408cb43307fe89a4fa04)
Signed-off-by: Michael Armbrust <[email protected]>
commit bb3bb2a48ee32a5de4637a73dd11930c72f9c77e
Author: Feynman Liang <[email protected]>
Date: 2015-08-17T22:42:14Z
[SPARK-10068] [MLLIB] Adds links to MLlib types, algos, utilities listing
mengxr jkbradley
Author: Feynman Liang <[email protected]>
Closes #8255 from feynmanliang/SPARK-10068.
(cherry picked from commit fdaf17f63f751f02623414fbc7d0a2f545364050)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 0f1417b6f31e53dd78aae2a0a661d9ba32dce5b7
Author: Sameer Abhyankar <[email protected]>
Date: 2015-08-17T23:00:23Z
[SPARK-8920] [MLLIB] Add @since tags to mllib.linalg
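For readers unfamiliar with the tag, a made-up example of the format being added (the method and version number are invented for illustration):
```scala
object VectorOps {
  /** Returns the squared L2 norm of the given values.
    * @since 1.5.0
    */
  def squaredNorm(values: Array[Double]): Double = values.map(v => v * v).sum
}
```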
Author: Sameer Abhyankar <[email protected]>
Author: Sameer Abhyankar <[email protected]>
Closes #7729 from sabhyankar/branch_8920.
(cherry picked from commit 088b11ec5949e135cb3db2a1ce136837e046c288)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 407175e824169a01762bdd27f704ac017d6d3e60
Author: Cheng Lian <[email protected]>
Date: 2015-08-18T00:25:14Z
[SPARK-9974] [BUILD] [SQL] Makes sure
com.twitter:parquet-hadoop-bundle:1.6.0 is in SBT assembly jar
PR #7967 enables Spark SQL to persist Parquet tables in a Hive-compatible
format when possible. One of the consequences is that we have to set the
input/output classes to `MapredParquetInputFormat`/`MapredParquetOutputFormat`,
which rely on com.twitter:parquet-hadoop:1.6.0 bundled with Hive 1.2.1.
When loading such a table in Spark SQL, `o.a.h.h.ql.metadata.Table` first
loads these input/output format classes, and thus classes in
com.twitter:parquet-hadoop:1.6.0. However, the scope of this dependency is
defined as "runtime" and it is not packaged into the Spark assembly jar. This
results in a `ClassNotFoundException`.
This issue can be worked around by asking users to add parquet-hadoop 1.6.0
via the `--driver-class-path` option. However, considering that the Maven build
is immune to this problem, I feel it can be confusing and inconvenient for
users. So this PR fixes the issue by changing the scope of parquet-hadoop 1.6.0
to "compile" (sketched as an sbt line after this commit entry).
Author: Cheng Lian <[email protected]>
Closes #8198 from liancheng/spark-9974/bundle-parquet-1.6.0.
(cherry picked from commit 52ae952574f5d641a398dd185e09e5a79318c8a9)
Signed-off-by: Reynold Xin <[email protected]>
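The scope change above, expressed as an sbt dependency line for comparison (a sketch; the actual fix edits the assembly build, and the coordinates are taken from the commit title):
```scala
// "compile" scope ensures the jar lands in the assembly, unlike "runtime"
libraryDependencies += "com.twitter" % "parquet-hadoop-bundle" % "1.6.0" % "compile"
```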
commit eaeebb92f336d3862169c61e7dcc6afa2732084b
Author: Yanbo Liang <[email protected]>
Date: 2015-08-18T00:25:41Z
[SPARK-9768] [PYSPARK] [ML] Add Python API and user guide for
ml.feature.ElementwiseProduct
Add Python API, user guide and example for ml.feature.ElementwiseProduct.
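For illustration, the pre-existing Scala counterpart of the new Python API (1.5-era; the column names are made up):
```scala
import org.apache.spark.ml.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors

val transformer = new ElementwiseProduct()
  .setScalingVec(Vectors.dense(0.0, 1.0, 2.0))   // per-dimension weights
  .setInputCol("vector")
  .setOutputCol("transformedVector")
```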
Author: Yanbo Liang <[email protected]>
Closes #8061 from yanboliang/SPARK-9768.
(cherry picked from commit 0076e8212334c613599dcbc2ac23f49e9e50cc44)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit f5ed9ede9f962b8d36c897ac9ca798947ae5b96f
Author: Prayag Chandran <[email protected]>
Date: 2015-08-18T00:26:08Z
SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression
Added @since tags to mllib.regression.
Author: Prayag Chandran <[email protected]>
Closes #7518 from prayagchandran/sinceTags and squashes the following
commits:
fa4dda2 [Prayag Chandran] Re-formatting
6c6d584 [Prayag Chandran] Corrected a few tags. Removed few unnecessary tags
1a0365f [Prayag Chandran] Reformating and adding a few more tags
89fdb66 [Prayag Chandran] SPARK-8916 [Documentation, MLlib] Add @since tags
to mllib.regression
(cherry picked from commit 18523c130548f0438dff8d1f25531fd2ed36e517)
Signed-off-by: DB Tsai <[email protected]>
commit 18b3d11f787c48b429ffdef0075d398d7a0ab1a1
Author: Feynman Liang <[email protected]>
Date: 2015-08-18T00:53:24Z
[SPARK-9898] [MLLIB] Prefix Span user guide
Adds user guide for `PrefixSpan`, including Scala and Java example code.
mengxr zhangjiajin
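A minimal Scala usage sketch in the spirit of the new guide (1.5-era API; `sequences` is a hypothetical `RDD[Array[Array[Int]]]` of item sequences):
```scala
import org.apache.spark.mllib.fpm.PrefixSpan

val prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)          // a pattern must appear in half the sequences
  .setMaxPatternLength(5)
val model = prefixSpan.run(sequences)
model.freqSequences.collect().foreach { fs =>
  println(fs.sequence.map(_.mkString("[", ",", "]")).mkString(" ") + ": " + fs.freq)
}
```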
Author: Feynman Liang <[email protected]>
Closes #8253 from feynmanliang/SPARK-9898.
(cherry picked from commit 0b6b01761370629ce387c143a25d41f3a334ff28)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 5de0ffbd0e0aef170171cec8808eb4ec1ba79b0f
Author: Sandy Ryza <[email protected]>
Date: 2015-08-18T00:57:51Z
[SPARK-7707] User guide and example code for KernelDensity
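A minimal usage sketch in the spirit of the new guide (1.5-era API; `sample` is a hypothetical `RDD[Double]`):
```scala
import org.apache.spark.mllib.stat.KernelDensity

val kd = new KernelDensity()
  .setSample(sample)           // the observed data points
  .setBandwidth(3.0)           // kernel width
val densities: Array[Double] = kd.estimate(Array(-1.0, 2.0, 5.0))
```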
Author: Sandy Ryza <[email protected]>
Closes #8230 from sryza/sandy-spark-7707.
(cherry picked from commit f9d1a92aa1bac4494022d78559b871149579e6e8)
Signed-off-by: Xiangrui Meng <[email protected]>
----