GitHub user yuany opened a pull request:
https://github.com/apache/spark/pull/8767
Master
Small typo in the example for `LabelledPoint` in the MLLib docs.
Author: Sean Paradiso <[email protected]>
Closes #8680 from sparadiso/docs_mllib_smalltypo.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/8767.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #8767
----
commit c6df5f66d9a8b9760f2cd46fcd930f977650c9c5
Author: zsxwing <[email protected]>
Date: 2015-08-24T00:41:49Z
[SPARK-10148] [STREAMING] Display active and inactive receiver numbers in
Streaming page
Added the active and inactive receiver numbers in the summary section of
Streaming page.
<img width="1074" alt="screen shot 2015-08-21 at 2 08 54 pm"
src="https://cloud.githubusercontent.com/assets/1000778/9402437/ff2806a2-480f-11e5-8f8e-efdf8e5d514d.png">
Author: zsxwing <[email protected]>
Closes #8351 from zsxwing/receiver-number.
commit b963c19a803c5a26c9b65655d40ca6621acf8bd4
Author: Joseph K. Bradley <[email protected]>
Date: 2015-08-24T01:34:07Z
[SPARK-10164] [MLLIB] Fixed GMM distributed decomposition bug
GaussianMixture now distributes matrix decompositions for certain problem
sizes. Distributed computation actually fails, but this was not tested in unit
tests.
This PR adds a unit test which checks this. It failed previously but works
with this fix.
CC: mengxr
Author: Joseph K. Bradley <[email protected]>
Closes #8370 from jkbradley/gmm-fix.
commit 053d94fcf32268369b5a40837271f15d6af41aa4
Author: Tathagata Das <[email protected]>
Date: 2015-08-24T02:24:32Z
[SPARK-10142] [STREAMING] Made python checkpoint recovery handle non-local
checkpoint paths and existing SparkContexts
The current code only checks for checkpoint files in the local filesystem, and
always tries to create a new Python SparkContext (even if one already exists).
The solution is to do the following:
1. Use the same code path as Java to check whether a valid checkpoint exists
2. Create a new Python SparkContext only if there is no active one.
There is no test for this path, as it's hard to test distributed
filesystem paths in a local unit test. I am going to test it manually with a
distributed file system to verify that this patch works.
Author: Tathagata Das <[email protected]>
Closes #8366 from tdas/SPARK-10142 and squashes the following commits:
3afa666 [Tathagata Das] Added tests
2dd4ae5 [Tathagata Das] Added the check to not create a context if one
already exists
9bf151b [Tathagata Das] Made python checkpoint recovery use java to find
the checkpoint files
commit 4e0395ddb764d092b5b38447af49e196e590e0f0
Author: zsxwing <[email protected]>
Date: 2015-08-24T19:38:01Z
[SPARK-10168] [STREAMING] Fix the issue that maven publishes wrong artifact
jars
This PR removed the `outputFile` configuration from pom.xml and updated
`tests.py` to search jars for both sbt build and maven build.
I ran `mvn -Pkinesis-asl -DskipTests clean install` locally, and verified
that the jars in my local repository were correct. I also ran the Python tests
against the maven build, and they all passed.
Author: zsxwing <[email protected]>
Closes #8373 from zsxwing/SPARK-10168 and squashes the following commits:
e0b5818 [zsxwing] Fix the sbt build
c697627 [zsxwing] Add the jar pathes to the exception message
be1d8a5 [zsxwing] Fix the issue that maven publishes wrong artifact jars
commit 7478c8b66d6a2b1179f20c38b49e27e37b0caec3
Author: Tathagata Das <[email protected]>
Date: 2015-08-24T19:40:09Z
[SPARK-9791] [PACKAGE] Change private class to private[package] class to prevent
unnecessary classes from showing up in the docs
In addition, some random cleanup of import ordering
Author: Tathagata Das <[email protected]>
Closes #8387 from tdas/SPARK-9791 and squashes the following commits:
67f3ee9 [Tathagata Das] Change private class to private[package] class to
prevent them from showing up in the docs
commit 9ce0c7ad333f4a3c01207e5e9ed42bcafb99d894
Author: Burak Yavuz <[email protected]>
Date: 2015-08-24T20:48:01Z
[SPARK-7710] [SPARK-7998] [DOCS] Docs for DataFrameStatFunctions
This PR contains examples on how to use some of the Stat Functions
available for DataFrames under `df.stat`.
rxin
Author: Burak Yavuz <[email protected]>
Closes #8378 from brkyvz/update-sql-docs.
commit 662bb9667669cb07cf6d2ccee0d8e76bb561cd89
Author: Andrew Or <[email protected]>
Date: 2015-08-24T21:10:50Z
[SPARK-10144] [UI] Actually show peak execution memory by default
The peak execution memory metric was introduced in SPARK-8735. That was
before Tungsten was enabled by default, so it assumed that
`spark.sql.unsafe.enabled` must be explicitly set to true. The result is that
the memory is not displayed by default.
Author: Andrew Or <[email protected]>
Closes #8345 from andrewor14/show-memory-default.
commit a2f4cdceba32aaa0df59df335ca0ce1ac73fc6c2
Author: Cheng Lian <[email protected]>
Date: 2015-08-24T21:11:19Z
[SPARK-8580] [SQL] Refactors ParquetHiveCompatibilitySuite and adds more
test cases
This PR refactors `ParquetHiveCompatibilitySuite` so that it's easier to
add new test cases.
Hit two bugs, SPARK-10177 and HIVE-11625, while working on this, added test
cases for them and marked as ignored for now. SPARK-10177 will be addressed in
a separate PR.
Author: Cheng Lian <[email protected]>
Closes #8392 from liancheng/spark-8580/parquet-hive-compat-tests.
commit cb2d2e15844d7ae34b5dd7028b55e11586ed93fa
Author: Sean Owen <[email protected]>
Date: 2015-08-24T21:35:21Z
[SPARK-9758] [TEST] [SQL] Compilation issue for hive test / wrong package?
Move `test.org.apache.spark.sql.hive` package tests to apparent intended
`org.apache.spark.sql.hive` as they don't intend to test behavior from outside
org.apache.spark.*
Alternate take, per discussion at https://github.com/apache/spark/pull/8051
I think this is what vanzin and I had in mind but also CC rxin to
cross-check, as this does indeed depend on whether these tests were
accidentally in this package or not. Testing from a `test.org.apache.spark`
package is legitimate but didn't seem to be the intent here.
Author: Sean Owen <[email protected]>
Closes #8307 from srowen/SPARK-9758.
commit 13db11cb08eb90eb0ea3402c9fe0270aa282f247
Author: Joseph K. Bradley <[email protected]>
Date: 2015-08-24T22:38:54Z
[SPARK-10061] [DOC] ML ensemble docs
User guide for spark.ml GBTs and Random Forests.
The examples are copied from the decision tree guide and modified to run.
I caught some issues I had somehow missed in the tree guide as well.
I have run all examples, including Java ones. (Of course, I thought I had
previously as well...)
CC: mengxr manishamde yanboliang
Author: Joseph K. Bradley <[email protected]>
Closes #8369 from jkbradley/ml-ensemble-docs.
commit d7b4c095271c36fcc7f9ded267ecf5ec66fac803
Author: Josh Rosen <[email protected]>
Date: 2015-08-24T23:17:45Z
[SPARK-10190] Fix NPE in CatalystTypeConverters Decimal toScala converter
This adds a missing null check to the Decimal `toScala` converter in
`CatalystTypeConverters`, fixing an NPE.
Author: Josh Rosen <[email protected]>
Closes #8401 from JoshRosen/SPARK-10190.
commit 2bf338c626e9d97ccc033cfadae8b36a82c66fd1
Author: Michael Armbrust <[email protected]>
Date: 2015-08-25T01:10:51Z
[SPARK-10165] [SQL] Await child resolution in ResolveFunctions
Currently, we eagerly attempt to resolve functions, even before their
children are resolved. However, this is not valid in cases where we need to
know the types of the input arguments (i.e. when resolving Hive UDFs).
As a fix, this PR delays function resolution until the function's children
are resolved. This change also necessitates a change to the way we resolve
aggregate expressions that are not in aggregate operators (e.g., in `HAVING` or
`ORDER BY` clauses). Specifically, we can't assume that these misplaced
functions will be resolved, allowing us to differentiate aggregate functions
from normal functions. To compensate for this change we now attempt to resolve
these unresolved expressions in the context of the aggregate operator, before
checking to see if any aggregate expressions are present.
Author: Michael Armbrust <[email protected]>
Closes #8371 from marmbrus/hiveUDFResolution.
commit 6511bf559b736d8e23ae398901c8d78938e66869
Author: Yu ISHIKAWA <[email protected]>
Date: 2015-08-25T01:17:51Z
[SPARK-10118] [SPARKR] [DOCS] Improve SparkR API docs for 1.5 release
cc: shivaram
## Summary
- Modify `rdname` of expression functions, e.g. `ascii`: `rdname functions`
=> `rdname ascii`
- Replace the dynamic function definitions with static ones for the sake of
their documentation.
## Generated PDF File
https://drive.google.com/file/d/0B9biIZIU47lLX2t6ZjRoRnBTSEU/view?usp=sharing
## JIRA
[[SPARK-10118] Improve SparkR API docs for 1.5 release - ASF
JIRA](https://issues.apache.org/jira/browse/SPARK-10118)
Author: Yu ISHIKAWA <[email protected]>
Author: Yuu ISHIKAWA <[email protected]>
Closes #8386 from yu-iskw/SPARK-10118.
commit 642c43c81c835139e3f35dfd6a215d668a474203
Author: Feynman Liang <[email protected]>
Date: 2015-08-25T02:45:41Z
[SQL] [MINOR] [DOC] Clarify docs for inferring DataFrame from RDD of
Products
* Makes `SQLImplicits.rddToDataFrameHolder` scaladoc consistent with
`SQLContext.createDataFrame[A <: Product](rdd: RDD[A])` since the former is
essentially a wrapper for the latter
* Clarifies `createDataFrame[A <: Product]` scaladoc to apply for any
`RDD[Product]`, not just case classes
Author: Feynman Liang <[email protected]>
Closes #8406 from feynmanliang/sql-doc-fixes.
commit a0c0aae1defe5e1e57704065631d201f8e3f6bac
Author: Yin Huai <[email protected]>
Date: 2015-08-25T04:49:50Z
[SPARK-10121] [SQL] Thrift server always use the latest class loader
provided by the conf of executionHive's state
https://issues.apache.org/jira/browse/SPARK-10121
Looks like the problem is that if we add a jar through another thread, the
thread handling the JDBC session will not get the latest classloader.
Author: Yin Huai <[email protected]>
Closes #8368 from yhuai/SPARK-10121.
commit 5175ca0c85b10045d12c3fb57b1e52278a413ecf
Author: Michael Armbrust <[email protected]>
Date: 2015-08-25T06:15:27Z
[SPARK-10178] [SQL] HiveComparisionTest should print out dependent tables
In `HiveComparisionTest`s it is possible to fail a query of the form
`SELECT * FROM dest1`, where `dest1` is the query that is actually computing
the incorrect results. To aid debugging this patch improves the harness to
also print these query plans and their results.
Author: Michael Armbrust <[email protected]>
Closes #8388 from marmbrus/generatedTables.
commit d9c25dec87e6da7d66a47ff94e7eefa008081b9d
Author: cody koeninger <[email protected]>
Date: 2015-08-25T06:26:14Z
[SPARK-9786] [STREAMING] [KAFKA] fix backpressure so it works with default
maxRatePerPartition setting of 0
Author: cody koeninger <[email protected]>
Closes #8413 from koeninger/backpressure-testing-master.
commit f023aa2fcc1d1dbb82aee568be0a8f2457c309ae
Author: zsxwing <[email protected]>
Date: 2015-08-25T06:34:50Z
[SPARK-10137] [STREAMING] Avoid restarting receivers if scheduleReceivers
returns balanced results
This PR fixes the following cases for `ReceiverSchedulingPolicy`.
1) Assume there are 4 executors: host1, host2, host3, host4, and 5
receivers: r1, r2, r3, r4, r5. Then
`ReceiverSchedulingPolicy.scheduleReceivers` will return (r1 -> host1, r2 ->
host2, r3 -> host3, r4 -> host4, r5 -> host1).
Let's assume r1 starts first on `host1`, as `scheduleReceivers`
suggested, and tries to register with ReceiverTracker. But the previous
`ReceiverSchedulingPolicy.rescheduleReceiver` will return (host2, host3, host4)
according to the current executor weights (host1 -> 1.0, host2 -> 0.5, host3 ->
0.5, host4 -> 0.5), so ReceiverTracker will reject `r1`. This is unexpected
since r1 is starting exactly where `scheduleReceivers` suggested.
This case can be fixed by ignoring the information of the receiver that is
rescheduling in `receiverTrackingInfoMap`.
2) Assume there are 3 executors (host1, host2, host3), each executor
has 3 cores, and there are 3 receivers: r1, r2, r3. Assume r1 is running on host1. When r2
restarts, the previous `ReceiverSchedulingPolicy.rescheduleReceiver` will
always return (host1, host2, host3). So it's possible that r2 will be scheduled
to host1 by TaskScheduler. r3 is similar. Then at last, it's possible that
there are 3 receivers running on host1, while host2 and host3 are idle.
This issue can be fixed by returning only executors that have the minimum
weight rather than returning at least 3 executors.
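The two fixes above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual `ReceiverSchedulingPolicy` code: the function name `candidate_executors` and the `contributions` mapping (receiver id to per-executor weight) are assumptions made for the example, and the weights are simplified.

```python
def candidate_executors(receiver_id, executors, contributions):
    """Pick candidate executors for a receiver that is being (re)scheduled.

    `contributions` is a hypothetical stand-in for the tracking info:
    {receiver_id: {executor: weight}}.
    """
    if not executors:
        return []
    weights = {executor: 0.0 for executor in executors}
    for other_id, per_executor in contributions.items():
        if other_id == receiver_id:
            # Fix for case 1: ignore the receiver that is being rescheduled.
            continue
        for executor, weight in per_executor.items():
            if executor in weights:
                weights[executor] += weight
    # Fix for case 2: return only the executors with the minimum weight.
    min_weight = min(weights.values())
    return sorted(e for e, w in weights.items() if w == min_weight)
```

With r1 running on host1 and r2 restarting, only host2 and host3 are returned, so r2 can no longer land on host1.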
Author: zsxwing <[email protected]>
Closes #8340 from zsxwing/fix-receiver-scheduling.
commit df7041d02d3fd44b08a859f5d77bf6fb726895f0
Author: Yin Huai <[email protected]>
Date: 2015-08-25T06:38:32Z
[SPARK-10196] [SQL] Correctly saving decimals in internal rows to JSON.
https://issues.apache.org/jira/browse/SPARK-10196
Author: Yin Huai <[email protected]>
Closes #8408 from yhuai/DecimalJsonSPARK-10196.
commit bf03fe68d62f33dda70dff45c3bda1f57b032dfc
Author: Cheng Lian <[email protected]>
Date: 2015-08-25T06:58:42Z
[SPARK-10136] [SQL] A more robust fix for SPARK-10136
PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root
cause. The real problem is rather tricky to explain, and requires readers to
be fairly familiar with the parquet-format spec, especially the details of the
`LIST` backwards-compatibility rules. Let me try to explain it here.
The structure of the problematic Parquet schema generated by parquet-avro
is something like this:
```
message m {
<repetition> group f (LIST) { // Level 1
repeated group array (LIST) { // Level 2
repeated <primitive-type> array; // Level 3
}
}
}
```
(The schema generated by parquet-thrift is structurally similar, just
replace the `array` at level 2 with `f_tuple`, and the other one at level 3
with `f_tuple_tuple`.)
This structure consists of two nested legacy 2-level `LIST`-like structures:
1. The repeated group type at level 2 is the element type of the outer
array defined at level 1
This group should map to a `CatalystArrayConverter.ElementConverter`
when building converters.
2. The repeated primitive type at level 3 is the element type of the inner
array defined at level 2
This type should also map to a
`CatalystArrayConverter.ElementConverter`.
The root cause of SPARK-10136 is that the group at level 2 isn't properly
recognized as the element type of level 1. Thus, according to the parquet-format
spec, the repeated primitive at level 3 is left as a so-called "unannotated
repeated primitive type" and is interpreted as a required list of required
primitives, so a `RepeatedPrimitiveConverter` is created for it instead of a
`CatalystArrayConverter.ElementConverter`.
According to the parquet-format spec, unannotated repeated types shouldn't
appear in a `LIST`- or `MAP`-annotated group. PR #8341 fixed this issue by
allowing such unannotated repeated types to appear in `LIST`-annotated groups,
which is a non-standard, hacky, but valid fix. (I didn't realize this when
authoring #8341 though.)
As for the reason why level 2 isn't recognized as a list element type, it's
because of the following `LIST` backwards-compatibility rule defined in the
parquet-format spec:
> If the repeated field is a group with one field and is named either
`array` or uses the `LIST`-annotated group's name with `_tuple` appended then
the repeated type is the element type and elements are required.
(The `array` part is for parquet-avro compatibility, while the `_tuple`
part is for parquet-thrift.)
This rule is implemented in [`CatalystSchemaConverter.isElementType`] [1],
but neglected in [`CatalystRowConverter.isElementType`] [2]. This PR delivers
a more robust fix by adding this rule in the latter method.
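As an illustration, the quoted backwards-compatibility rule can be expressed as a small predicate. This is a sketch in Python over a hypothetical `Field` type, not the Scala code in `CatalystRowConverter`, and it covers only this one rule rather than the full element-type check.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """Hypothetical minimal stand-in for a Parquet schema node."""
    name: str
    is_group: bool = False
    children: List["Field"] = field(default_factory=list)


def is_legacy_element_type(repeated: Field, parent_name: str) -> bool:
    """Apply only the rule quoted above.

    A repeated group with exactly one field is the element type if it is
    named `array` (parquet-avro) or `<parent>_tuple` (parquet-thrift).
    """
    return (
        repeated.is_group
        and len(repeated.children) == 1
        and repeated.name in ("array", parent_name + "_tuple")
    )
```

For the schema shown earlier, the level 2 group named `array` under `f` satisfies the predicate, so it is treated as the element type of the outer list.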
Note that parquet-avro 1.7.0 also suffers from this issue. Details can be
found at [PARQUET-364] [3].
[1]:
https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L259-L305
[2]:
https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L456-L463
[3]: https://issues.apache.org/jira/browse/PARQUET-364
Author: Cheng Lian <[email protected]>
Closes #8361 from liancheng/spark-10136/proper-version.
commit 82268f07abfa658869df2354ae72f8d6ddd119e8
Author: Josh Rosen <[email protected]>
Date: 2015-08-25T07:04:10Z
[SPARK-9293] [SPARK-9813] Analysis should check that set operations are
only performed on tables with equal numbers of columns
This patch adds an analyzer rule to ensure that set operations (union,
intersect, and except) are only applied to tables with the same number of
columns. Without this rule, there are scenarios where invalid queries can
return incorrect results instead of failing with error messages; SPARK-9813
provides one example of this problem. In other cases, the invalid query can
crash at runtime with extremely confusing exceptions.
I also performed a bit of cleanup to refactor some of those logical
operators' code into a common `SetOperation` base class.
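The idea behind the rule can be sketched as follows. This is a simplified Python illustration, not the Catalyst analyzer rule itself; `AnalysisError` and `check_set_operation` are hypothetical names, and each side of the operation is reduced to a list of column names.

```python
class AnalysisError(Exception):
    """Hypothetical error type raised when a query fails analysis."""


def check_set_operation(op_name, left_columns, right_columns):
    """Fail early if the two sides of a set operation differ in column count."""
    if len(left_columns) != len(right_columns):
        raise AnalysisError(
            f"{op_name} can only be performed on tables with the same number "
            f"of columns, but the left table has {len(left_columns)} columns "
            f"and the right has {len(right_columns)}"
        )
```

Failing during analysis gives the user a clear message instead of an incorrect result or a confusing runtime exception.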
Author: Josh Rosen <[email protected]>
Closes #7631 from JoshRosen/SPARK-9293.
commit d4549fe58fa0d781e0e891bceff893420cb1d598
Author: Yu ISHIKAWA <[email protected]>
Date: 2015-08-25T07:28:51Z
[SPARK-10214] [SPARKR] [DOCS] Improve SparkR Column, DataFrame API docs
cc: shivaram
## Summary
- Add name tags to each methods in DataFrame.R and column.R
- Replace `rdname column` with `rdname {each_func}`. i.e. alias method :
`rdname column` => `rdname alias`
## Generated PDF File
https://drive.google.com/file/d/0B9biIZIU47lLNHN2aFpnQXlSeGs/view?usp=sharing
## JIRA
[[SPARK-10214] Improve SparkR Column, DataFrame API docs - ASF
JIRA](https://issues.apache.org/jira/browse/SPARK-10214)
Author: Yu ISHIKAWA <[email protected]>
Closes #8414 from yu-iskw/SPARK-10214.
commit 57b960bf3706728513f9e089455a533f0244312e
Author: Sean Owen <[email protected]>
Date: 2015-08-25T07:32:20Z
[SPARK-6196] [BUILD] Remove MapR profiles in favor of hadoop-provided
Follow up to https://github.com/apache/spark/pull/7047
pwendell mentioned that MapR should use `hadoop-provided` now, and indeed
the new build script does not produce `mapr3`/`mapr4` artifacts anymore. Hence
the action seems to be to remove the profiles, which are now not used.
CC trystanleftwich
Author: Sean Owen <[email protected]>
Closes #8338 from srowen/SPARK-6196.
commit 1fc37581a52530bac5d555dbf14927a5780c3b75
Author: Tathagata Das <[email protected]>
Date: 2015-08-25T07:35:51Z
[SPARK-10210] [STREAMING] Filter out non-existent blocks before creating
BlockRDD
When the write ahead log is not enabled, a recovered streaming driver still
tries to run jobs using pre-failure block ids, and fails because the blocks no
longer exist in memory (and cannot be recovered, as the receiver WAL is not
enabled).
This occurs because the driver-side WAL of ReceivedBlockTracker recovers
the past block information, and ReceiverInputDStream creates BlockRDDs even if
those blocks do not exist.
The solution in this PR is to filter out block ids that do not exist before
creating the BlockRDD. In addition, it adds unit tests to verify other logic in
ReceiverInputDStream.
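The filtering step amounts to something like the sketch below. It is written in Python for illustration only: `existing_block_ids` and the `block_exists` callback are hypothetical names, standing in for whatever lookup the driver uses to check whether a block is still available.

```python
def existing_block_ids(block_ids, block_exists):
    """Keep only the block ids that are still available.

    `block_exists` is a hypothetical callback returning True when the block
    with the given id can still be read.
    """
    valid = [block_id for block_id in block_ids if block_exists(block_id)]
    missing = len(block_ids) - len(valid)
    if missing:
        # Surface the loss instead of failing the job later on.
        print(f"Some blocks could not be recovered: {missing} missing")
    return valid
```

Only the surviving ids would then be passed on when building the RDD for the batch.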
Author: Tathagata Das <[email protected]>
Closes #8405 from tdas/SPARK-10210.
commit 2f493f7e3924b769160a16f73cccbebf21973b91
Author: Davies Liu <[email protected]>
Date: 2015-08-25T08:00:44Z
[SPARK-10177] [SQL] fix reading Timestamp in parquet from Hive
We misunderstood how Hive/Impala store the Julian day and the nanoseconds of
the day for a parquet timestamp (read as TimestampType): the two values
overlap, so they can't be added together directly.
To avoid confusing rounding during the conversion, we use
`2440588` as the Julian Day of the Unix epoch (which strictly should be
2440587.5).
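A sketch of the resulting arithmetic, assuming the stored day count is paired with nanoseconds counted from midnight of that day. The function names are illustrative and this is Python rather than the Scala in the patch; only the constant `2440588` comes from the description above.

```python
JULIAN_DAY_OF_EPOCH = 2440588  # integer day number used for 1970-01-01
SECONDS_PER_DAY = 86400
MICROS_PER_SECOND = 1_000_000
MICROS_PER_DAY = SECONDS_PER_DAY * MICROS_PER_SECOND


def from_julian_day(day, nanos_of_day):
    """Convert (Julian day, nanoseconds of day) to microseconds since epoch."""
    seconds = (day - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY
    return seconds * MICROS_PER_SECOND + nanos_of_day // 1000


def to_julian_day(micros):
    """Convert microseconds since epoch to (Julian day, nanoseconds of day)."""
    # divmod floors, so timestamps before 1970 get a non-negative remainder.
    days, micros_of_day = divmod(micros, MICROS_PER_DAY)
    return days + JULIAN_DAY_OF_EPOCH, micros_of_day * 1000
```

Using the integer day number keeps both directions in whole days plus a non-negative offset, so no half-day adjustment is needed.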
Author: Davies Liu <[email protected]>
Author: Cheng Lian <[email protected]>
Closes #8400 from davies/timestamp_parquet.
commit 7bc9a8c6249300ded31ea931c463d0a8f798e193
Author: Josh Rosen <[email protected]>
Date: 2015-08-25T08:06:36Z
[SPARK-10195] [SQL] Data sources Filter should not expose internal types
Spark SQL's data sources API exposes Catalyst's internal types through its
Filter interfaces. This is a problem because types like UTF8String are not
stable developer APIs and should not be exposed to third-parties.
This issue caused incompatibilities when upgrading our `spark-redshift`
library to work against Spark 1.5.0. To avoid these issues in the future we
should only expose public types through these Filter objects. This patch
accomplishes this by using CatalystTypeConverters to add the appropriate
conversions.
Author: Josh Rosen <[email protected]>
Closes #8403 from JoshRosen/datasources-internal-vs-external-types.
commit 0e6368ffaec1965d0c7f89420e04a974675c7f6e
Author: Yin Huai <[email protected]>
Date: 2015-08-25T08:19:34Z
[SPARK-10197] [SQL] Add null check in wrapperFor (inside HiveInspectors).
https://issues.apache.org/jira/browse/SPARK-10197
Author: Yin Huai <[email protected]>
Closes #8407 from yhuai/ORCSPARK-10197.
commit 5c14890159a5711072bf395f662b2433a389edf9
Author: Zhang, Liye <[email protected]>
Date: 2015-08-25T10:48:55Z
[DOC] add missing parameters in SparkContext.scala for scala doc
Author: Zhang, Liye <[email protected]>
Closes #8412 from liyezhang556520/minorDoc.
commit 7f1e507bf7e82bff323c5dec3c1ee044687c4173
Author: ehnalis <[email protected]>
Date: 2015-08-25T11:30:06Z
Fixed a typo in DAGScheduler.
Author: ehnalis <[email protected]>
Closes #8308 from ehnalis/master.
commit 69c9c177160e32a2fbc9b36ecc52156077fca6fc
Author: Sean Owen <[email protected]>
Date: 2015-08-25T11:33:13Z
[SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing
uses to JavaConverters
Replace `JavaConversions` implicits with `JavaConverters`
Most occurrences I've seen so far are necessary conversions; a few have
been avoidable. None are in critical code as far as I see, yet.
Author: Sean Owen <[email protected]>
Closes #8033 from srowen/SPARK-9613.
----