GitHub user tdas opened a pull request:
https://github.com/apache/spark/pull/16849
[SPARK-19413][SS] MapGroupsWithState for arbitrary stateful operations for
branch-2.1
## What changes were proposed in this pull request?
`mapGroupsWithState` is a new API for arbitrary stateful operations in
Structured Streaming, similar to `DStream.mapWithState`.
*Requirements*
- Users should be able to specify a function that can do the following
  - Access the input rows corresponding to a key
  - Access the previous state corresponding to a key
  - Optionally, update or remove the state
  - Output any number of new rows (or none at all)
*Proposed API*
```
// ------------ New methods on KeyValueGroupedDataset ------------
class KeyValueGroupedDataset[K, V] {
  // Scala friendly
  def mapGroupsWithState[S: Encoder, U: Encoder](
      func: (K, Iterator[V], KeyedState[S]) => U)
  def flatMapGroupsWithState[S: Encoder, U: Encoder](
      func: (K, Iterator[V], KeyedState[S]) => Iterator[U])

  // Java friendly
  def mapGroupsWithState[S, U](
      func: MapGroupsWithStateFunction[K, V, S, U],
      stateEncoder: Encoder[S],
      resultEncoder: Encoder[U])
  def flatMapGroupsWithState[S, U](
      func: FlatMapGroupsWithStateFunction[K, V, S, U],
      stateEncoder: Encoder[S],
      resultEncoder: Encoder[U])
}

// ------------------- New Java-friendly function classes -------------------
public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable {
  R call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
}
public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable {
  Iterator<R> call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
}

// ---------------------- Wrapper class for state data ----------------------
trait KeyedState[S] {
  def exists(): Boolean
  def get(): S                   // throws an exception if the state does not exist
  def getOption(): Option[S]
  def update(newState: S): Unit
  def remove(): Unit             // exists() will be false after this
}
```
Key Semantics of the State class
- The state can be null.
- If state.remove() is called, then state.exists() will return false, and
getOption will return None.
- After state.update(newState) is called, state.exists() will return true,
and getOption will return Some(...).
- None of the operations are thread-safe. This is to avoid memory barriers.
*Usage*
```
val stateFunc = (word: String, words: Iterator[String], runningCount: KeyedState[Long]) => {
  val newCount = words.size + runningCount.getOption.getOrElse(0L)
  runningCount.update(newCount)
  (word, newCount)
}

dataset                                                 // type is Dataset[String]
  .groupByKey[String](w => w)                           // generates KeyValueGroupedDataset[String, String]
  .mapGroupsWithState[Long, (String, Long)](stateFunc)  // returns Dataset[(String, Long)]
```
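For completeness, here is a hedged sketch of the `flatMapGroupsWithState` variant written against the proposed API above, illustrating the `remove()`/`update()` semantics described earlier. The `sessions` dataset, the `"END"` marker, and the output shape are illustrative assumptions, not part of this PR.
```scala
// Sketch only: `sessions` is an assumed Dataset[(String, String)] of (user, event) pairs.
val sessionFunc = (user: String, events: Iterator[String], state: KeyedState[Long]) => {
  val evts  = events.toSeq                      // materialize: the iterator can only be traversed once
  val count = evts.size + state.getOption.getOrElse(0L)
  if (evts.contains("END")) {
    state.remove()                              // exists() is false and getOption is None afterwards
    Iterator((user, count))                     // emit the final count for this key
  } else {
    state.update(count)                         // exists() is true and getOption is Some(count)
    Iterator.empty
  }
}

sessions
  .groupByKey(_._1)                             // KeyValueGroupedDataset[String, (String, String)]
  .flatMapGroupsWithState[Long, (String, Long)] { (user, rows, state) =>
    sessionFunc(user, rows.map(_._2), state)
  }
```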
## How was this patch tested?
New unit tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tdas/spark mapWithState-branch-2.1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16849.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16849
----
commit e8d8e350998e6e44a6dee7f78dbe2d1aa997c1d6
Author: [email protected] <[email protected]>
Date: 2016-12-01T04:32:17Z
[SPARK-18476][SPARKR][ML] SparkR Logistic Regression should support
outputting the original label.
## What changes were proposed in this pull request?
Similar to SPARK-18401, as a classification algorithm, logistic regression
should output the original label instead of the index label.
In this PR, original label output is supported, test cases are modified and
added, and the documentation is updated.
## How was this patch tested?
Unit tests.
Author: [email protected] <[email protected]>
Closes #15910 from wangmiao1981/audit.
(cherry picked from commit 2eb6764fbb23553fc17772d8a4a1cad55ff7ba6e)
Signed-off-by: Yanbo Liang <[email protected]>
commit 9dc3ef6e11b7dd3fd916d1442733938dcb5750e3
Author: Eric Liang <[email protected]>
Date: 2016-12-01T08:48:10Z
[SPARK-18635][SQL] Partition name/values not escaped correctly in some cases
## What changes were proposed in this pull request?
Due to confusion between URIs and paths, in certain cases we escape partition
values too many times, which causes some Hive client operations to fail or
write data to the wrong location. This PR fixes at least some of these cases.
To my understanding this is how values, filesystem paths, and URIs interact.
- Hive stores raw (unescaped) partition values that are returned to you
directly when you call listPartitions.
- Internally, we convert these raw values to filesystem paths via
`ExternalCatalogUtils.[un]escapePathName`.
- In some circumstances we store URIs instead of filesystem paths. When a
path is converted to a URI via `path.toURI`, the escaped partition values are
further URI-encoded. This means that to get a path back from a URI, you must
call `new Path(new URI(uriTxt))` in order to decode the URI-encoded string.
- In `CatalogStorageFormat` we store URIs as strings. This makes it easy to
forget to URI-decode the value before converting it into a path.
- Finally, the Hive client itself uses mostly Paths for representing
locations, and only URIs occasionally.
In the future we should probably clean this up, perhaps by dropping use of
URIs when unnecessary. We should also try fixing escaping for partition names
as well as values, though names are unlikely to contain special characters.
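To make the double-encoding hazard concrete, here is a hedged sketch using plain `java.net.URI` rather than Spark's internal `ExternalCatalogUtils` helpers; the partition value and paths are hypothetical.
```scala
import java.net.URI

// An already-escaped partition value as it might appear in a filesystem path.
val escapedPath = "/warehouse/tbl/p=a%20b"

// Converting the path to a URI encodes it again ("%20" becomes "%2520").
val asUri = new URI("file", null, escapedPath, null)
println(asUri.toString)                       // file:/warehouse/tbl/p=a%2520b

// Correct: decode the URI to recover the original path.
println(new URI(asUri.toString).getPath)      // /warehouse/tbl/p=a%20b

// Incorrect: treating the URI string as a raw path keeps the extra encoding.
println(asUri.toString.stripPrefix("file:"))  // /warehouse/tbl/p=a%2520b
```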
cc mallman cloud-fan yhuai
## How was this patch tested?
Unit tests.
Author: Eric Liang <[email protected]>
Closes #16071 from ericl/spark-18635.
(cherry picked from commit 88f559f20a5208f2386b874eb119f1cba2c748c7)
Signed-off-by: Wenchen Fan <[email protected]>
commit 8579ab5d7092a65f044fd925ecd5b790305f0aef
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-12-01T09:57:58Z
[SPARK-18666][WEB UI] Remove the codes checking deprecated config
spark.sql.unsafe.enabled
## What changes were proposed in this pull request?
`spark.sql.unsafe.enabled` has been deprecated since 1.6, but there is still
code in the UI that checks it. We should remove that check and clean up the code.
## How was this patch tested?
Changes to related existing unit test.
Author: Liang-Chi Hsieh <[email protected]>
Closes #16095 from viirya/remove-deprecated-config-code.
(cherry picked from commit dbf842b7a8479f9566146192ffc04421591742d5)
Signed-off-by: Reynold Xin <[email protected]>
commit cbbe217777173b100de2f5a613c46428974826f6
Author: Yuming Wang <[email protected]>
Date: 2016-12-01T13:14:09Z
[SPARK-18645][DEPLOY] Fix spark-daemon.sh argument handling error that leads
to an Unrecognized option exception
## What changes were proposed in this pull request?
spark-daemon.sh loses the single quotes around arguments after #15338, as follows:
```
execute_command nice -n 0 bash
/opt/cloudera/parcels/SPARK-2.1.0-cdh5.4.3.d20161129-21.04.38/lib/spark/bin/spark-submit
--class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift
JDBC/ODBC Server --conf spark.driver.extraJavaOptions=-XX:+UseG1GC
-XX:-HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp
```
With this fix, as follows:
```
execute_command nice -n 0 bash
/opt/cloudera/parcels/SPARK-2.1.0-cdh5.4.3.d20161129-21.04.38/lib/spark/bin/spark-submit
--class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name
'Thrift JDBC/ODBC Server' --conf 'spark.driver.extraJavaOptions=-XX:+UseG1GC
-XX:-HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp'
```
## How was this patch tested?
- Manual tests
- Build the package and start-thriftserver.sh with `--conf
'spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:-HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp'`
Author: Yuming Wang <[email protected]>
Closes #16079 from wangyum/SPARK-18645.
(cherry picked from commit 2ab8551e79e1655c406c358b21c0a1e719f498be)
Signed-off-by: Sean Owen <[email protected]>
commit 6916ddc385fc33fa390e541300ca2bb1dbd0599c
Author: Wenchen Fan <[email protected]>
Date: 2016-12-01T19:53:12Z
[SPARK-18674][SQL] improve the error message of using join
## What changes were proposed in this pull request?
The current error message for a USING join is quite confusing, for example:
```
scala> val df1 = List(1,2,3).toDS.withColumnRenamed("value", "c1")
df1: org.apache.spark.sql.DataFrame = [c1: int]
scala> val df2 = List(1,2,3).toDS.withColumnRenamed("value", "c2")
df2: org.apache.spark.sql.DataFrame = [c2: int]
scala> df1.join(df2, usingColumn = "c1")
org.apache.spark.sql.AnalysisException: using columns ['c1] can not be
resolved given input columns: [c1, c2] ;;
'Join UsingJoin(Inner,List('c1))
:- Project [value#1 AS c1#3]
: +- LocalRelation [value#1]
+- Project [value#7 AS c2#9]
+- LocalRelation [value#7]
```
after this PR, it becomes:
```
scala> val df1 = List(1,2,3).toDS.withColumnRenamed("value", "c1")
df1: org.apache.spark.sql.DataFrame = [c1: int]
scala> val df2 = List(1,2,3).toDS.withColumnRenamed("value", "c2")
df2: org.apache.spark.sql.DataFrame = [c2: int]
scala> df1.join(df2, usingColumn = "c1")
org.apache.spark.sql.AnalysisException: USING column `c1` can not be
resolved with the right join side, the right output is: [c2];
```
## How was this patch tested?
updated tests
Author: Wenchen Fan <[email protected]>
Closes #16100 from cloud-fan/natural.
(cherry picked from commit e6534847100670a22b3b191a0f9d924fab7f3c02)
Signed-off-by: Herman van Hovell <[email protected]>
commit 4c673c656d52d29813979e942851b9205e4ace06
Author: Sandeep Singh <[email protected]>
Date: 2016-12-01T21:22:40Z
[SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper
## What changes were proposed in this pull request?
In `JavaWrapper`'s destructor, make the Java Gateway dereference the object,
using `SparkContext._active_spark_context._gateway.detach`.
Also fix the parameter-copying bug by moving the `copy` method from
`JavaModel` to `JavaParams`.
## How was this patch tested?
```python
import random, string
from pyspark.ml.feature import StringIndexer

# 700000 random strings of 10 characters
l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), )
     for _ in range(int(7e5))]
df = spark.createDataFrame(l, ['string'])

for i in range(50):
    indexer = StringIndexer(inputCol='string', outputCol='index')
    indexer.fit(df)
```
* Before: kept a strong reference to the StringIndexer, causing GC issues, and
the run halted midway.
* After: garbage collection works as the object is dereferenced, and
the computation completes.
* Memory footprint tested using a profiler.
* Added a parameter-copy related test which was failing before.
Author: Sandeep Singh <[email protected]>
Author: jkbradley <[email protected]>
Closes #15843 from techaddict/SPARK-18274.
(cherry picked from commit 78bb7f8071379114314c394e0167c4c5fd8545c5)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit 4746674ad3acfc38bbd3e2708d75280c19ef0202
Author: Shixiong Zhu <[email protected]>
Date: 2016-12-01T22:22:49Z
[SPARK-18617][SPARK-18560][TESTS] Fix flaky test: StreamingContextSuite.
Receiver data should be deserialized properly
## What changes were proposed in this pull request?
Avoid creating multiple threads to stop StreamingContext. Otherwise, the
latch added in #16091 can be passed too early.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <[email protected]>
Closes #16105 from zsxwing/SPARK-18617-2.
(cherry picked from commit 086b0c8f6788b205bc630d5ccf078f77b9751af3)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 2d2e80180f3b746df9e45a49bc62da31a37dadb8
Author: Reynold Xin <[email protected]>
Date: 2016-12-02T01:58:28Z
[SPARK-18639] Build only a single pip package
## What changes were proposed in this pull request?
We currently build 5 separate pip binary tarballs, doubling the release
script runtime. It'd be better to build one, especially for use cases that are
just using Spark locally. In the long run, it would make more sense to have
Hadoop support be pluggable.
## How was this patch tested?
N/A - this is a release build script that doesn't have any automated test
coverage. We will know if it goes wrong when we prepare releases.
Author: Reynold Xin <[email protected]>
Closes #16072 from rxin/SPARK-18639.
(cherry picked from commit 37e52f8793bff306a7ae5a9aecc16f28333b70e3)
Signed-off-by: Reynold Xin <[email protected]>
commit 2f91b0154ee0674b65e80f81f6498b94666c4b46
Author: sureshthalamati <[email protected]>
Date: 2016-12-02T03:13:38Z
[SPARK-18141][SQL] Fix to quote column names in the predicate clause of
the JDBC RDD generated sql statement
## What changes were proposed in this pull request?
The SQL query generated for the JDBC data source does not quote columns in the
predicate clause. When the source table has quoted column names, Spark JDBC
reads incorrectly fail with a column-not-found error.
Error:
org.h2.jdbc.JdbcSQLException: Column "ID" not found;
Source SQL statement:
SELECT "Name","Id" FROM TEST."mixedCaseCols" WHERE (Id < 1)
This PR fixes the issue by quoting column names in the generated SQL predicate
clause when filters are pushed down to the data source.
Source SQL statement after the fix:
SELECT "Name","Id" FROM TEST."mixedCaseCols" WHERE ("Id" < 1)
## How was this patch tested?
Added new test case to the JdbcSuite
Author: sureshthalamati <[email protected]>
Closes #15662 from sureshthalamati/filter_quoted_cols-SPARK-18141.
(cherry picked from commit 70c5549ee9588228d18a7b405c977cf591e2efd4)
Signed-off-by: gatorsmile <[email protected]>
commit b9eb10043129defa53c5bdfd1190fe68c0107b3b
Author: gatorsmile <[email protected]>
Date: 2016-12-02T03:15:26Z
[SPARK-18538][SQL][BACKPORT-2.1] Fix Concurrent Table Fetching Using
DataFrameReader JDBC APIs
### What changes were proposed in this pull request?
#### This PR is to backport https://github.com/apache/spark/pull/15975 to
Branch 2.1
---
The following two `DataFrameReader` JDBC APIs ignore the user-specified
parameters of parallelism degree.
```Scala
def jdbc(
    url: String,
    table: String,
    columnName: String,
    lowerBound: Long,
    upperBound: Long,
    numPartitions: Int,
    connectionProperties: Properties): DataFrame
```
```Scala
def jdbc(
    url: String,
    table: String,
    predicates: Array[String],
    connectionProperties: Properties): DataFrame
```
This PR fixes these issues. To verify correctness, we improve the plan output
of the `EXPLAIN` command by adding `numPartitions` to the `JDBCRelation` node.
Before the fix,
```
== Physical Plan ==
*Scan JDBCRelation(TEST.PEOPLE) [NAME#1896,THEID#1897] ReadSchema:
struct<NAME:string,THEID:int>
```
After the fix,
```
== Physical Plan ==
*Scan JDBCRelation(TEST.PEOPLE) [numPartitions=3] [NAME#1896,THEID#1897]
ReadSchema: struct<NAME:string,THEID:int>
```
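A hedged usage sketch of the first overload follows; `spark` is an assumed SparkSession, and the connection URL, table, and bounds are illustrative (a matching JDBC driver must be on the classpath).
```scala
import java.util.Properties

val props = new Properties()
props.setProperty("user", "test")

// Request a range-partitioned read; once the parameters are honored,
// EXPLAIN reports numPartitions in the JDBCRelation node.
val people = spark.read.jdbc(
  "jdbc:h2:mem:testdb",   // url
  "TEST.PEOPLE",          // table
  "THEID",                // partition column
  0L, 100L,               // lowerBound, upperBound
  3,                      // numPartitions
  props)

people.explain()          // *Scan JDBCRelation(TEST.PEOPLE) [numPartitions=3] ...
assert(people.rdd.getNumPartitions == 3)
```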
### How was this patch tested?
Added the verification logics on all the test cases for JDBC concurrent
fetching.
Author: gatorsmile <[email protected]>
Closes #16111 from gatorsmile/jdbcFix2.1.
commit fce1be6cc81b1fe3991a4df91128f4fcd14ff615
Author: Kazuaki Ishizaki <[email protected]>
Date: 2016-12-02T04:30:13Z
[SPARK-18284][SQL] Make ExpressionEncoder.serializer.nullable precise
## What changes were proposed in this pull request?
This PR makes `ExpressionEncoder.serializer.nullable` `false` for a flat
encoder of a primitive type. It is currently `true`, which is too conservative:
while `ExpressionEncoder.schema` has the correct information (e.g.
`<IntegerType, false>`), `serializer.head.nullable` of the `ExpressionEncoder`
obtained from `encoderFor[T]` is always `true`.
This is accomplished by checking whether the type is one of the primitive
types; if it is, `nullable` is set to `false`.
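A hedged sketch of what this means in practice, using the identifiers from the description above; the exact internal `serializer` representation and printed output may differ between Spark versions.
```scala
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

val enc = ExpressionEncoder[Int]()
// The schema already knows an Int column cannot be null,
// e.g. StructType(StructField(value, IntegerType, false)).
println(enc.schema)
// Before this change the serializer expression was still marked nullable = true;
// after it, a flat primitive encoder reports nullable = false.
println(enc.serializer.head.nullable)
```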
## How was this patch tested?
Added new tests for encoder and dataframe
Author: Kazuaki Ishizaki <[email protected]>
Closes #15780 from kiszk/SPARK-18284.
(cherry picked from commit 38b9e69623c14a675b14639e8291f5d29d2a0bc3)
Signed-off-by: Wenchen Fan <[email protected]>
commit 0f0903d17b9c71a569d92f2c35e2caeb1eb8c89f
Author: Wenchen Fan <[email protected]>
Date: 2016-12-02T04:54:12Z
[SPARK-18647][SQL] do not put provider in table properties for Hive serde
table
## What changes were proposed in this pull request?
In Spark 2.1, we make Hive serde tables case-preserving by putting the
table metadata in table properties, like what we did for data source table.
However, we should not put table provider, as it will break forward
compatibility. e.g. if we create a Hive serde table with Spark 2.1, using
`sql("create table test stored as parquet as select 1")`, we will fail to read
it with Spark 2.0, as Spark 2.0 mistakenly treats it as a data source table
because there is a `provider` entry in the table properties.
Logically a Hive serde table's provider is always `hive`, so we don't need to
store it in table properties; this PR removes it.
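As a hedged illustration of how one could inspect the stored metadata (the table name follows the example above; the exact property keys vary by Spark version):
```scala
// Create a Hive serde table with CTAS, then list its table properties; after this
// change, no provider entry should be persisted for Hive serde tables.
spark.sql("CREATE TABLE test STORED AS parquet AS SELECT 1")
spark.sql("SHOW TBLPROPERTIES test").show(false)
```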
## How was this patch tested?
Manually tested the forward compatibility issue.
Author: Wenchen Fan <[email protected]>
Closes #16080 from cloud-fan/hive.
(cherry picked from commit a5f02b00291e0a22429a3dca81f12cf6d38fea0b)
Signed-off-by: Wenchen Fan <[email protected]>
commit a7f8ebb8629706c54c286b7aca658838e718e804
Author: Cheng Lian <[email protected]>
Date: 2016-12-02T06:02:45Z
[SPARK-17213][SQL] Disable Parquet filter push-down for string and binary
columns due to PARQUET-686
This PR targets to both master and branch-2.1.
## What changes were proposed in this pull request?
Due to PARQUET-686, Parquet doesn't do string comparison correctly while
doing filter push-down for string columns. This PR disables filter push-down
for both string and binary columns to work around this issue. Binary columns
are also affected because some Parquet data models (like Hive) may store string
columns as a plain Parquet `binary` instead of a `binary (UTF8)`.
## How was this patch tested?
New test case added in `ParquetFilterSuite`.
Author: Cheng Lian <[email protected]>
Closes #16106 from liancheng/spark-17213-bad-string-ppd.
(cherry picked from commit ca6391637212814b7c0bd14c434a6737da17b258)
Signed-off-by: Reynold Xin <[email protected]>
commit 65e896a6e9a5378f2d3a02c0c2a57fdb8d8f1d9d
Author: Eric Liang <[email protected]>
Date: 2016-12-02T12:59:39Z
[SPARK-18679][SQL] Fix regression in file listing performance for
non-catalog tables
## What changes were proposed in this pull request?
In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed
to InMemoryFileIndex). This introduced a regression where parallelism could
only be introduced at the very top of the tree. However, in many cases (e.g.
`spark.read.parquet(topLevelDir)`), the top of the tree is only a single
directory.
This PR simplifies and fixes the parallel recursive listing code to allow
parallelism to be introduced at any level during recursive descent (though note
that once we decide to list a sub-tree in parallel, the sub-tree is listed in
serial on executors).
cc mallman cloud-fan
## How was this patch tested?
Checked metrics in unit tests.
Author: Eric Liang <[email protected]>
Closes #16112 from ericl/spark-18679.
(cherry picked from commit 294163ee9319e4f7f6da1259839eb3c80bba25c2)
Signed-off-by: Wenchen Fan <[email protected]>
commit 415730e19cea3a0e7ea5491bf801a22859bbab66
Author: Dongjoon Hyun <[email protected]>
Date: 2016-12-02T13:48:22Z
[SPARK-18419][SQL] `JDBCRelation.insert` should not remove Spark options
## What changes were proposed in this pull request?
Currently, `JDBCRelation.insert` removes Spark options too early by
mistakenly using `asConnectionProperties`. Spark options like `numPartitions`
should be passed into `DataFrameWriter.jdbc` correctly. This bug has been
**hidden** because `JDBCOptions.asConnectionProperties` fails to filter out the
mixed-case options. This PR aims to fix both.
**JDBCRelation.insert**
```scala
override def insert(data: DataFrame, overwrite: Boolean): Unit = {
  val url = jdbcOptions.url
  val table = jdbcOptions.table
- val properties = jdbcOptions.asConnectionProperties
+ val properties = jdbcOptions.asProperties
  data.write
    .mode(if (overwrite) SaveMode.Overwrite else SaveMode.Append)
    .jdbc(url, table, properties)
}
```
**JDBCOptions.asConnectionProperties**
```scala
scala> import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
scala> import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
scala> new JDBCOptions(Map("url" -> "jdbc:mysql://localhost:3306/temp",
"dbtable" -> "t1", "numPartitions" -> "10")).asConnectionProperties
res0: java.util.Properties = {numpartitions=10}
scala> new JDBCOptions(new CaseInsensitiveMap(Map("url" ->
"jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" ->
"10"))).asConnectionProperties
res1: java.util.Properties = {numpartitions=10}
```
## How was this patch tested?
Pass the Jenkins with a new testcase.
Author: Dongjoon Hyun <[email protected]>
Closes #15863 from dongjoon-hyun/SPARK-18419.
(cherry picked from commit 55d528f2ba0ba689dbb881616d9436dc7958e943)
Signed-off-by: Wenchen Fan <[email protected]>
commit e374b2426114d841e1935719f6e21919475f6804
Author: Eric Liang <[email protected]>
Date: 2016-12-02T13:59:02Z
[SPARK-18659][SQL] Incorrect behaviors in overwrite table for datasource
tables
## What changes were proposed in this pull request?
Two bugs are addressed here
1. INSERT OVERWRITE TABLE sometimes crashed when catalog partition
management was enabled. This was because when dropping partitions after an
overwrite operation, the Hive client will attempt to delete the partition
files. If the entire partition directory was dropped, this would fail. The PR
fixes this by adding a flag to control whether the Hive client should attempt
to delete files.
2. The static partition spec for OVERWRITE TABLE was not correctly resolved
to the case-sensitive original partition names. This resulted in the entire
table being overwritten if you did not correctly capitalize your partition
names.
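To make bug 2 concrete, here is a hedged sketch; the table, column, and value names are hypothetical. With the bug, the mis-capitalized static partition column in the final INSERT could resolve incorrectly and clobber the whole table instead of only the targeted partition.
```scala
// Partitioned datasource table with a lower-case partition column "p".
spark.sql("CREATE TABLE t (id INT, p INT) USING parquet PARTITIONED BY (p)")
spark.sql("INSERT INTO t PARTITION (p = 1) SELECT 10")
spark.sql("INSERT INTO t PARTITION (p = 2) SELECT 20")

// Static partition spec written with different capitalization ("P" instead of "p").
// After the fix this resolves to "p" and only the p=2 partition is overwritten.
spark.sql("INSERT OVERWRITE TABLE t PARTITION (P = 2) SELECT 99")
```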
cc yhuai cloud-fan
## How was this patch tested?
Unit tests. Surprisingly, the existing overwrite table tests did not catch
these edge cases.
Author: Eric Liang <[email protected]>
Closes #16088 from ericl/spark-18659.
(cherry picked from commit 7935c8470c5c162ef7213e394fe8588e5dd42ca2)
Signed-off-by: Wenchen Fan <[email protected]>
commit 32c85383bfd6210e96b4bbcdedbe27a88935e4c7
Author: gatorsmile <[email protected]>
Date: 2016-12-02T14:12:19Z
[SPARK-18674][SQL][FOLLOW-UP] improve the error message of using join
### What changes were proposed in this pull request?
Added a test case for using joins with nested fields.
### How was this patch tested?
N/A
Author: gatorsmile <[email protected]>
Closes #16110 from gatorsmile/followup-18674.
(cherry picked from commit 2f8776ccad532fbed17381ff97d302007918b8d8)
Signed-off-by: Wenchen Fan <[email protected]>
commit c69825a98989ee975dc8b87979e29e0fff15a3f7
Author: Ryan Blue <[email protected]>
Date: 2016-12-02T16:41:40Z
[SPARK-18677] Fix parsing ['key'] in JSON path expressions.
## What changes were proposed in this pull request?
This fixes the parser rule that matches named expressions, which previously
didn't work for two reasons:
1. The name match is not coerced to a regular expression (missing .r)
2. The surrounding literals are incorrect and attempt to escape a single
quote, which is unnecessary
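A hedged example of the bracketed JSON path syntax this enables; the JSON document and field name are illustrative.
```scala
// get_json_object with a bracketed, quoted field name containing a space.
spark.sql("""SELECT get_json_object('{"key with space": 42}', "$['key with space']")""").show()
// expected result: 42
```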
## How was this patch tested?
This adds test cases for named expressions using the bracket syntax,
including one with quoted spaces.
Author: Ryan Blue <[email protected]>
Closes #16107 from rdblue/SPARK-18677-fix-json-path.
(cherry picked from commit 48778976e0566d9c93a8c900825def82c6b81fd6)
Signed-off-by: Herman van Hovell <[email protected]>
commit f915f8128bd47b9d668065f848d5d437365e564a
Author: Yanbo Liang <[email protected]>
Date: 2016-12-02T20:16:57Z
[SPARK-18291][SPARKR][ML] Revert "[SPARK-18291][SPARKR][ML] SparkR glm
predict should output original label when family = binomial."
## What changes were proposed in this pull request?
It's better to fix this issue by providing an option ```type``` for users to
change the ```predict``` output schema, so they could output probabilities,
log-space predictions, or original labels. In order to avoid a breaking API
change for 2.1, revert this change first and add it back after
[SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) is resolved.
## How was this patch tested?
Existing unit tests.
This reverts commit daa975f4bfa4f904697bf3365a4be9987032e490.
Author: Yanbo Liang <[email protected]>
Closes #16118 from yanboliang/spark-18291-revert.
(cherry picked from commit a985dd8e99d2663a3cb4745c675fa2057aa67155)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit f53763275ae1b74925e4123dd87f567798f16ba1
Author: Shixiong Zhu <[email protected]>
Date: 2016-12-02T20:42:47Z
[SPARK-18670][SS] Limit the number of
StreamingQueryListener.StreamProgressEvent when there is no data
## What changes were proposed in this pull request?
This PR adds a SQL conf `spark.sql.streaming.noDataReportInterval` to
control how long to wait before outputting the next StreamProgressEvent when
there is no data.
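A hedged sketch of setting the new conf; the duration value and its format are illustrative.
```scala
// Report "no data" progress events at most once per 10 seconds.
spark.conf.set("spark.sql.streaming.noDataReportInterval", "10s")
```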
## How was this patch tested?
The added unit test.
Author: Shixiong Zhu <[email protected]>
Closes #16108 from zsxwing/SPARK-18670.
(cherry picked from commit 56a503df5ccbb233ad6569e22002cc989e676337)
Signed-off-by: Tathagata Das <[email protected]>
commit 839d4e9ca94b132732225632e8c50364e53579a0
Author: Yanbo Liang <[email protected]>
Date: 2016-12-03T00:28:01Z
[SPARK-18324][ML][DOC] Update ML programming and migration guide for 2.1
release
## What changes were proposed in this pull request?
Update ML programming and migration guide for 2.1 release.
## How was this patch tested?
Doc change, no test.
Author: Yanbo Liang <[email protected]>
Closes #16076 from yanboliang/spark-18324.
(cherry picked from commit 2dc0d7efe3380a5763cb69ef346674a46f8e3d57)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit cf3dbec68d379763ee541bf3b7a4809e1f2d0cb7
Author: zero323 <[email protected]>
Date: 2016-12-03T01:39:28Z
[SPARK-18690][PYTHON][SQL] Backward compatibility of unbounded frames
## What changes were proposed in this pull request?
Makes `Window.unboundedPreceding` and `Window.unboundedFollowing` backward
compatible.
## How was this patch tested?
Pyspark SQL unittests.
Author: zero323 <[email protected]>
Closes #16123 from zero323/SPARK-17845-follow-up.
(cherry picked from commit a9cbfc4f6a8db936215fcf64697d5b65f13f666e)
Signed-off-by: Reynold Xin <[email protected]>
commit 28ea432a26953866eaf95b2fd32a251ecf0c8094
Author: hyukjinkwon <[email protected]>
Date: 2016-12-03T10:12:28Z
[SPARK-18685][TESTS] Fix URI and release resources after opening in tests
at ExecutorClassLoaderSuite
## What changes were proposed in this pull request?
This PR fixes the two problems below:
- Close the `BufferedSource` after `Source.fromInputStream(...)` to release the
resource and make the tests pass on Windows in `ExecutorClassLoaderSuite`
```
[info] Exception encountered when attempting to run a suite with class
name: org.apache.spark.repl.ExecutorClassLoaderSuite *** ABORTED *** (7
seconds, 333 milliseconds)
[info] java.io.IOException: Failed to delete:
C:\projects\spark\target\tmp\spark-77b2f37b-6405-47c4-af1c-4a6a206511f2
[info] at
org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
[info] at
org.apache.spark.repl.ExecutorClassLoaderSuite.afterAll(ExecutorClassLoaderSuite.scala:76)
[info] at
org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
...
```
- Fix URI correctly so that related tests can be passed on Windows.
```
[info] - child first *** FAILED *** (78 milliseconds)
[info] java.net.URISyntaxException: Illegal character in authority at
index 7:
file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
[info] at java.net.URI$Parser.fail(URI.java:2848)
[info] at java.net.URI$Parser.parseAuthority(URI.java:3186)
...
[info] - parent first *** FAILED *** (15 milliseconds)
[info] java.net.URISyntaxException: Illegal character in authority at
index 7:
file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
[info] at java.net.URI$Parser.fail(URI.java:2848)
[info] at java.net.URI$Parser.parseAuthority(URI.java:3186)
...
[info] - child first can fall back *** FAILED *** (0 milliseconds)
[info] java.net.URISyntaxException: Illegal character in authority at
index 7:
file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
[info] at java.net.URI$Parser.fail(URI.java:2848)
[info] at java.net.URI$Parser.parseAuthority(URI.java:3186)
...
[info] - child first can fail *** FAILED *** (0 milliseconds)
[info] java.net.URISyntaxException: Illegal character in authority at
index 7:
file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
[info] at java.net.URI$Parser.fail(URI.java:2848)
[info] at java.net.URI$Parser.parseAuthority(URI.java:3186)
...
[info] - resource from parent *** FAILED *** (0 milliseconds)
[info] java.net.URISyntaxException: Illegal character in authority at
index 7:
file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
[info] at java.net.URI$Parser.fail(URI.java:2848)
[info] at java.net.URI$Parser.parseAuthority(URI.java:3186)
...
[info] - resources from parent *** FAILED *** (0 milliseconds)
[info] java.net.URISyntaxException: Illegal character in authority at
index 7:
file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
[info] at java.net.URI$Parser.fail(URI.java:2848)
[info] at java.net.URI$Parser.parseAuthority(URI.java:3186)
```
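A hedged sketch of the resource-handling part of the fix; the stream variable and helper name are illustrative. The point is to close the `BufferedSource` once its contents are read so the underlying handle is released, which matters on Windows where open handles block directory deletion.
```scala
import java.io.InputStream
import scala.io.Source

def readFully(in: InputStream): String = {
  val source = Source.fromInputStream(in)
  try source.mkString
  finally source.close()   // release the underlying stream/handle
}
```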
## How was this patch tested?
Manually tested via AppVeyor.
**Before**
https://ci.appveyor.com/project/spark-test/spark/build/102-rpel-ExecutorClassLoaderSuite
**After**
https://ci.appveyor.com/project/spark-test/spark/build/108-rpel-ExecutorClassLoaderSuite
Author: hyukjinkwon <[email protected]>
Closes #16116 from HyukjinKwon/close-after-open.
(cherry picked from commit d1312fb7edffd6e10c86f69ddfff05f8915856ac)
Signed-off-by: Sean Owen <[email protected]>
commit b098b4845c557a3139c76caa0377c3049b6fe8aa
Author: Nattavut Sutyanyong <[email protected]>
Date: 2016-12-03T19:36:26Z
[SPARK-18582][SQL] Whitelist LogicalPlan operators allowed in correlated
subqueries
## What changes were proposed in this pull request?
This fix adds an explicit whitelist of the LogicalPlan operators that Spark
supports in correlated subqueries.
## How was this patch tested?
Run sql/test, catalyst/test and add a new test case on Generate.
Author: Nattavut Sutyanyong <[email protected]>
Closes #16046 from nsyca/spark18455.0.
(cherry picked from commit 4a3c09601ba69f7d49d1946bb6f20f5cfe453031)
Signed-off-by: Herman van Hovell <[email protected]>
commit 28f698b4845e6497d060270ba790cc60dc7e1a6e
Author: Yunni <[email protected]>
Date: 2016-12-04T00:58:15Z
[SPARK-18081][ML][DOCS] Add user guide for Locality Sensitive Hashing(LSH)
## What changes were proposed in this pull request?
The user guide for LSH is added to ml-features.md, with several scala/java
examples in spark-examples.
## How was this patch tested?
Doc has been generated through Jekyll, and checked through manual
inspection.
Author: Yunni <[email protected]>
Author: Yun Ni <[email protected]>
Author: Joseph K. Bradley <[email protected]>
Author: Yun Ni <[email protected]>
Closes #15795 from Yunni/SPARK-18081-lsh-guide.
(cherry picked from commit 34777184cd8cab61e1dd25d0a4d5e738880a57b2)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit 8145c82bc8e4c44e7b74695e2307bb837cde1207
Author: Kapil Singh <[email protected]>
Date: 2016-12-04T09:16:40Z
[SPARK-18091][SQL] Deep if expressions cause Generated
SpecificUnsafeProjection code to exceed JVM code size limit
## What changes were proposed in this pull request?
Fix for SPARK-18091, a bug where large `if` expressions cause the generated
SpecificUnsafeProjection code to exceed the JVM code size limit.
This PR changes the `if` expression's code generation to place the generated
code of its predicate, true-value, and false-value expressions in separate
methods of the codegen context, so that the combined code never grows too long.
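A hedged sketch of the kind of expression that could trigger the issue; the DataFrame `df`, the integer column `x`, and the nesting depth are illustrative assumptions.
```scala
import org.apache.spark.sql.functions._

// Build a deeply nested chain of conditionals; before the fix, all branches
// were emitted into one oversized generated method.
val deepIf = (1 to 500).foldLeft(lit(-1)) { (acc, i) =>
  when(col("x") === i, lit(i)).otherwise(acc)
}
df.select(deepIf.as("bucket"))
```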
## How was this patch tested?
Added a unit test and also tested manually with the application (having
transformations similar to the unit test) which caused the issue to be
identified in the first place.
Author: Kapil Singh <[email protected]>
Closes #15620 from kapilsingh5050/SPARK-18091-IfCodegenFix.
(cherry picked from commit e463678b194e08be4a8bc9d1d45461d6c77a15ee)
Signed-off-by: Wenchen Fan <[email protected]>
commit 41d698ecead46979e9a77b21e6a9c8f27cff63ac
Author: Eric Liang <[email protected]>
Date: 2016-12-04T12:44:04Z
[SPARK-18661][SQL] Creating a partitioned datasource table should not scan
all files for table
## What changes were proposed in this pull request?
Even though in 2.1 creating a partitioned datasource table will not
populate the partition data by default (until the user issues MSCK REPAIR
TABLE), it seems we still scan the filesystem for no good reason.
We should avoid doing this when the user specifies a schema.
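A hedged sketch of the affected pattern; the table name, schema, and location are illustrative. Creating a partitioned datasource table with an explicit schema should not require listing every file under the location.
```scala
spark.sql("""
  CREATE TABLE logs (id INT, day STRING)
  USING parquet
  OPTIONS (path '/data/logs')
  PARTITIONED BY (day)
""")
// Partition metadata is populated later, on demand:
// spark.sql("MSCK REPAIR TABLE logs")
```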
## How was this patch tested?
Perf stat tests.
Author: Eric Liang <[email protected]>
Closes #16090 from ericl/spark-18661.
(cherry picked from commit d9eb4c7215f26dd05527c0b9980af35087ab9d64)
Signed-off-by: Wenchen Fan <[email protected]>
commit c13c2939fb19901d86ee013aa7bb5e200d79be85
Author: Felix Cheung <[email protected]>
Date: 2016-12-05T04:25:11Z
[SPARK-18643][SPARKR] SparkR hangs at session start when installed as a
package without Spark
## What changes were proposed in this pull request?
If SparkR is running as a package and has previously downloaded the Spark jar,
it should be able to run as before without having to set SPARK_HOME.
Basically, with this bug the auto-installed Spark only works in the first
session. This seems to be a regression from the earlier behavior.
The fix is to always try to install or check for the cached Spark when running
in an interactive session.
As discussed before, we should probably only install Spark iff running in
an interactive session (R shell, RStudio, etc.)
## How was this patch tested?
Manually
Author: Felix Cheung <[email protected]>
Closes #16077 from felixcheung/rsessioninteractive.
(cherry picked from commit b019b3a8ac49336e657f5e093fa2fba77f8d12d2)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit 88e07efe86512142eeada6a6f1f7fe858204c59b
Author: Zheng RuiFeng <[email protected]>
Date: 2016-12-05T08:32:58Z
[SPARK-18625][ML] OneVsRestModel should support setFeaturesCol and
setPredictionCol
## What changes were proposed in this pull request?
add `setFeaturesCol` and `setPredictionCol` for `OneVsRestModel`
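A hedged usage sketch of the added setters; the training/test DataFrames and column names are illustrative assumptions.
```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

val ovr   = new OneVsRest().setClassifier(new LogisticRegression())
val model = ovr.fit(train)            // train: DataFrame with "label"/"features" columns

// Newly supported on the fitted model:
model.setFeaturesCol("myFeatures")
model.setPredictionCol("myPrediction")
val predictions = model.transform(test)
```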
## How was this patch tested?
added tests
Author: Zheng RuiFeng <[email protected]>
Closes #16059 from zhengruifeng/ovrm_setCol.
(cherry picked from commit bdfe7f67468ecfd9927a1fec60d6605dd05ebe3f)
Signed-off-by: Yanbo Liang <[email protected]>
commit 1821cbead1875fbe1c16d7c50563aa0839e1f70f
Author: Yanbo Liang <[email protected]>
Date: 2016-12-05T08:39:44Z
[SPARK-18279][DOC][ML][SPARKR] Add R examples to ML programming guide.
## What changes were proposed in this pull request?
Add R examples to ML programming guide for the following algorithms as POC:
* spark.glm
* spark.survreg
* spark.naiveBayes
* spark.kmeans
The four algorithms were added to SparkR since 2.0.0, more docs for
algorithms added during 2.1 release cycle will be addressed in a separate
follow-up PR.
## How was this patch tested?
Here are screenshots of the generated ML programming guide for
```GeneralizedLinearRegression```:

Author: Yanbo Liang <[email protected]>
Closes #16136 from yanboliang/spark-18279.
(cherry picked from commit eb8dd68132998aa00902dfeb935db1358781e1c1)
Signed-off-by: Yanbo Liang <[email protected]>
----