[GitHub] spark pull request #16748: [BACKPORT-2.1][SPARKR][DOCS] update R API doc for...

felixcheung Mon, 30 Jan 2017 18:53:37 -0800

GitHub user felixcheung opened a pull request:

    https://github.com/apache/spark/pull/16748


    [BACKPORT-2.1][SPARKR][DOCS] update R API doc for subset/extract

    ## What changes were proposed in this pull request?
    
    backport #16721 to branch-2.1
    
    ## How was this patch tested?
    
    manual

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rsubsetdocbackport

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16748.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16748
    
----
commit 5e4afbfb6e3993533cb0ab1bece2ea504801a7cb
Author: uncleGen <[email protected]>
Date:   2016-11-30T07:45:06Z

    [SPARK-18617][CORE][STREAMING] Close "kryo auto pick" feature for Spark 
Streaming
    
    ## What changes were proposed in this pull request?
    
    #15992 provided a solution to fix the bug, i.e. **receiver data can not be 
deserialized properly**. As zsxwing said, it is a critical bug, but we should 
not break APIs between maintenance releases. A reasonable first step is to 
disable automatic selection of the Kryo serializer for Spark Streaming. I will 
continue #15992 to optimize the solution.
    
    ## How was this patch tested?
    
    Existing unit tests.
    
    Author: uncleGen <[email protected]>
    
    Closes #16052 from uncleGen/SPARK-18617.
    
    (cherry picked from commit 56c82edabd62db9e936bb9afcf300faf8ef39362)
    Signed-off-by: Reynold Xin <[email protected]>

commit 7043c6b695f77741c5e97a322d9590bd714289de
Author: Sandeep Singh <[email protected]>
Date:   2016-11-30T09:33:15Z

    [SPARK-18366][PYSPARK][ML] Add handleInvalid to Pyspark for 
QuantileDiscretizer and Bucketizer
    
    ## What changes were proposed in this pull request?
    added the new handleInvalid param for these transformers to Python to 
maintain API parity.
    
    ## How was this patch tested?
    existing tests
    testing is done with new doctests
    
    Author: Sandeep Singh <[email protected]>
    
    Closes #15817 from techaddict/SPARK-18366.
    
    (cherry picked from commit fe854f2e4fb2fa1a1c501f11030e36f489ca546f)
    Signed-off-by: Nick Pentreath <[email protected]>

commit 05ba5eed71309e104feb1951aa8197e4336cdb2a
Author: Anthony Truchet <[email protected]>
Date:   2016-11-30T10:04:47Z

    [SPARK-18612][MLLIB] Delete broadcasted variable in LBFGS CostFun
    
    ## What changes were proposed in this pull request?
    
    Fix a broadcasted variable leak occurring at each invocation of CostFun in 
L-BFGS.
    
    ## How was this patch tested?
    
    Unit tests, plus a check that the fix resolves the fatal memory consumption 
seen in Criteo's use cases.
    
    This contribution is made on behalf of Criteo S.A.
    (http://labs.criteo.com/) under the terms of the Apache v2 License.
    
    Author: Anthony Truchet <[email protected]>
    
    Closes #16040 from AnthonyTruchet/SPARK-18612-lbfgs-cost-fun.
    
    (cherry picked from commit c5a64d760600ff430899e401751c41dc6b27cee6)
    Signed-off-by: Sean Owen <[email protected]>

commit 6e044ab9a9d417fb12d53f6327b90d9166c01f35
Author: gatorsmile <[email protected]>
Date:   2016-11-30T11:40:58Z

    [SPARK-17897][SQL] Fixed IsNotNull Constraint Inference Rule
    
    ### What changes were proposed in this pull request?
    The `constraints` of an operator are the expressions that evaluate to `true` 
for all the rows produced. That means the expression result should be neither 
`false` nor `unknown` (NULL). Thus, we can infer `IsNotNull` on all the 
constraints, whether they are generated by the operator's own predicates or 
propagated from its children. A constraint can be a complex expression. To make 
better use of these constraints, we try to push `IsNotNull` down to the 
lowest-level expressions (i.e., `Attribute`). `IsNotNull` can be pushed through 
an expression when that expression is null-intolerant. (When the input is NULL, 
a null-intolerant expression always evaluates to NULL.)
    
    Below is the existing code we have for `IsNotNull` pushdown.
    ```Scala
      private def scanNullIntolerantExpr(expr: Expression): Seq[Attribute] = 
expr match {
        case a: Attribute => Seq(a)
        case _: NullIntolerant | IsNotNull(_: NullIntolerant) =>
          expr.children.flatMap(scanNullIntolerantExpr)
        case _ => Seq.empty[Attribute]
      }
    ```
    
    **`IsNotNull` itself is not null-intolerant.** It converts `null` to 
`false`. If the expression does not include any `Not`-like expression, the 
existing code works; otherwise, it could generate a wrong result. This PR fixes 
the above function by removing `IsNotNull` from the inference. After the fix, 
when a constraint has an `IsNotNull` expression, we infer new 
attribute-specific `IsNotNull` constraints if and only if `IsNotNull` appears 
at the root.
    
    Without the fix, the following test case returns an empty result.
    ```Scala
    val data = Seq[java.lang.Integer](1, null).toDF("key")
    data.filter("not key is not null").show()
    ```
    Before the fix, the optimized plan is like
    ```
    == Optimized Logical Plan ==
    Project [value#1 AS key#3]
    +- Filter (isnotnull(value#1) && NOT isnotnull(value#1))
       +- LocalRelation [value#1]
    ```
    
    After the fix, the optimized plan is like
    ```
    == Optimized Logical Plan ==
    Project [value#1 AS key#3]
    +- Filter NOT isnotnull(value#1)
       +- LocalRelation [value#1]
    ```
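    To illustrate why `IsNotNull` is not null-intolerant, here is a minimal 
Python sketch of SQL three-valued logic, with `None` standing in for NULL. The 
helper names (`is_not_null`, `sql_not`, `sql_and`, `run_filter`) are made up 
for this sketch and are not Spark APIs.

```python
# Toy model of SQL three-valued logic; None plays the role of NULL.
# All helper names are illustrative assumptions, not Spark APIs.

def is_not_null(v):
    # Never returns NULL: it maps NULL to False, so it is not null-intolerant.
    return v is not None

def sql_not(b):
    # NOT NULL is NULL; otherwise ordinary negation.
    return None if b is None else (not b)

def sql_and(a, b):
    # FALSE dominates; otherwise NULL propagates.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def run_filter(rows, predicate):
    # A filter keeps a row only when the predicate is exactly TRUE.
    return [r for r in rows if predicate(r) is True]

rows = [1, None]

# Predicate from the test case: NOT (key IS NOT NULL)
fixed = run_filter(rows, lambda k: sql_not(is_not_null(k)))

# Same predicate with the wrongly inferred isnotnull(key) constraint added
buggy = run_filter(
    rows, lambda k: sql_and(is_not_null(k), sql_not(is_not_null(k)))
)

print(fixed)  # [None]
print(buggy)  # []
```

    The extra `isnotnull` conjunct makes the predicate unsatisfiable, which 
matches the empty result described above.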
    
    ### How was this patch tested?
    Added a test
    
    Author: gatorsmile <[email protected]>
    
    Closes #16067 from gatorsmile/isNotNull2.
    
    (cherry picked from commit 2eb093decb5e87a1ea71bbaa28092876a8c84996)
    Signed-off-by: Wenchen Fan <[email protected]>

commit 3de93fb480ce316e9b35a025dd350123084c3565
Author: Wenchen Fan <[email protected]>
Date:   2016-11-30T17:47:30Z

    [SPARK-18220][SQL] read Hive orc table with varchar column should not fail
    
    ## What changes were proposed in this pull request?
    
    Spark SQL only has `StringType`, so when reading a Hive table with a varchar 
column, we read that column as `StringType`. However, we still need to use the 
varchar `ObjectInspector` to read the varchar column in the Hive table, which 
means we need to know the actual column type on the Hive side.
    
    In Spark 2.1, after https://github.com/apache/spark/pull/14363 , we parse 
the Hive type string to a Catalyst type, which means the actual column type on 
the Hive side is erased. We may then use the string `ObjectInspector` to read a 
varchar column and fail.
    
    This PR keeps the original Hive column type string in the metadata of 
`StructField`, and uses it when we convert the field back to a Hive column.
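    The idea of carrying the raw Hive type in field metadata can be sketched 
in Python as follows. This is an illustration only: the `StructField` class, 
the `HIVE_TYPE_KEY` metadata key, and both helper functions are hypothetical 
and do not mirror Spark's actual classes or key names.

```python
from dataclasses import dataclass, field

# Hypothetical metadata key; the real key name is not given in this message.
HIVE_TYPE_KEY = "HIVE_TYPE_STRING"

@dataclass
class StructField:
    name: str
    data_type: str
    metadata: dict = field(default_factory=dict)

def from_hive_column(name, hive_type):
    # varchar(n)/char(n) collapse to "string" on the Spark side, but the
    # raw Hive type string is remembered in the field metadata.
    base = hive_type.split("(")[0].strip().lower()
    catalyst = "string" if base in ("varchar", "char", "string") else hive_type
    return StructField(name, catalyst, {HIVE_TYPE_KEY: hive_type})

def to_hive_column(f):
    # Prefer the original Hive type when it was recorded.
    return (f.name, f.metadata.get(HIVE_TYPE_KEY, f.data_type))

f = from_hive_column("c", "varchar(10)")
print(f.data_type)        # string
print(to_hive_column(f))  # ('c', 'varchar(10)')
```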
    
    ## How was this patch tested?
    
    newly added regression test
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #16060 from cloud-fan/varchar.
    
    (cherry picked from commit 3f03c90a807872d47588f3c3920769b8978033bf)
    Signed-off-by: Reynold Xin <[email protected]>

commit eae85da388e27c7eda8be3933f673ad7f1a3c6af
Author: manishAtGit <[email protected]>
Date:   2016-11-30T19:46:50Z

    [SPARK][EXAMPLE] Added missing semicolon in quick-start-guide example
    
    ## What changes were proposed in this pull request?
    
    Added missing semicolon in quick-start-guide java example code which wasn't 
compiling before.
    
    ## How was this patch tested?
    Locally by running and generating site for docs. You can see the last line 
contains ";" in the below snapshot.
    
![image](https://cloud.githubusercontent.com/assets/10628224/20751760/9a7e0402-b723-11e6-9aa8-3b6ca2d92ebf.png)
    
    Author: manishAtGit <[email protected]>
    
    Closes #16081 from manishatGit/fixed-quick-start-guide.
    
    (cherry picked from commit bc95ea0be5b880673d452f5eec47fbfd403d94ce)
    Signed-off-by: Andrew Or <[email protected]>

commit 7c0e2962d5e0fb80e4472d29dd467477f1cbcf8a
Author: Josh Rosen <[email protected]>
Date:   2016-11-30T19:47:41Z

    [SPARK-18640] Add synchronization to TaskScheduler.runningTasksByExecutors
    
    ## What changes were proposed in this pull request?
    
    The method `TaskSchedulerImpl.runningTasksByExecutors()` accesses the 
mutable `executorIdToRunningTaskIds` map without proper synchronization. In 
addition, as markhamstra pointed out in #15986, the signature's use of 
parentheses is a little odd given that this is a pure getter method.
    
    This patch fixes both issues.
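    As a rough illustration of the pattern (not the Scala change itself), the 
Python sketch below guards a mutable map with a lock and exposes the getter 
as a property that returns a snapshot. The class `TaskSchedulerSketch` and its 
method names are assumptions made for this example.

```python
import threading

class TaskSchedulerSketch:
    """Toy stand-in for the scheduler; not Spark's TaskSchedulerImpl."""

    def __init__(self):
        self._lock = threading.Lock()
        self._executor_id_to_running_task_ids = {}

    def task_started(self, executor_id, task_id):
        with self._lock:
            ids = self._executor_id_to_running_task_ids.setdefault(executor_id, set())
            ids.add(task_id)

    @property
    def running_tasks_by_executors(self):
        # Pure getter: no parentheses at the call site. It takes the same
        # lock as the writers and returns a snapshot, not the live map.
        with self._lock:
            return {e: len(t) for e, t in self._executor_id_to_running_task_ids.items()}

sched = TaskSchedulerSketch()

def worker(executor_id):
    for task_id in range(100):
        sched.task_started(executor_id, task_id)
        sched.running_tasks_by_executors  # concurrent read while others write

threads = [threading.Thread(target=worker, args=(f"exec-{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

snapshot = sched.running_tasks_by_executors
print(snapshot)
```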
    
    ## How was this patch tested?
    
    Covered by existing tests.
    
    Author: Josh Rosen <[email protected]>
    
    Closes #16073 from JoshRosen/runningTasksByExecutors-thread-safety.
    
    (cherry picked from commit c51c7725944d60738e2bac3e11f6aea74812905c)
    Signed-off-by: Andrew Or <[email protected]>

commit f542df3107e6161f90a7394a36ab95932a0b3425
Author: Yanbo Liang <[email protected]>
Date:   2016-11-30T21:21:05Z

    [SPARK-18318][ML] ML, Graph 2.1 QA: API: New Scala APIs, docs
    
    ## What changes were proposed in this pull request?
    API review for 2.1, except `LSH`-related classes, which are still under 
development.
    
    ## How was this patch tested?
    Only doc changes, no new tests.
    
    Author: Yanbo Liang <[email protected]>
    
    Closes #16009 from yanboliang/spark-18318.
    
    (cherry picked from commit 60022bfd65e4637efc0eb5f4cc0112289c783147)
    Signed-off-by: Joseph K. Bradley <[email protected]>

commit 9e96ac5a986c53ca1689e3d1f1365cc5107b5d88
Author: Wenchen Fan <[email protected]>
Date:   2016-11-30T21:36:17Z

    [SPARK-18251][SQL] the type of Dataset can't be Option of non-flat type
    
    ## What changes were proposed in this pull request?
    
    For an input object of non-flat type, we can't encode it to a row if it's 
null, as Spark SQL doesn't allow the entire row to be null; only its columns 
can be null. That's the reason we forbid users to use top-level null objects in 
https://github.com/apache/spark/pull/13469
    
    However, if users wrap a non-flat type with `Option`, then we may still 
encode a top-level null object to a row, which is not allowed.
    
    This PR fixes this case, and suggests that users wrap their type with 
`Tuple1` if they do want top-level null objects.
    
    ## How was this patch tested?
    
    new test
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #15979 from cloud-fan/option.
    
    (cherry picked from commit f135b70fd590438bebb2a54012a6f73074219758)
    Signed-off-by: Cheng Lian <[email protected]>

commit c2c2fdcb71e9bc82f0e88567148d1bae283f256a
Author: Marcelo Vanzin <[email protected]>
Date:   2016-11-30T22:10:32Z

    [SPARK-18546][CORE] Fix merging shuffle spills when using encryption.
    
    The problem exists because it's not possible to just concatenate encrypted
    partition data from different spill files; currently each partition would
    have its own initial vector to set up encryption, and the final merged file
    should contain a single initial vector for each merged partition, otherwise
    iterating over each record becomes really hard.
    
    To fix that, UnsafeShuffleWriter now decrypts the partitions when merging,
    so that the merged file contains a single initial vector at the start of
    the partition data.
    
    Because it's not possible to do that using the fast transferTo path, when
    encryption is enabled UnsafeShuffleWriter will revert back to using file
    streams when merging. It may be possible to use a hybrid approach when
    using encryption, using an intermediate direct buffer when reading from
    files and encrypting the data, but that's better left for a separate patch.
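    
    The problem and the fix can be illustrated with a toy Python sketch. The
    cipher below is a SHA-256 based keystream used only for demonstration (it
    is not real cryptography and not what Spark uses), and all function names
    are assumptions made for this example.

```python
import hashlib

IV_LEN = 16

def _keystream(key, iv, n):
    # Toy counter-mode keystream built from SHA-256; NOT real cryptography.
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + iv + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])

def encrypt(key, iv, plaintext):
    # The initial vector is written in front of the encrypted bytes.
    ks = _keystream(key, iv, len(plaintext))
    return iv + bytes(a ^ b for a, b in zip(plaintext, ks))

def decrypt(key, blob):
    iv, body = blob[:IV_LEN], blob[IV_LEN:]
    ks = _keystream(key, iv, len(body))
    return bytes(a ^ b for a, b in zip(body, ks))

def merge_partition(key, spilled_blobs, new_iv):
    # Decrypt each spill's copy of the partition, then re-encrypt once so the
    # merged data starts with a single initial vector.
    plaintext = b"".join(decrypt(key, blob) for blob in spilled_blobs)
    return encrypt(key, new_iv, plaintext)

key = b"k" * 16
expected = b"records-from-spill-1;records-from-spill-2;"
spill_a = encrypt(key, b"A" * IV_LEN, b"records-from-spill-1;")
spill_b = encrypt(key, b"B" * IV_LEN, b"records-from-spill-2;")

naive = spill_a + spill_b  # plain concatenation of encrypted partition data
merged = merge_partition(key, [spill_a, spill_b], b"C" * IV_LEN)

print(decrypt(key, merged) == expected)  # True
print(decrypt(key, naive) == expected)   # False
```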
    
    As part of the change I made DiskBlockObjectWriter take a SerializerManager
    instead of a "wrap stream" closure, since that makes it easier to test the
    code without having to mock SerializerManager functionality.
    
    Tested with newly added unit tests (UnsafeShuffleWriterSuite for the write
    side and ExternalAppendOnlyMapSuite for integration), and by running some
    apps that failed without the fix.
    
    Author: Marcelo Vanzin <[email protected]>
    
    Closes #15982 from vanzin/SPARK-18546.
    
    (cherry picked from commit 93e9d880bf8a144112d74a6897af4e36fcfa5807)
    Signed-off-by: Marcelo Vanzin <[email protected]>

commit 6e2e987bd8d4f4417b6fd6ff15dc2f38e9c7e661
Author: Shixiong Zhu <[email protected]>
Date:   2016-12-01T00:18:53Z

    [SPARK-18655][SS] Ignore Structured Streaming 2.0.2 logs in history server
    
    ## What changes were proposed in this pull request?
    
    As `queryStatus` in StreamingQueryListener events was removed in #15954, 
parsing 2.0.2 structured streaming logs will throw the following error:
    
    ```
    [info]   com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: 
Unrecognized field "queryStatus" (class 
org.apache.spark.sql.streaming.StreamingQueryListener$QueryTerminatedEvent), 
not marked as ignorable (2 known properties: "id", "exception"])
    [info]  at [Source: 
{"Event":"org.apache.spark.sql.streaming.StreamingQueryListener$QueryTerminatedEvent","queryStatus":{"name":"query-1","id":1,"timestamp":1480491532753,"inputRate":0.0,"processingRate":0.0,"latency":null,"sourceStatuses":[{"description":"FileStreamSource[file:/Users/zsx/stream]","offsetDesc":"#0","inputRate":0.0,"processingRate":0.0,"triggerDetails":{"latency.getOffset.source":"1","triggerId":"1"}}],"sinkStatus":{"description":"FileSink[/Users/zsx/stream2]","offsetDesc":"[#0]"},"triggerDetails":{}},"exception":null};
 line: 1, column: 521] (through reference chain: 
org.apache.spark.sql.streaming.QueryTerminatedEvent["queryStatus"])
    [info]   at 
com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51)
    [info]   at 
com.fasterxml.jackson.databind.DeserializationContext.reportUnknownProperty(DeserializationContext.java:839)
    [info]   at 
com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:1045)
    [info]   at 
com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1352)
    [info]   at 
com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperties(BeanDeserializerBase.java:1306)
    [info]   at 
com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:453)
    [info]   at 
com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1099)
    ...
    ```
    
    This PR just ignores such errors and adds a test to make sure we can read 
2.0.2 logs.
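    
    The "ignore such errors" behavior can be sketched in Python as a replay 
loop that skips events it cannot deserialize. The schema table, the exception 
class, and the function names below are hypothetical; they only mimic the 
strict-deserialization failure shown in the stack trace.

```python
import json

# Hypothetical schema: the fields each event type accepts in the newer version.
KNOWN_FIELDS = {
    "QueryTerminatedEvent": {"Event", "id", "exception"},
}

class UnrecognizedProperty(Exception):
    pass

def parse_event(line):
    # Strict parsing: any field outside the known set is an error.
    obj = json.loads(line)
    allowed = KNOWN_FIELDS.get(obj.get("Event"))
    if allowed is None:
        raise UnrecognizedProperty(f"unknown event type: {obj.get('Event')}")
    unknown = set(obj) - allowed
    if unknown:
        raise UnrecognizedProperty(f"unrecognized fields: {sorted(unknown)}")
    return obj

def replay(lines):
    events, skipped = [], 0
    for line in lines:
        try:
            events.append(parse_event(line))
        except UnrecognizedProperty:
            skipped += 1  # ignore events written by an incompatible version
    return events, skipped

log = [
    '{"Event": "QueryTerminatedEvent", "queryStatus": {"name": "query-1"}, "exception": null}',
    '{"Event": "QueryTerminatedEvent", "id": 1, "exception": null}',
]
events, skipped = replay(log)
print(len(events), skipped)  # 1 1
```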
    
    ## How was this patch tested?
    
    `query-event-logs-version-2.0.2.txt` has all types of events generated by 
Structured Streaming in Spark 2.0.2. `testQuietly("ReplayListenerBus should 
ignore broken event jsons generated in 2.0.2")` verified we can load them 
without any error.
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #16085 from zsxwing/SPARK-18655.
    
    (cherry picked from commit c4979f6ea8ed44fd87ded3133efa6df39d4842c3)
    Signed-off-by: Shixiong Zhu <[email protected]>

commit 7d4596734b6ebd021adc32ff87aa859bc2eeb976
Author: Shixiong Zhu <[email protected]>
Date:   2016-12-01T01:41:43Z

    [SPARK-18617][SPARK-18560][TEST] Fix flaky test: StreamingContextSuite. 
Receiver data should be deserialized properly
    
    ## What changes were proposed in this pull request?
    
    Fixed the potential SparkContext leak in `StreamingContextSuite.SPARK-18560 
Receiver data should be deserialized properly` which was added in #16052. I 
also removed FakeByteArrayReceiver and used TestReceiver directly.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #16091 from zsxwing/SPARK-18617-follow-up.
    
    (cherry picked from commit 0a811210f809eb5b80eae14694d484d45b48b3f6)
    Signed-off-by: Reynold Xin <[email protected]>

commit e8d8e350998e6e44a6dee7f78dbe2d1aa997c1d6
Author: [email protected] <[email protected]>
Date:   2016-12-01T04:32:17Z

    [SPARK-18476][SPARKR][ML] SparkR Logistic Regression should support 
output original label.
    
    ## What changes were proposed in this pull request?
    
    Similar to SPARK-18401, as a classification algorithm, logistic regression 
should support outputting the original label instead of the indexed label.
    
    In this PR, original label output is supported, and test cases are modified 
and added. The documentation is also updated.
    
    ## How was this patch tested?
    
    Unit tests.
    
    Author: [email protected] <[email protected]>
    
    Closes #15910 from wangmiao1981/audit.
    
    (cherry picked from commit 2eb6764fbb23553fc17772d8a4a1cad55ff7ba6e)
    Signed-off-by: Yanbo Liang <[email protected]>

commit 9dc3ef6e11b7dd3fd916d1442733938dcb5750e3
Author: Eric Liang <[email protected]>
Date:   2016-12-01T08:48:10Z

    [SPARK-18635][SQL] Partition name/values not escaped correctly in some cases
    
    ## What changes were proposed in this pull request?
    
    Due to confusion between URI vs paths, in certain cases we escape partition 
values too many times, which causes some Hive client operations to fail or 
write data to the wrong location. This PR fixes at least some of these cases.
    
    To my understanding this is how values, filesystem paths, and URIs interact.
    - Hive stores raw (unescaped) partition values that are returned to you 
directly when you call listPartitions.
    - Internally, we convert these raw values to filesystem paths via 
`ExternalCatalogUtils.[un]escapePathName`.
    - In some circumstances we store URIs instead of filesystem paths. When a 
path is converted to a URI via `path.toURI`, the escaped partition values are 
further URI-encoded. This means that to get a path back from a URI, you must 
call `new Path(new URI(uriTxt))` in order to decode the URI-encoded string.
    - In `CatalogStorageFormat` we store URIs as strings. This makes it easy to 
forget to URI-decode the value before converting it into a path.
    - Finally, the Hive client itself uses mostly Paths for representing 
locations, and only URIs occasionally.
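    
    The interaction described in the list above can be sketched in Python 
with `urllib.parse`. The functions here are rough stand-ins (for example, 
`escape_path_name` escapes more characters than 
`ExternalCatalogUtils.escapePathName` may), and the warehouse path is an 
invented example.

```python
from urllib.parse import quote, unquote

def escape_path_name(value):
    # Rough stand-in for ExternalCatalogUtils.escapePathName; the real
    # method escapes a specific character set, this one escapes broadly.
    return quote(value, safe="")

def unescape_path_name(name):
    return unquote(name)

def path_to_uri_string(path):
    # URI-encoding escapes '%' itself, so escaped values get encoded twice.
    return "file:" + quote(path, safe="/=")

def uri_string_to_path(uri_txt):
    # Decode exactly once to get the filesystem path back.
    return unquote(uri_txt[len("file:"):])

raw_value = "a=b c"
escaped = escape_path_name(raw_value)
path = "/warehouse/t/part=" + escaped
uri_txt = path_to_uri_string(path)

# The mistake the message warns about: using the URI text as if it were a path.
forgot_to_decode = uri_txt[len("file:"):]

print(path)
print(uri_txt)
print(uri_string_to_path(uri_txt) == path)  # True
print(forgot_to_decode == path)             # False
```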
    
    In the future we should probably clean this up, perhaps by dropping use of 
URIs when unnecessary. We should also try fixing escaping for partition names 
as well as values, though names are unlikely to contain special characters.
    
    cc mallman cloud-fan yhuai
    
    ## How was this patch tested?
    
    Unit tests.
    
    Author: Eric Liang <[email protected]>
    
    Closes #16071 from ericl/spark-18635.
    
    (cherry picked from commit 88f559f20a5208f2386b874eb119f1cba2c748c7)
    Signed-off-by: Wenchen Fan <[email protected]>

commit 8579ab5d7092a65f044fd925ecd5b790305f0aef
Author: Liang-Chi Hsieh <[email protected]>
Date:   2016-12-01T09:57:58Z

    [SPARK-18666][WEB UI] Remove the codes checking deprecated config 
spark.sql.unsafe.enabled
    
    ## What changes were proposed in this pull request?
    
    `spark.sql.unsafe.enabled` has been deprecated since 1.6. There is still 
code in the UI that checks it. We should remove that code and clean up.
    
    ## How was this patch tested?
    
    Changes to related existing unit test.
    
    Author: Liang-Chi Hsieh <[email protected]>
    
    Closes #16095 from viirya/remove-deprecated-config-code.
    
    (cherry picked from commit dbf842b7a8479f9566146192ffc04421591742d5)
    Signed-off-by: Reynold Xin <[email protected]>

commit cbbe217777173b100de2f5a613c46428974826f6
Author: Yuming Wang <[email protected]>
Date:   2016-12-01T13:14:09Z

    [SPARK-18645][DEPLOY] Fix spark-daemon.sh arguments error lead to throws 
Unrecognized option
    
    ## What changes were proposed in this pull request?
    
    spark-daemon.sh loses the single quotes around arguments after #15338, as 
follows:
    ```
    execute_command nice -n 0 bash 
/opt/cloudera/parcels/SPARK-2.1.0-cdh5.4.3.d20161129-21.04.38/lib/spark/bin/spark-submit
 --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift 
JDBC/ODBC Server --conf spark.driver.extraJavaOptions=-XX:+UseG1GC 
-XX:-HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp
    ```
    With this fix, the command becomes:
    ```
    execute_command nice -n 0 bash 
/opt/cloudera/parcels/SPARK-2.1.0-cdh5.4.3.d20161129-21.04.38/lib/spark/bin/spark-submit
 --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name 
'Thrift JDBC/ODBC Server' --conf 'spark.driver.extraJavaOptions=-XX:+UseG1GC 
-XX:-HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp'
    ```
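    
    The effect of the lost quotes can be reproduced with Python's `shlex` 
module, used here as a model of POSIX shell word splitting. This is only an 
illustration of the quoting problem, not the change made to spark-daemon.sh.

```python
import shlex

args = [
    "--name", "Thrift JDBC/ODBC Server",
    "--conf",
    "spark.driver.extraJavaOptions=-XX:+UseG1GC "
    "-XX:-HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp",
]

unquoted = " ".join(args)  # quotes lost: arguments run together
quoted = shlex.join(args)  # each argument is shell-quoted where needed

# Re-splitting shows whether the original argument boundaries survive.
print(shlex.split(unquoted) == args)  # False
print(shlex.split(quoted) == args)    # True
```

    In the unquoted form, `-XX:-HeapDumpOnOutOfMemoryError` becomes a separate 
word, which is the kind of stray argument that gets reported as an 
unrecognized option.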
    
    ## How was this patch tested?
    
    - Manual tests
    - Build the package and start-thriftserver.sh with `--conf 
'spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:-HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp'`
    
    Author: Yuming Wang <[email protected]>
    
    Closes #16079 from wangyum/SPARK-18645.
    
    (cherry picked from commit 2ab8551e79e1655c406c358b21c0a1e719f498be)
    Signed-off-by: Sean Owen <[email protected]>

commit 6916ddc385fc33fa390e541300ca2bb1dbd0599c
Author: Wenchen Fan <[email protected]>
Date:   2016-12-01T19:53:12Z

    [SPARK-18674][SQL] improve the error message of using join
    
    ## What changes were proposed in this pull request?
    
    The current error message of USING join is quite confusing, for example:
    ```
    scala> val df1 = List(1,2,3).toDS.withColumnRenamed("value", "c1")
    df1: org.apache.spark.sql.DataFrame = [c1: int]
    
    scala> val df2 = List(1,2,3).toDS.withColumnRenamed("value", "c2")
    df2: org.apache.spark.sql.DataFrame = [c2: int]
    
    scala> df1.join(df2, usingColumn = "c1")
    org.apache.spark.sql.AnalysisException: using columns ['c1] can not be 
resolved given input columns: [c1, c2] ;;
    'Join UsingJoin(Inner,List('c1))
    :- Project [value#1 AS c1#3]
    :  +- LocalRelation [value#1]
    +- Project [value#7 AS c2#9]
       +- LocalRelation [value#7]
    ```
    
    after this PR, it becomes:
    ```
    scala> val df1 = List(1,2,3).toDS.withColumnRenamed("value", "c1")
    df1: org.apache.spark.sql.DataFrame = [c1: int]
    
    scala> val df2 = List(1,2,3).toDS.withColumnRenamed("value", "c2")
    df2: org.apache.spark.sql.DataFrame = [c2: int]
    
    scala> df1.join(df2, usingColumn = "c1")
    org.apache.spark.sql.AnalysisException: USING column `c1` can not be 
resolved with the right join side, the right output is: [c2];
    ```
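    
    A check of this shape could be sketched in Python as follows. The function 
`check_using_columns` and the exception class are hypothetical; only the 
message wording is taken from the output above.

```python
class AnalysisException(Exception):
    pass

def check_using_columns(using, left_output, right_output):
    # Report which join side fails to resolve each USING column.
    for col in using:
        for side, output in (("left", left_output), ("right", right_output)):
            if col not in output:
                raise AnalysisException(
                    f"USING column `{col}` can not be resolved with the {side} "
                    f"join side, the {side} output is: [{', '.join(output)}]"
                )

try:
    check_using_columns(["c1"], ["c1"], ["c2"])
    message = None
except AnalysisException as e:
    message = str(e)

print(message)
```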
    
    ## How was this patch tested?
    
    updated tests
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #16100 from cloud-fan/natural.
    
    (cherry picked from commit e6534847100670a22b3b191a0f9d924fab7f3c02)
    Signed-off-by: Herman van Hovell <[email protected]>

commit 4c673c656d52d29813979e942851b9205e4ace06
Author: Sandeep Singh <[email protected]>
Date:   2016-12-01T21:22:40Z

    [SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper
    
    ## What changes were proposed in this pull request?
    In `JavaWrapper`'s destructor, make the Java Gateway dereference the 
underlying object, using `SparkContext._active_spark_context._gateway.detach`.
    Fix the parameter copying bug by moving the `copy` method from `JavaModel` 
to `JavaParams`.
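    
    The destructor pattern can be sketched without a JVM as follows. 
`FakeGateway` and `JavaWrapperSketch` are hypothetical stand-ins written for 
this example; only the idea of calling `detach` from `__del__` comes from the 
description above.

```python
import gc

class FakeGateway:
    """Hypothetical stand-in for the Py4J gateway; records detached objects."""

    def __init__(self):
        self.attached = set()
        self.detached = []

    def new_object(self, name):
        self.attached.add(name)
        return name

    def detach(self, java_obj):
        self.attached.discard(java_obj)
        self.detached.append(java_obj)

gateway = FakeGateway()

class JavaWrapperSketch:
    def __init__(self, name):
        self._java_obj = gateway.new_object(name)

    def __del__(self):
        # Release the JVM-side reference when the Python wrapper is collected.
        if self._java_obj is not None:
            gateway.detach(self._java_obj)

for i in range(3):
    # Rebinding the name drops the previous wrapper, which triggers __del__.
    indexer = JavaWrapperSketch(f"StringIndexer-{i}")

del indexer
gc.collect()
print(sorted(gateway.detached), gateway.attached)
```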
    
    ## How was this patch tested?
    ```python
    import random, string
    from pyspark.ml.feature import StringIndexer
    
    l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) 
for _ in range(int(7e5))]  # 700000 random strings of 10 characters
    df = spark.createDataFrame(l, ['string'])
    
    for i in range(50):
        indexer = StringIndexer(inputCol='string', outputCol='index')
        indexer.fit(df)
    ```
    * Before: would keep a strong reference to each StringIndexer, causing GC 
issues, and the run halted midway
    * After: garbage collection works as the object is dereferenced, and 
computation completes
    * Mem footprint tested using profiler
    * Added a parameter copy related test which was failing before.
    
    Author: Sandeep Singh <[email protected]>
    Author: jkbradley <[email protected]>
    
    Closes #15843 from techaddict/SPARK-18274.
    
    (cherry picked from commit 78bb7f8071379114314c394e0167c4c5fd8545c5)
    Signed-off-by: Joseph K. Bradley <[email protected]>

commit 4746674ad3acfc38bbd3e2708d75280c19ef0202
Author: Shixiong Zhu <[email protected]>
Date:   2016-12-01T22:22:49Z

    [SPARK-18617][SPARK-18560][TESTS] Fix flaky test: StreamingContextSuite. 
Receiver data should be deserialized properly
    
    ## What changes were proposed in this pull request?
    
    Avoid creating multiple threads to stop StreamingContext. Otherwise, the 
latch added in #16091 can be passed too early.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #16105 from zsxwing/SPARK-18617-2.
    
    (cherry picked from commit 086b0c8f6788b205bc630d5ccf078f77b9751af3)
    Signed-off-by: Shixiong Zhu <[email protected]>

commit 2d2e80180f3b746df9e45a49bc62da31a37dadb8
Author: Reynold Xin <[email protected]>
Date:   2016-12-02T01:58:28Z

    [SPARK-18639] Build only a single pip package
    
    ## What changes were proposed in this pull request?
    We currently build 5 separate pip binary tarballs, doubling the release 
script runtime. It'd be better to build one, especially for use cases that are 
just using Spark locally. In the long run, it would make more sense to have 
Hadoop support be pluggable.
    
    ## How was this patch tested?
    N/A - this is a release build script that doesn't have any automated test 
coverage. We will know if it goes wrong when we prepare releases.
    
    Author: Reynold Xin <[email protected]>
    
    Closes #16072 from rxin/SPARK-18639.
    
    (cherry picked from commit 37e52f8793bff306a7ae5a9aecc16f28333b70e3)
    Signed-off-by: Reynold Xin <[email protected]>

commit 2f91b0154ee0674b65e80f81f6498b94666c4b46
Author: sureshthalamati <[email protected]>
Date:   2016-12-02T03:13:38Z

    [SPARK-18141][SQL] Fix to quote column names in the predicate clause  of 
the JDBC RDD generated sql statement
    
    ## What changes were proposed in this pull request?
    
    The SQL query generated for the JDBC data source does not quote columns in 
the predicate clause. When the source table has quoted column names, the Spark 
JDBC read incorrectly fails with a column-not-found error.
    
    Error:
    org.h2.jdbc.JdbcSQLException: Column "ID" not found;
    Source SQL statement:
    SELECT "Name","Id" FROM TEST."mixedCaseCols" WHERE (Id < 1)
    
    This PR fixes this by quoting column names in the generated SQL for the 
predicate clause when filters are pushed down to the data source.
    
    Source SQL statement after the fix:
    SELECT "Name","Id" FROM TEST."mixedCaseCols" WHERE ("Id" < 1)
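    
    A minimal Python sketch of quoting identifiers in both the select list and 
the pushed-down predicate is shown below. The helper names are hypothetical, 
and ANSI double-quote quoting is assumed; real JDBC dialects may use other 
quote characters.

```python
def quote_identifier(name):
    # ANSI-style double-quote quoting (assumed); embedded quotes are doubled.
    return '"' + name.replace('"', '""') + '"'

def compile_filter(column, op, value):
    # Hypothetical helper: render one pushed-down comparison.
    return f"{quote_identifier(column)} {op} {value}"

def build_query(columns, table, filters):
    select_list = ",".join(quote_identifier(c) for c in columns)
    where = " AND ".join(f"({compile_filter(*f)})" for f in filters)
    sql = f"SELECT {select_list} FROM {table}"
    return f"{sql} WHERE {where}" if where else sql

sql = build_query(["Name", "Id"], 'TEST."mixedCaseCols"', [("Id", "<", 1)])
print(sql)  # SELECT "Name","Id" FROM TEST."mixedCaseCols" WHERE ("Id" < 1)
```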
    
    ## How was this patch tested?
    
    Added new test case to the JdbcSuite
    
    Author: sureshthalamati <[email protected]>
    
    Closes #15662 from sureshthalamati/filter_quoted_cols-SPARK-18141.
    
    (cherry picked from commit 70c5549ee9588228d18a7b405c977cf591e2efd4)
    Signed-off-by: gatorsmile <[email protected]>

commit b9eb10043129defa53c5bdfd1190fe68c0107b3b
Author: gatorsmile <[email protected]>
Date:   2016-12-02T03:15:26Z

    [SPARK-18538][SQL][BACKPORT-2.1] Fix Concurrent Table Fetching Using 
DataFrameReader JDBC APIs
    
    ### What changes were proposed in this pull request?
    
    #### This PR is to backport https://github.com/apache/spark/pull/15975 to 
Branch 2.1
    
    ---
    
    The following two `DataFrameReader` JDBC APIs ignore the user-specified 
parameters of parallelism degree.
    
    ```Scala
      def jdbc(
          url: String,
          table: String,
          columnName: String,
          lowerBound: Long,
          upperBound: Long,
          numPartitions: Int,
          connectionProperties: Properties): DataFrame
    ```
    
    ```Scala
      def jdbc(
          url: String,
          table: String,
          predicates: Array[String],
          connectionProperties: Properties): DataFrame
    ```
    
    This PR fixes these issues. To verify correct behavior, we improve the plan 
output of the `EXPLAIN` command by adding `numPartitions` to the 
`JDBCRelation` node.
    
    Before the fix,
    ```
    == Physical Plan ==
    *Scan JDBCRelation(TEST.PEOPLE) [NAME#1896,THEID#1897] ReadSchema: 
struct<NAME:string,THEID:int>
    ```
    
    After the fix,
    ```
    == Physical Plan ==
    *Scan JDBCRelation(TEST.PEOPLE) [numPartitions=3] [NAME#1896,THEID#1897] 
ReadSchema: struct<NAME:string,THEID:int>
    ```
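    
    For context, one way `lowerBound`, `upperBound`, and `numPartitions` can 
translate into per-partition WHERE clauses is sketched below in Python. This 
is a simplified illustration under the assumption of evenly sized strides; 
the function `column_partition` is hypothetical and Spark's edge-case handling 
may differ.

```python
def column_partition(column, lower, upper, num_partitions):
    # Simplified range partitioning: one WHERE clause per partition.
    if num_partitions <= 1 or lower >= upper:
        return [None]  # single partition, no WHERE clause
    stride = upper // num_partitions - lower // num_partitions
    clauses = []
    current = lower
    for i in range(num_partitions):
        lower_clause = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper_clause = f"{column} < {current}" if i != num_partitions - 1 else None
        if lower_clause is None:
            # First partition also picks up NULLs and values below the bound.
            clauses.append(f"{upper_clause} OR {column} IS NULL")
        elif upper_clause is None:
            # Last partition is open-ended above.
            clauses.append(lower_clause)
        else:
            clauses.append(f"{lower_clause} AND {upper_clause}")
    return clauses

parts = column_partition("THEID", 0, 9, 3)
for p in parts:
    print(p)
```

    With `numPartitions=3` this yields three clauses, one per concurrent fetch.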
    ### How was this patch tested?
    Added verification logic to all the test cases for JDBC concurrent 
fetching.
    
    Author: gatorsmile <[email protected]>
    
    Closes #16111 from gatorsmile/jdbcFix2.1.

commit fce1be6cc81b1fe3991a4df91128f4fcd14ff615
Author: Kazuaki Ishizaki <[email protected]>
Date:   2016-12-02T04:30:13Z

    [SPARK-18284][SQL] Make ExpressionEncoder.serializer.nullable precise
    
    ## What changes were proposed in this pull request?
    
    This PR makes `ExpressionEncoder.serializer.nullable` `false` for the flat 
encoder of a primitive type. It is currently `true`, which is too conservative.
    While `ExpressionEncoder.schema` has the correct information (e.g. 
`<IntegerType, false>`), `serializer.head.nullable` of the `ExpressionEncoder` 
obtained from `encoderFor[T]` is always `true`.
    
    This is accomplished by checking whether a type is one of the primitive 
types. If it is, `nullable` is set to `false`.
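    
    The rule can be stated as a small Python sketch. The type names in 
`PRIMITIVE_TYPES` and the function name are assumptions made for 
illustration, not Spark's implementation.

```python
# Assumed set of primitive type names for this sketch.
PRIMITIVE_TYPES = {"boolean", "byte", "short", "int", "long", "float", "double"}

def serializer_nullable(type_name):
    # A primitive value can never be null, so its serializer is not nullable.
    return type_name not in PRIMITIVE_TYPES

print(serializer_nullable("int"))               # False
print(serializer_nullable("java.lang.Integer")) # True
```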
    
    ## How was this patch tested?
    
    Added new tests for encoder and dataframe
    
    Author: Kazuaki Ishizaki <[email protected]>
    
    Closes #15780 from kiszk/SPARK-18284.
    
    (cherry picked from commit 38b9e69623c14a675b14639e8291f5d29d2a0bc3)
    Signed-off-by: Wenchen Fan <[email protected]>

commit 0f0903d17b9c71a569d92f2c35e2caeb1eb8c89f
Author: Wenchen Fan <[email protected]>
Date:   2016-12-02T04:54:12Z

    [SPARK-18647][SQL] do not put provider in table properties for Hive serde 
table
    
    ## What changes were proposed in this pull request?
    
    In Spark 2.1, we make Hive serde tables case-preserving by putting the 
table metadata in table properties, as we did for data source tables. 
However, we should not put the table provider there, as it breaks forward 
compatibility. For example, if we create a Hive serde table with Spark 2.1, using 
`sql("create table test stored as parquet as select 1")`, we will fail to read 
it with Spark 2.0, as Spark 2.0 mistakenly treats it as a data source table 
because there is a `provider` entry in table properties.
    
    Logically, a Hive serde table's provider is always hive, so we don't need to 
store it in table properties. This PR removes it.
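    
    The compatibility issue can be modeled in Python as below. The property 
key names and both functions are assumptions for this sketch; the only 
behavior taken from the description is that an older reader treats any table 
with a provider entry as a data source table.

```python
# Assumed property key names, used only for this illustration.
PROVIDER_KEY = "spark.sql.sources.provider"
SCHEMA_KEY = "spark.sql.sources.schema"

def table_properties_for(provider, schema_json):
    props = {SCHEMA_KEY: schema_json}
    # Hive serde tables always have provider "hive", so it is not stored.
    if provider != "hive":
        props[PROVIDER_KEY] = provider
    return props

def old_reader_sees_data_source_table(props):
    # Models the older check: a provider entry means "data source table".
    return PROVIDER_KEY in props

hive_props = table_properties_for("hive", '{"fields": []}')
parquet_props = table_properties_for("parquet", '{"fields": []}')

print(old_reader_sees_data_source_table(hive_props))     # False
print(old_reader_sees_data_source_table(parquet_props))  # True
```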
    
    ## How was this patch tested?
    
    manually test the forward compatibility issue.
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #16080 from cloud-fan/hive.
    
    (cherry picked from commit a5f02b00291e0a22429a3dca81f12cf6d38fea0b)
    Signed-off-by: Wenchen Fan <[email protected]>

commit a7f8ebb8629706c54c286b7aca658838e718e804
Author: Cheng Lian <[email protected]>
Date:   2016-12-02T06:02:45Z

    [SPARK-17213][SQL] Disable Parquet filter push-down for string and binary 
columns due to PARQUET-686
    
    This PR targets both master and branch-2.1.
    
    ## What changes were proposed in this pull request?
    
    Due to PARQUET-686, Parquet doesn't do string comparison correctly while 
doing filter push-down for string columns. This PR disables filter push-down 
for both string and binary columns to work around this issue. Binary columns 
are also affected because some Parquet data models (like Hive) may store string 
columns as a plain Parquet `binary` instead of a `binary (UTF8)`.
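
    The rule can be pictured with a small stand-alone sketch. This is not the
    Parquet or Spark filter code: `PushDownSketch`, its `ColumnType` hierarchy,
    and `canPushDown` are hypothetical names used only for illustration.

    ```scala
    object PushDownSketch {
      // Toy column types standing in for a real schema model.
      sealed trait ColumnType
      case object IntType extends ColumnType
      case object StringType extends ColumnType
      case object BinaryType extends ColumnType

      // String and binary columns are excluded from push-down;
      // every other type in this toy model stays eligible.
      def canPushDown(t: ColumnType): Boolean = t match {
        case StringType | BinaryType => false
        case _                       => true
      }
    }
    ```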
    
    ## How was this patch tested?
    
    New test case added in `ParquetFilterSuite`.
    
    Author: Cheng Lian <[email protected]>
    
    Closes #16106 from liancheng/spark-17213-bad-string-ppd.
    
    (cherry picked from commit ca6391637212814b7c0bd14c434a6737da17b258)
    Signed-off-by: Reynold Xin <[email protected]>

commit 65e896a6e9a5378f2d3a02c0c2a57fdb8d8f1d9d
Author: Eric Liang <[email protected]>
Date:   2016-12-02T12:59:39Z

    [SPARK-18679][SQL] Fix regression in file listing performance for 
non-catalog tables
    
    ## What changes were proposed in this pull request?
    
    In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed 
to InMemoryFileIndex). This introduced a regression where parallelism could 
only be introduced at the very top of the tree. However, in many cases (e.g. 
`spark.read.parquet(topLevelDir)`), the top of the tree is only a single 
directory.
    
    This PR simplifies and fixes the parallel recursive listing code to allow 
parallelism to be introduced at any level during recursive descent (though note 
that once we decide to list a sub-tree in parallel, the sub-tree is listed in 
serial on executors).
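
    A rough model of this descent, using a toy in-memory tree and Scala futures
    in place of a file system and Spark jobs. `ListingSketch`, `Dir`, and the
    `threshold` parameter are assumptions made for the sketch, not names from
    the patch.

    ```scala
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    object ListingSketch {
      // Toy directory tree standing in for a real file system.
      final case class Dir(path: String, files: Seq[String], subdirs: Seq[Dir])

      // Plain serial recursion: what one task does once it owns a sub-tree.
      def listSerial(dir: Dir): Seq[String] =
        dir.files.map(f => s"${dir.path}/$f") ++ dir.subdirs.flatMap(listSerial)

      // Recursive descent that may switch to parallel listing at any level.
      // `threshold` is a hypothetical knob, not a real Spark setting.
      def list(dir: Dir, threshold: Int): Seq[String] = {
        val own = dir.files.map(f => s"${dir.path}/$f")
        val children =
          if (dir.subdirs.size >= threshold) {
            // Fan out: each sub-tree is listed serially by one task.
            val tasks = dir.subdirs.map(d => Future(listSerial(d)))
            Await.result(Future.sequence(tasks), 30.seconds).flatten
          } else {
            // Too few children to fan out here; keep descending.
            dir.subdirs.flatMap(d => list(d, threshold))
          }
        own ++ children
      }
    }
    ```

    With a single top-level directory that has many partition sub-directories,
    the fan-out happens one level down instead of being limited to the root.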
    
    cc mallman  cloud-fan
    
    ## How was this patch tested?
    
    Checked metrics in unit tests.
    
    Author: Eric Liang <[email protected]>
    
    Closes #16112 from ericl/spark-18679.
    
    (cherry picked from commit 294163ee9319e4f7f6da1259839eb3c80bba25c2)
    Signed-off-by: Wenchen Fan <[email protected]>

commit 415730e19cea3a0e7ea5491bf801a22859bbab66
Author: Dongjoon Hyun <[email protected]>
Date:   2016-12-02T13:48:22Z

    [SPARK-18419][SQL] `JDBCRelation.insert` should not remove Spark options
    
    ## What changes were proposed in this pull request?
    
    Currently, `JDBCRelation.insert` removes Spark options too early by 
mistakenly using `asConnectionProperties`. Spark options like `numPartitions` 
should be passed into `DataFrameWriter.jdbc` correctly. This bug has been 
**hidden** because `JDBCOptions.asConnectionProperties` fails to filter out the 
mixed-case options. This PR aims to fix both.
    
    **JDBCRelation.insert**
    ```scala
    override def insert(data: DataFrame, overwrite: Boolean): Unit = {
      val url = jdbcOptions.url
      val table = jdbcOptions.table
    - val properties = jdbcOptions.asConnectionProperties
    + val properties = jdbcOptions.asProperties
      data.write
        .mode(if (overwrite) SaveMode.Overwrite else SaveMode.Append)
        .jdbc(url, table, properties)
    ```
    
    **JDBCOptions.asConnectionProperties**
    ```scala
    scala> import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
    scala> import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
    scala> new JDBCOptions(Map("url" -> "jdbc:mysql://localhost:3306/temp", 
"dbtable" -> "t1", "numPartitions" -> "10")).asConnectionProperties
    res0: java.util.Properties = {numpartitions=10}
    scala> new JDBCOptions(new CaseInsensitiveMap(Map("url" -> 
"jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> 
"10"))).asConnectionProperties
    res1: java.util.Properties = {numpartitions=10}
    ```
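
    The filtering that `asConnectionProperties` is expected to perform can be
    sketched in plain Scala. This is an illustration, not Spark's
    implementation: `JdbcOptionsSketch`, `sparkOptionNames`, and the sample
    `user` key are assumptions, and the real list of Spark-side option names
    lives in `JDBCOptions`.

    ```scala
    import java.util.Properties

    object JdbcOptionsSketch {
      // Hypothetical subset of Spark-side option names.
      val sparkOptionNames: Set[String] = Set("url", "dbtable", "numPartitions")

      // Lower-cased copy so "numpartitions" is filtered like "numPartitions".
      private val lowered: Set[String] = sparkOptionNames.map(_.toLowerCase)

      // Keep only the entries that are not Spark options, whatever their case.
      def connectionProperties(options: Map[String, String]): Properties = {
        val props = new Properties()
        options.foreach { case (k, v) =>
          if (!lowered.contains(k.toLowerCase)) props.setProperty(k, v)
        }
        props
      }
    }
    ```

    Comparing lower-cased keys on both sides means the result does not depend
    on whether the caller wrapped the options in a case-insensitive map.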
    
    ## How was this patch tested?
    
    Passes Jenkins with a new test case.
    
    Author: Dongjoon Hyun <[email protected]>
    
    Closes #15863 from dongjoon-hyun/SPARK-18419.
    
    (cherry picked from commit 55d528f2ba0ba689dbb881616d9436dc7958e943)
    Signed-off-by: Wenchen Fan <[email protected]>

commit e374b2426114d841e1935719f6e21919475f6804
Author: Eric Liang <[email protected]>
Date:   2016-12-02T13:59:02Z

    [SPARK-18659][SQL] Incorrect behaviors in overwrite table for datasource 
tables
    
    ## What changes were proposed in this pull request?
    
    Two bugs are addressed here:
    1. INSERT OVERWRITE TABLE sometimes crashed when catalog partition 
management was enabled. This was because when dropping partitions after an 
overwrite operation, the Hive client will attempt to delete the partition 
files. If the entire partition directory was dropped, this would fail. The PR 
fixes this by adding a flag to control whether the Hive client should attempt 
to delete files.
    2. The static partition spec for OVERWRITE TABLE was not correctly resolved 
to the case-sensitive original partition names. This resulted in the entire 
table being overwritten if you did not correctly capitalize your partition 
names.
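
    The second fix amounts to resolving user-supplied partition keys against
    the table's declared partition columns. A minimal sketch, assuming a
    hypothetical `normalize` helper rather than the code in the patch:

    ```scala
    object PartitionSpecSketch {
      // Map each user-supplied key onto the table's declared partition column,
      // matching case-insensitively; unknown keys are rejected.
      def normalize(
          spec: Map[String, String],
          partitionColumns: Seq[String]): Map[String, String] =
        spec.map { case (key, value) =>
          val resolved = partitionColumns
            .find(_.equalsIgnoreCase(key))
            .getOrElse(throw new IllegalArgumentException(
              s"$key is not a partition column"))
          resolved -> value
        }
    }
    ```

    With the key resolved to its original spelling, a spec written as
    `partcol=1` selects the same partition as `partCol=1` instead of matching
    nothing.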
    
    cc yhuai cloud-fan
    
    ## How was this patch tested?
    
    Unit tests. Surprisingly, the existing overwrite table tests did not catch 
these edge cases.
    
    Author: Eric Liang <[email protected]>
    
    Closes #16088 from ericl/spark-18659.
    
    (cherry picked from commit 7935c8470c5c162ef7213e394fe8588e5dd42ca2)
    Signed-off-by: Wenchen Fan <[email protected]>

commit 32c85383bfd6210e96b4bbcdedbe27a88935e4c7
Author: gatorsmile <[email protected]>
Date:   2016-12-02T14:12:19Z

    [SPARK-18674][SQL][FOLLOW-UP] improve the error message of using join
    
    ### What changes were proposed in this pull request?
    Added a test case for using joins with nested fields.
    
    ### How was this patch tested?
    N/A
    
    Author: gatorsmile <[email protected]>
    
    Closes #16110 from gatorsmile/followup-18674.
    
    (cherry picked from commit 2f8776ccad532fbed17381ff97d302007918b8d8)
    Signed-off-by: Wenchen Fan <[email protected]>

commit c69825a98989ee975dc8b87979e29e0fff15a3f7
Author: Ryan Blue <[email protected]>
Date:   2016-12-02T16:41:40Z

    [SPARK-18677] Fix parsing ['key'] in JSON path expressions.
    
    ## What changes were proposed in this pull request?
    
    This fixes the parser rule to match named expressions, which doesn't work 
for two reasons:
    1. The name match is not coerced to a regular expression (missing .r)
    2. The surrounding literals are incorrect and attempt to escape a single 
quote, which is unnecessary
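
    Both points can be illustrated with a plain regular expression. This is a
    simplified sketch, not the Spark JSON path parser: `JsonPathSketch`,
    `named`, and `parseNamed` are hypothetical names, and the pattern is an
    assumption chosen for the example.

    ```scala
    import scala.util.matching.Regex

    object JsonPathSketch {
      // `.r` turns the string into a Regex; without it the value is only a
      // literal string. The single quotes need no escaping in the pattern.
      val named: Regex = """\['([^']+)'\]""".r

      // Extract the key from a bracketed, single-quoted name such as ['key'].
      def parseNamed(path: String): Option[String] = path match {
        case named(key) => Some(key)
        case _          => None
      }
    }
    ```

    `parseNamed("['key']")` yields `Some("key")`, and a quoted name with a
    space, such as `['a b']`, yields `Some("a b")`.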
    
    ## How was this patch tested?
    
    This adds test cases for named expressions using the bracket syntax, 
including one with quoted spaces.
    
    Author: Ryan Blue <[email protected]>
    
    Closes #16107 from rdblue/SPARK-18677-fix-json-path.
    
    (cherry picked from commit 48778976e0566d9c93a8c900825def82c6b81fd6)
    Signed-off-by: Herman van Hovell <[email protected]>

----


