GitHub user rekhajoshm opened a pull request:
https://github.com/apache/spark/pull/11837
[SPARK-14006] [SparkR] R style changes
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14006
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rekhajoshm/spark brnach-1.6
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11837.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11837
----
commit 941b270b706d3b4aea73dbf102cfb6eee0beff63
Author: Dongjoon Hyun <[email protected]>
Date: 2016-03-03T22:42:12Z
[MINOR] Fix typos in comments and testcase name of code
## What changes were proposed in this pull request?
This PR fixes typos in code comments and in a test case name.
## How was this patch tested?
manual.
Author: Dongjoon Hyun <[email protected]>
Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
commit 3edcc40223c8af12f64c2286420dc77817ab770e
Author: Andrew Or <[email protected]>
Date: 2016-03-03T23:24:38Z
[SPARK-13632][SQL] Move commands.scala to command package
## What changes were proposed in this pull request?
This patch simply moves things to a new package in an effort to reduce the
size of the diff in #11048. Currently the new package only has one file, but in
the future we'll add many new commands in SPARK-13139.
## How was this patch tested?
Jenkins.
Author: Andrew Or <[email protected]>
Closes #11482 from andrewor14/commands-package.
commit ad0de99f3d3167990d501297f1df069fe15e0678
Author: Shixiong Zhu <[email protected]>
Date: 2016-03-03T23:41:56Z
[SPARK-13584][SQL][TESTS] Make ContinuousQueryManagerSuite not output logs
to the console
## What changes were proposed in this pull request?
Make ContinuousQueryManagerSuite not output logs to the console. The logs
will still output to `unit-tests.log`.
I also updated `SQLListenerMemoryLeakSuite` to use `quietly` instead of
changing the log level, since changing the level would keep logs out of
`unit-tests.log`.
## How was this patch tested?
Just check Jenkins output.
Author: Shixiong Zhu <[email protected]>
Closes #11439 from zsxwing/quietly-ContinuousQueryManagerSuite.
commit b373a888621ba6f0dd499f47093d4e2e42086dfc
Author: Davies Liu <[email protected]>
Date: 2016-03-04T01:36:48Z
[SPARK-13415][SQL] Visualize subquery in SQL web UI
## What changes were proposed in this pull request?
This PR adds visualization of subqueries to the SQL web UI and also improves
the explain output for subqueries, especially when they are used together with
whole-stage codegen.
For example:
```python
>>> sqlContext.range(100).registerTempTable("range")
>>> sqlContext.sql("select id / (select sum(id) from range) from range
where id > (select id from range limit 1)").explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias(('id / subquery#9), None)]
: +- 'SubqueryAlias subquery#9
: +- 'Project [unresolvedalias('sum('id), None)]
: +- 'UnresolvedRelation `range`, None
+- 'Filter ('id > subquery#8)
: +- 'SubqueryAlias subquery#8
: +- 'GlobalLimit 1
: +- 'LocalLimit 1
: +- 'Project [unresolvedalias('id, None)]
: +- 'UnresolvedRelation `range`, None
+- 'UnresolvedRelation `range`, None
== Analyzed Logical Plan ==
(id / scalarsubquery()): double
Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id /
scalarsubquery())#11]
: +- SubqueryAlias subquery#9
: +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS
sum(id)#10L]
: +- SubqueryAlias range
: +- Range 0, 100, 1, 4, [id#0L]
+- Filter (id#0L > subquery#8)
: +- SubqueryAlias subquery#8
: +- GlobalLimit 1
: +- LocalLimit 1
: +- Project [id#0L]
: +- SubqueryAlias range
: +- Range 0, 100, 1, 4, [id#0L]
+- SubqueryAlias range
+- Range 0, 100, 1, 4, [id#0L]
== Optimized Logical Plan ==
Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id /
scalarsubquery())#11]
: +- SubqueryAlias subquery#9
: +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS
sum(id)#10L]
: +- Range 0, 100, 1, 4, [id#0L]
+- Filter (id#0L > subquery#8)
: +- SubqueryAlias subquery#8
: +- GlobalLimit 1
: +- LocalLimit 1
: +- Project [id#0L]
: +- Range 0, 100, 1, 4, [id#0L]
+- Range 0, 100, 1, 4, [id#0L]
== Physical Plan ==
WholeStageCodegen
: +- Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id
/ scalarsubquery())#11]
: : +- Subquery subquery#9
: : +- WholeStageCodegen
: : : +- TungstenAggregate(key=[],
functions=[(sum(id#0L),mode=Final,isDistinct=false)], output=[sum(id)#10L])
: : : +- INPUT
: : +- Exchange SinglePartition, None
: : +- WholeStageCodegen
: : : +- TungstenAggregate(key=[],
functions=[(sum(id#0L),mode=Partial,isDistinct=false)], output=[sum#14L])
: : : +- Range 0, 1, 4, 100, [id#0L]
: +- Filter (id#0L > subquery#8)
: : +- Subquery subquery#8
: : +- CollectLimit 1
: : +- WholeStageCodegen
: : : +- Project [id#0L]
: : : +- Range 0, 1, 4, 100, [id#0L]
: +- Range 0, 1, 4, 100, [id#0L]
```
The web UI looks like: [screenshot omitted]
This PR also changes the tree structure of WholeStageCodegen to make it
consistent with other operators. Before this change, both WholeStageCodegen and
InputAdapter held references to the same plans, which could be updated without
notifying the other, causing the problems discovered in #11403.
## How was this patch tested?
Existing tests, also manual tests with the example query, check the explain
and web UI.
Author: Davies Liu <[email protected]>
Closes #11417 from davies/viz_subquery.
commit d062587dd2c4ed13998ee8bcc9d08f29734df228
Author: Davies Liu <[email protected]>
Date: 2016-03-04T01:46:28Z
[SPARK-13601] [TESTS] use 1 partition in tests to avoid race conditions
## What changes were proposed in this pull request?
Fix race conditions when cleaning up files.
## How was this patch tested?
Existing tests.
Author: Davies Liu <[email protected]>
Closes #11507 from davies/flaky.
commit 15d57f9c23145ace37d1631d8f9c19675c142214
Author: Wenchen Fan <[email protected]>
Date: 2016-03-04T04:16:37Z
[SPARK-13647] [SQL] also check if numeric value is within allowed range in
_verify_type
## What changes were proposed in this pull request?
This PR makes `_verify_type` in `types.py` stricter: it also checks whether a
numeric value is within the allowed range for its type.
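For illustration, a minimal sketch of the kind of range check involved (a
hypothetical helper, not the actual `types.py` code):
```python
# Hypothetical sketch of a numeric range check in the spirit of _verify_type.
# The bounds are the standard ranges of Spark SQL's fixed-width integer types.
_NUMERIC_RANGES = {
    "ByteType": (-128, 127),
    "ShortType": (-32768, 32767),
    "IntegerType": (-2147483648, 2147483647),
    "LongType": (-9223372036854775808, 9223372036854775807),
}

def _check_numeric_range(value, type_name):
    """Raise ValueError if value falls outside the allowed range for the type."""
    lo, hi = _NUMERIC_RANGES[type_name]
    if not (lo <= value <= hi):
        raise ValueError("object of %s out of range: %r" % (type_name, value))

_check_numeric_range(100, "ByteType")     # passes
# _check_numeric_range(1000, "ByteType")  # would raise ValueError
```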
## How was this patch tested?
newly added doc test.
Author: Wenchen Fan <[email protected]>
Closes #11492 from cloud-fan/py-verify.
commit f6ac7c30d48e666618466e825578fa457e2a0ed4
Author: thomastechs <[email protected]>
Date: 2016-03-04T04:35:40Z
[SPARK-12941][SQL][MASTER] Spark-SQL JDBC Oracle dialect fails to map
string datatypes to Oracle VARCHAR datatype mapping
## What changes were proposed in this pull request?
A test suite was added for the SPARK-12941 bug fix, covering the mapping of
StringType to the corresponding Oracle datatype.
## How was this patch tested?
manual tests done
Author: thomastechs <[email protected]>
Author: THOMAS SEBASTIAN <[email protected]>
Closes #11489 from thomastechs/thomastechs-12941-master-new.
commit 465c665db1dc65e3b02c584cf7f8d06b24909b0c
Author: Shixiong Zhu <[email protected]>
Date: 2016-03-04T06:53:07Z
[SPARK-13652][CORE] Copy ByteBuffer in sendRpcSync as it will be recycled
## What changes were proposed in this pull request?
`sendRpcSync` should copy the response content because the underlying
buffer will be recycled and reused.
## How was this patch tested?
Jenkins unit tests.
Author: Shixiong Zhu <[email protected]>
Closes #11499 from zsxwing/SPARK-13652.
commit dd83c209f1692a2e5afb72fa7a2d039fd1e682c8
Author: Davies Liu <[email protected]>
Date: 2016-03-04T08:18:15Z
[SPARK-13603][SQL] support SQL generation for subquery
## What changes were proposed in this pull request?
This adds support for SQL generation for subquery expressions, which are
recursively replaced with a `SubqueryHolder` inside `SQLBuilder`.
## How was this patch tested?
Added unit tests.
Author: Davies Liu <[email protected]>
Closes #11453 from davies/sql_subquery.
commit 27e88faa058c1364d0e99fffc0c5cb64ef817bd3
Author: Abou Haydar Elias <[email protected]>
Date: 2016-03-04T10:01:52Z
[SPARK-13646][MLLIB] QuantileDiscretizer counts dataset twice in get…
## What changes were proposed in this pull request?
It avoids counting the DataFrame twice.
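The general idea, sketched in PySpark purely for illustration (the actual
MLlib change is on the Scala side):
```python
# Hypothetical sketch: trigger the expensive count action once and reuse the
# result, instead of counting the DataFrame twice.
def choose_sample_fraction(df, required_samples):
    total = df.count()                       # single pass over the data
    fraction = min(1.0, float(required_samples) / max(total, 1))
    return fraction, total
```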
Author: Abou Haydar Elias <[email protected]>
Author: Elie A <[email protected]>
Closes #11491 from eliasah/quantile-discretizer-patch.
commit c04dc27cedd3d75781fda4c24da16b6ada44d3e4
Author: Holden Karau <[email protected]>
Date: 2016-03-04T10:56:58Z
[SPARK-13398][STREAMING] Move away from thread pool task support to forkjoin
## What changes were proposed in this pull request?
Remove the old deprecated ThreadPoolExecutor and replace it with an
ExecutionContext backed by a ForkJoinPool. The downside is that Scala's
ForkJoinPool doesn't give us a way to specify the thread pool name (and is a
wrapper of Java's in 2.12) except by providing a custom factory. Note that we
can't use Java's ForkJoinPool directly in Scala 2.11 since it uses an
ExecutionContext which reports system parallelism. One other implicit change is
that the old ExecutionContext would have reported a different default
parallelism, since it used system parallelism rather than thread pool
parallelism (this was likely not intended but also likely not a huge
difference).
The previous version of this PR attempted to use an execution context
constructed on the ThreadPool (but not the deprecated ThreadPoolExecutor class)
so as to keep human-readable thread names, but this reported system
parallelism.
## How was this patch tested?
unit tests: streaming/testOnly org.apache.spark.streaming.util.*
Author: Holden Karau <[email protected]>
Closes #11423 from
holdenk/SPARK-13398-move-away-from-ThreadPoolTaskSupport-java-forkjoin.
commit 204b02b56afe358b7f2d403fb6e2b9e8a7122798
Author: Rajesh Balamohan <[email protected]>
Date: 2016-03-04T10:59:40Z
[SPARK-12925] Improve HiveInspectors.unwrap for StringObjectInspector.…
The earlier fix did not copy the bytes, and it is possible for a higher level
to reuse the Text object, which was causing issues. The proposed fix copies the
bytes from Text while still avoiding the expensive encoding/decoding.
Author: Rajesh Balamohan <[email protected]>
Closes #11477 from rajeshbalamohan/SPARK-12925.2.
commit e617508244b508b59b4debb35cad3258cddbb9cf
Author: Masayoshi TSUZUKI <[email protected]>
Date: 2016-03-04T13:53:53Z
[SPARK-13673][WINDOWS] Fixed not to pollute environment variables.
## What changes were proposed in this pull request?
This patch fixes the problem that `bin\beeline.cmd` pollutes environment
variables.
A similar problem was reported and fixed in
https://issues.apache.org/jira/browse/SPARK-3943, but `bin\beeline.cmd` seems
to have been added later.
## How was this patch tested?
manual tests:
I executed the new `bin\beeline.cmd` and confirmed that %SPARK_HOME%
doesn't remain in the command prompt.
Author: Masayoshi TSUZUKI <[email protected]>
Closes #11516 from tsudukim/feature/SPARK-13673.
commit c8f25459ed4ad6b51a5f11665364cfe0b84f7b3c
Author: Dongjoon Hyun <[email protected]>
Date: 2016-03-04T16:25:41Z
[SPARK-13676] Fix mismatched default values for regParam in
LogisticRegression
## What changes were proposed in this pull request?
The default value of regularization parameter for `LogisticRegression`
algorithm is different in Scala and Python. We should provide the same value.
**Scala**
```
scala> new
org.apache.spark.ml.classification.LogisticRegression().getRegParam
res0: Double = 0.0
```
**Python**
```
>>> from pyspark.ml.classification import LogisticRegression
>>> LogisticRegression().getRegParam()
0.1
```
## How was this patch tested?
manual. Check the following in `pyspark`.
```
>>> from pyspark.ml.classification import LogisticRegression
>>> LogisticRegression().getRegParam()
0.0
```
Author: Dongjoon Hyun <[email protected]>
Closes #11519 from dongjoon-hyun/SPARK-13676.
commit 83302c3bff13bd7734426c81d9c83bf4beb211c9
Author: Xusen Yin <[email protected]>
Date: 2016-03-04T16:32:24Z
[SPARK-13036][SPARK-13318][SPARK-13319] Add save/load for feature.py
Add save/load for feature.py. Meanwhile, add save/load for
`ElementwiseProduct` on the Scala side and fix a bug where `setDefault` was
missing in `VectorSlicer` and `StopWordsRemover`.
In this PR I ignore `RFormula` and `RFormulaModel` because their Scala
implementation is pending in https://github.com/apache/spark/pull/9884. I'll
add them in this PR if https://github.com/apache/spark/pull/9884 gets merged
first, or file a follow-up JIRA for `RFormula`.
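As a rough illustration of what the Python-side persistence enables (a sketch
against a recent PySpark; API details at the time of this PR may differ):
```python
# Hypothetical usage sketch of save/load for a Python feature transformer.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover

spark = SparkSession.builder.master("local[1]").appName("persist-demo").getOrCreate()

remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.save("/tmp/stopwords-remover")                  # write params to a path
restored = StopWordsRemover.load("/tmp/stopwords-remover")
assert restored.getInputCol() == "raw"
```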
Author: Xusen Yin <[email protected]>
Closes #11203 from yinxusen/SPARK-13036.
commit b7d41474216787e9cd38c04a15c43d5d02f02f93
Author: Andrew Or <[email protected]>
Date: 2016-03-04T18:32:00Z
[SPARK-13633][SQL] Move things into catalyst.parser package
## What changes were proposed in this pull request?
This patch simply moves things to the existing package
`o.a.s.sql.catalyst.parser` in an effort to reduce the size of the diff in
#11048. This is conceptually the same as the recently merged patch #11482.
## How was this patch tested?
Jenkins.
Author: Andrew Or <[email protected]>
Closes #11506 from andrewor14/parser-package.
commit 5f42c28b119b79c0ea4910c478853d451cd1a967
Author: Alex Bozarth <[email protected]>
Date: 2016-03-04T23:04:09Z
[SPARK-13459][WEB UI] Separate Alive and Dead Executors in Executor Totals
Table
## What changes were proposed in this pull request?
Now that dead executors are shown in the executors table (#10058), the
totals table is updated to include separate totals for alive and dead
executors as well as the current total, as originally discussed in #10668.
## How was this patch tested?
Manually verified by running the Standalone Web UI in the latest Safari and
Firefox ESR
Author: Alex Bozarth <[email protected]>
Closes #11381 from ajbozarth/spark13459.
commit a6e2bd31f52f9e9452e52ab5b846de3dee8b98a7
Author: Nong Li <[email protected]>
Date: 2016-03-04T23:15:48Z
[SPARK-13255] [SQL] Update vectorized reader to directly return
ColumnarBatch instead of InternalRows.
## What changes were proposed in this pull request?
Currently, the parquet reader returns rows one by one, which is bad for
performance. This patch updates the reader to directly return ColumnarBatches.
This is only enabled with whole-stage codegen, which is currently the only
operator able to consume ColumnarBatches (instead of rows). The current
implementation is a bit of a hack to get this to work, and we should do more
refactoring of these low-level interfaces to make this work better.
## How was this patch tested?
```
Results:
TPCDS:            Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)
---------------------------------------------------------------
q55 (before)            8897 / 9265         12.9           77.2
q55                     5486 / 5753         21.0           47.6
```
Author: Nong Li <[email protected]>
Closes #11435 from nongli/spark-13255.
commit f19228eed89cf8e22a07a7ef7f37a5f6f8a3d455
Author: Jason White <[email protected]>
Date: 2016-03-05T00:04:56Z
[SPARK-12073][STREAMING] backpressure rate controller consumes events
preferentially from lagging partitions
I'm pretty sure this is the reason we couldn't easily recover from an
unbalanced Kafka partition under heavy load when using backpressure.
`maxMessagesPerPartition` calculates an appropriate limit for the message
rate from all partitions, and then divides by the number of partitions to
determine how many messages to retrieve per partition. The problem with this
approach is that when one partition is behind by millions of records (due to
random Kafka issues) and the rate estimator calculates that only 100k total
messages can be retrieved, each partition (out of, say, 32) retrieves at most
100k/32 = 3125 messages.
This PR (still needing a test) determines a per-partition desired message
count by using the current lag for each partition to preferentially weight the
total message limit among the partitions. In this situation, if each partition
gets 1k messages, but 1 partition starts 1M behind, then the total number of
messages to retrieve is (32 * 1k + 1M) = 1032000 messages, of which the one
partition needs 1001000. So, it gets (1001000 / 1032000) = 97% of the 100k
messages, and the other 31 partitions share the remaining 3%.
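The weighting described above can be sketched as follows (an illustration of
the arithmetic only, not the actual streaming code):
```python
# Hypothetical sketch of lag-weighted message allocation across partitions.
def allocate_messages(lags, total_limit):
    """Split total_limit across partitions in proportion to each partition's lag."""
    total_lag = sum(lags.values())
    if total_lag == 0:
        return {p: 0 for p in lags}
    return {p: int(total_limit * lag / total_lag) for p, lag in lags.items()}

# 31 partitions that are 1k behind, one that is 1M + 1k behind, 100k budget:
lags = {p: 1000 for p in range(31)}
lags[31] = 1001000
allocation = allocate_messages(lags, 100000)
print(allocation[31])   # ~97k messages go to the lagging partition
```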
Assuming all 100k messages are retrieved and processed within the batch
window, the rate calculator will increase the number of messages to retrieve
in the next batch until it reaches a new stable point or the backlog is
finished processing.
We're going to try deploying this internally at Shopify to see if this
resolves our issue.
tdas koeninger holdenk
Author: Jason White <[email protected]>
Closes #10089 from JasonMWhite/rate_controller_offsets.
commit adce5ee721c6a844ff21dfcd8515859458fe611d
Author: gatorsmile <[email protected]>
Date: 2016-03-05T11:25:03Z
[SPARK-12720][SQL] SQL Generation Support for Cube, Rollup, and Grouping
Sets
#### What changes were proposed in this pull request?
This PR is for supporting SQL generation for cube, rollup and grouping sets.
For example, a query using rollup:
```SQL
SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5
WITH ROLLUP
```
Original logical plan:
```
Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46],
[(count(1),mode=Complete,isDistinct=false) AS cnt#43L,
(key#17L % cast(5 as bigint))#47L AS _c1#45L,
grouping__id#46 AS _c2#44]
+- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0),
List(key#17L, value#18, null, 1)],
[key#17L,value#18,(key#17L % cast(5 as
bigint))#47L,grouping__id#46]
+- Project [key#17L,
value#18,
(key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as
bigint))#47L]
+- Subquery t1
+- Relation[key#17L,value#18] ParquetRelation
```
Converted SQL:
```SQL
SELECT count( 1) AS `cnt`,
(`t1`.`key` % CAST(5 AS BIGINT)),
grouping_id() AS `_c2`
FROM `default`.`t1`
GROUP BY (`t1`.`key` % CAST(5 AS BIGINT))
GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ())
```
#### How was this patch tested?
Added eight test cases in `LogicalPlanToSQLSuite`.
Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>
Closes #11283 from gatorsmile/groupingSetsToSQL.
commit 8290004d94760c22d6d3ca8dda3003ac8644422f
Author: Shixiong Zhu <[email protected]>
Date: 2016-03-05T23:26:27Z
[SPARK-13693][STREAMING][TESTS] Stop StreamingContext before deleting
checkpoint dir
## What changes were proposed in this pull request?
Stop the StreamingContext before deleting the checkpoint dir, to avoid the
race condition where deleting the checkpoint dir and writing a checkpoint
happen at the same time.
The flaky test log is here:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
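In test-cleanup terms the ordering looks roughly like this (a PySpark sketch
of the idea; the actual suite is Scala):
```python
# Hypothetical sketch: stop the streaming context before removing its
# checkpoint dir, so no checkpoint write can race with the deletion.
import shutil

def cleanup(ssc, checkpoint_dir):
    ssc.stop(stopSparkContext=False, stopGraceFully=True)  # no more checkpoint writes
    shutil.rmtree(checkpoint_dir, ignore_errors=True)      # now safe to delete
```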
## How was this patch tested?
unit tests
Author: Shixiong Zhu <[email protected]>
Closes #11531 from zsxwing/SPARK-13693.
commit 8ff88094daa4945e7d718baa7b20703fd8087ab0
Author: Cheng Lian <[email protected]>
Date: 2016-03-06T04:54:04Z
Revert "[SPARK-13616][SQL] Let SQLBuilder convert logical plan without a
project on top of it"
This reverts commit f87ce0504ea0697969ac3e67690c78697b76e94a.
According to the discussion in #11466, let's revert PR #11466 to be safe.
Author: Cheng Lian <[email protected]>
Closes #11539 from liancheng/revert-pr-11466.
commit ee913e6e2d58dfac20f3f06ff306081bd0e48066
Author: Shixiong Zhu <[email protected]>
Date: 2016-03-06T16:57:01Z
[SPARK-13697] [PYSPARK] Fix the missing module name of
TransformFunctionSerializer.loads
## What changes were proposed in this pull request?
Set the function's module name to `__main__` if it's missing in
`TransformFunctionSerializer.loads`.
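Conceptually the fix amounts to something like the following (a sketch, not
the exact `util.py` change):
```python
# Hypothetical sketch: after deserializing, fall back to "__main__" when the
# function's module name was lost during pickling.
def _restore_module(func):
    if getattr(func, "__module__", None) is None:
        func.__module__ = "__main__"
    return func
```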
## How was this patch tested?
Manually tested in the shell.
Before this patch:
```
>>> from pyspark.streaming import StreamingContext
>>> from pyspark.streaming.util import TransformFunction
>>> ssc = StreamingContext(sc, 1)
>>> func = TransformFunction(sc, lambda x: x, sc.serializer)
>>> func.rdd_wrapper(lambda x: x)
TransformFunction(<function <lambda> at 0x106ac8b18>)
>>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
>>> func2 = ssc._transformerSerializer.loads(bytes)
>>> print(func2.func.__module__)
None
>>> print(func2.rdd_wrap_func.__module__)
None
>>>
```
After this patch:
```
>>> from pyspark.streaming import StreamingContext
>>> from pyspark.streaming.util import TransformFunction
>>> ssc = StreamingContext(sc, 1)
>>> func = TransformFunction(sc, lambda x: x, sc.serializer)
>>> func.rdd_wrapper(lambda x: x)
TransformFunction(<function <lambda> at 0x108bf1b90>)
>>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
>>> func2 = ssc._transformerSerializer.loads(bytes)
>>> print(func2.func.__module__)
__main__
>>> print(func2.rdd_wrap_func.__module__)
__main__
>>>
```
Author: Shixiong Zhu <[email protected]>
Closes #11535 from zsxwing/loads-module.
commit bc7a3ec290904f2d8802583bb0557bca1b8b01ff
Author: Andrew Or <[email protected]>
Date: 2016-03-07T08:14:40Z
[SPARK-13685][SQL] Rename catalog.Catalog to ExternalCatalog
## What changes were proposed in this pull request?
Today we have `analysis.Catalog` and `catalog.Catalog`. In the future the
former will call the latter. When that happens, if both of them are still
called `Catalog` it will be very confusing. This patch renames the latter
`ExternalCatalog` because it is expected to talk to external systems.
## How was this patch tested?
Jenkins.
Author: Andrew Or <[email protected]>
Closes #11526 from andrewor14/rename-catalog.
commit 4b13896ebf7cecf9d50514a62165b612ee18124a
Author: rmishra <[email protected]>
Date: 2016-03-07T09:55:49Z
[SPARK-13705][DOCS] UpdateStateByKey Operation documentation incorrectly
refers to StatefulNetworkWordCount
## What changes were proposed in this pull request?
The reference to StatefulNetworkWordCount.scala in the updateStateByKey
documentation should be removed until there is an example for updateStateByKey.
## How was this patch tested?
Have tested the new documentation with jekyll build.
Author: rmishra <[email protected]>
Closes #11545 from rishitesh/SPARK-13705.
commit 03f57a6c2dd6ffd4038ca9cecbfc221deaf52393
Author: Yury Liavitski <[email protected]>
Date: 2016-03-07T10:54:33Z
Fixing the type of the sentiment happiness value
## What changes were proposed in this pull request?
Added a conversion to int for the 'happiness value' read from the file.
Otherwise, the multiplication on line 75 would multiply a string by a number,
yielding values like "-2-2" instead of -4.
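The pitfall is easy to reproduce; string repetition behaves the same way in
Python, shown here purely as an illustration:
```python
# Multiplying a string by an int repeats the string, so a raw "happiness"
# value must be converted to int before doing arithmetic with it.
happiness = "-2"            # value as read from the file
print(happiness * 2)        # "-2-2"  -- the buggy result
print(int(happiness) * 2)   # -4      -- the intended result
```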
## How was this patch tested?
Tested manually.
Author: Yury Liavitski <[email protected]>
Author: Yury Liavitski <[email protected]>
Closes #11540 from heliocentrist/fix-sentiment-value-type.
commit d7eac9d7951c19302ed41fe03eaa38394aeb9c1a
Author: Dilip Biswal <[email protected]>
Date: 2016-03-07T17:46:28Z
[SPARK-13651] Generator outputs are not resolved correctly resulting in run
time error
## What changes were proposed in this pull request?
```
Seq(("id1", "value1")).toDF("key", "value").registerTempTable("src")
sqlContext.sql("SELECT t1.* FROM src LATERAL VIEW explode(map('key1', 100,
'key2', 200)) t1 AS key, value")
```
This results in the following logical plan:
```
Project [key#2,value#3]
+- Generate
explode(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMap(key1,100,key2,200)),
true, false, Some(genoutput), [key#2,value#3]
+- SubqueryAlias src
+- Project [_1#0 AS key#2,_2#1 AS value#3]
+- LocalRelation [_1#0,_2#1], [[id1,value1]]
```
The above query fails with the following runtime error:
```
java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
  at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
  at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:221)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:42)
  at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:98)
  at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:96)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
  at scala.collection.Iterator$class.foreach(Iterator.scala:742)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
  <stack-trace omitted.....>
```
In this case the generator outputs are wrongly resolved from the child
(LocalRelation) due to
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L537-L548
## How was this patch tested?
Added unit tests in hive/SQLQuerySuite and AnalysisSuite
Author: Dilip Biswal <[email protected]>
Closes #11497 from dilipbiswal/spark-13651.
commit 489641117651d11806d2773b7ded7c163d0260e5
Author: Wenchen Fan <[email protected]>
Date: 2016-03-07T18:32:34Z
[SPARK-13694][SQL] QueryPlan.expressions should always include all
expressions
## What changes were proposed in this pull request?
It's weird that `expressions` doesn't always contain all the expressions of a
plan. This PR marks `QueryPlan.expressions` final to forbid subclasses from
overriding it to exclude some expressions. Currently only `Generate` overrides
it; we can use `producedAttributes` to fix the unresolved attribute problem
for it.
Note that this PR doesn't fix the problem in #11497
## How was this patch tested?
existing tests.
Author: Wenchen Fan <[email protected]>
Closes #11532 from cloud-fan/generate.
commit ef77003178eb5cdcb4fe519fc540917656c5d577
Author: Sameer Agarwal <[email protected]>
Date: 2016-03-07T20:04:59Z
[SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joins
based on their data constraints
## What changes were proposed in this pull request?
This PR adds an optimizer rule to eliminate reading (unnecessary) NULL
values, when they are not required for correctness, by inserting `isNotNull`
filters in the query plan. These filters are currently inserted beneath
existing `Filter` and `Join` operators and are inferred based on their data
constraints.
Note: While this optimization is applicable to all types of join, it
primarily benefits `Inner` and `LeftSemi` joins.
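As a rough PySpark-level illustration of the effect (a sketch; the PR itself
adds a Catalyst optimizer rule, and `left`/`right` are assumed existing
DataFrames with a `key` column):
```python
# For an inner equi-join, rows with a NULL join key can never match, so
# filtering them out below the join is semantically equivalent and avoids
# reading NULL values.
from pyspark.sql.functions import col

def join_with_null_filters(left, right, key="key"):
    return (left.filter(col(key).isNotNull())
                .join(right.filter(col(key).isNotNull()), key))
```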
## How was this patch tested?
1. Added a new `NullFilteringSuite` that tests for `IsNotNull` filters in
the query plan for joins and filters. Also, tests interaction with the
`CombineFilters` optimizer rules.
2. Test generated ExpressionTrees via `OrcFilterSuite`
3. Test filter source pushdown logic via `SimpleTextHadoopFsRelationSuite`
cc yhuai nongli
Author: Sameer Agarwal <[email protected]>
Closes #11372 from sameeragarwal/gen-isnotnull.
commit e72914f37de85519fc2aa131bac69d7582de98c8
Author: Dongjoon Hyun <[email protected]>
Date: 2016-03-07T20:06:46Z
[SPARK-12243][BUILD][PYTHON] PySpark tests are slow in Jenkins.
## What changes were proposed in this pull request?
In the Jenkins pull request builder, PySpark tests take around [962 seconds](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52530/console)
of end-to-end time to run, despite the fact that we run four Python test
suites in parallel. According to the log, the basic reason is that the
long-running tests start at the end due to the FIFO queue. We first try to
reduce the test time by starting some long-running tests first, using a simple
priority queue.
```
========================================================================
Running PySpark tests
========================================================================
...
Finished test(python3.4): pyspark.streaming.tests (213s)
Finished test(pypy): pyspark.sql.tests (92s)
Finished test(pypy): pyspark.streaming.tests (280s)
Tests passed in 962 seconds
```
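A minimal sketch of the scheduling idea (hypothetical; the durations are
illustrative and the real `run-tests.py` details differ):
```python
# Hypothetical sketch: launch the longest-running suites first by popping from
# a max-heap keyed on (estimated) historical duration.
import heapq

estimated_seconds = {
    "pyspark.streaming.tests": 280,
    "pyspark.tests": 150,
    "pyspark.sql.tests": 92,
}

heap = [(-duration, name) for name, duration in estimated_seconds.items()]
heapq.heapify(heap)
while heap:
    _, suite = heapq.heappop(heap)
    print("launching", suite)   # longest suites get launched first
```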
## How was this patch tested?
Manual check.
Check 'Running PySpark tests' part of the Jenkins log.
Author: Dongjoon Hyun <[email protected]>
Closes #11551 from dongjoon-hyun/SPARK-12243.
----