GitHub user feynmanliang reopened a pull request:
https://github.com/apache/spark/pull/7307
[SPARK-8536][MLlib] Generalize OnlineLDAOptimizer to asymmetric
document-topic Dirichlet priors
Modify `LDA` to take asymmetric document-topic prior distributions and
`OnlineLDAOptimizer` to use the asymmetric prior during variational inference.
This PR only generalizes `OnlineLDAOptimizer` and the associated
`LocalLDAModel`; `EMLDAOptimizer` and `DistributedLDAModel` still only support
symmetric `alpha` (checked during `EMLDAOptimizer.initialize`).
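For illustration, a minimal sketch of how the generalized optimizer might be configured, assuming this change lets `setDocConcentration` accept a per-topic `Vector` rather than a single symmetric value:
```scala
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vectors

// Sketch: an asymmetric document-topic prior for k = 3 topics.
// Assumes setDocConcentration takes a Vector after this PR.
val lda = new LDA()
  .setK(3)
  .setDocConcentration(Vectors.dense(0.1, 0.5, 0.4)) // asymmetric alpha
  .setOptimizer(new OnlineLDAOptimizer())
```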
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/feynmanliang/spark
SPARK-8536-LDA-asymmetric-priors
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7307.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7307
----
commit 3c90f3de7f0f18195fa62c49ff4052d304b2dc92
Author: Feynman Liang <[email protected]>
Date: 2015-07-09T01:31:24Z
Generalize OnlineLDA to asymmetric priors, no tests
commit 28fa01e2ba146e823489f6d81c5eb3a76b20c71f
Author: Jonathan Alter <[email protected]>
Date: 2015-07-09T02:28:51Z
[SPARK-8927] [DOCS] Format wrong for some config descriptions
A couple of descriptions were not inside `<td></td>` and were being
displayed immediately under the section title instead of in their rows.
Author: Jonathan Alter <[email protected]>
Closes #7292 from jonalter/docs-config and squashes the following commits:
5ce1570 [Jonathan Alter] [DOCS] Format wrong for some config descriptions
commit a290814877308c6fa9b0f78b1a81145db7651ca4
Author: Yijie Shen <[email protected]>
Date: 2015-07-09T03:20:17Z
[SPARK-8866][SQL] use 1us precision for timestamp type
JIRA: https://issues.apache.org/jira/browse/SPARK-8866
Author: Yijie Shen <[email protected]>
Closes #7283 from yijieshen/micro_timestamp and squashes the following
commits:
dc735df [Yijie Shen] update CastSuite to avoid round error
714eaea [Yijie Shen] add timestamp_udf into blacklist due to precision lose
c3ca2f4 [Yijie Shen] fix unhandled case in CurrentTimestamp
8d4aa6b [Yijie Shen] use 1us precision for timestamp type
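As a rough sketch of what 1us precision implies for the representation (illustrative; not necessarily the exact conversion in this commit), a timestamp fits in a `Long` as microseconds since the epoch:
```scala
// getTime already carries the millisecond part, so only the
// sub-millisecond remainder of getNanos needs to be added.
def toMicros(ts: java.sql.Timestamp): Long =
  ts.getTime * 1000L + (ts.getNanos.toLong / 1000) % 1000L
```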
commit b55499a44ab74e33378211fb0d6940905d7c6318
Author: Josh Rosen <[email protected]>
Date: 2015-07-09T03:28:05Z
[SPARK-8932] Support copy() for UnsafeRows that do not use ObjectPools
We call Row.copy() in many places throughout SQL but UnsafeRow currently
throws UnsupportedOperationException when copy() is called.
Supporting copying when ObjectPool is used may be difficult, since we may
need to handle deep-copying of objects in the pool. In addition, this copy()
method needs to produce a self-contained row object which may be passed around
/ buffered by downstream code which does not understand the UnsafeRow format.
In the long run, we'll need to figure out how to handle the ObjectPool
corner cases, but this may be unnecessary if other changes are made. Therefore,
in order to unblock my sort patch (#6444) I propose that we support copy() for
the cases where UnsafeRow does not use an ObjectPool and continue to throw
UnsupportedOperationException when an ObjectPool is used.
This patch accomplishes this by modifying UnsafeRow so that it knows the
size of the row's backing data in order to be able to copy it into a byte array.
Author: Josh Rosen <[email protected]>
Closes #7306 from JoshRosen/SPARK-8932 and squashes the following commits:
338e6bf [Josh Rosen] Support copy for UnsafeRows that do not use
ObjectPools.
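A minimal sketch of the copy strategy described above (names are illustrative, not UnsafeRow's actual fields): once the row knows the size of its backing data, a self-contained copy is a bounded byte-array copy.
```scala
// Copy the row's backing bytes into a freshly allocated, self-contained
// buffer that downstream code can hold onto safely.
def copyBackingData(base: Array[Byte], offset: Int, sizeInBytes: Int): Array[Byte] = {
  val buf = new Array[Byte](sizeInBytes)
  System.arraycopy(base, offset, buf, 0, sizeInBytes)
  buf
}
```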
commit 47ef423f860c3109d50c7e321616b267f4296e34
Author: Andrew Or <[email protected]>
Date: 2015-07-09T03:29:08Z
[SPARK-8910] Fix MiMa flaky due to port contention issue
Due to the way MiMa works, we currently start a `SQLContext` pretty early
on. This causes us to start a `SparkUI` that attempts to bind to port 4040.
Because many tests run in parallel on the Jenkins machines, this causes port
contention sometimes and fails the MiMa tests.
Note that we already disabled the SparkUI for scalatests. However, the MiMa
test is run before we even have a chance to load the default scalatest
settings, so we need to explicitly disable the UI ourselves.
Author: Andrew Or <[email protected]>
Closes #7300 from andrewor14/mima-flaky and squashes the following commits:
b55a547 [Andrew Or] Do not enable SparkUI during tests
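The setting involved is `spark.ui.enabled`; disabling it explicitly looks roughly like this:
```scala
import org.apache.spark.SparkConf

// With the UI disabled, no SparkUI server tries to bind port 4040,
// so parallel test runs cannot contend for it.
val conf = new SparkConf().set("spark.ui.enabled", "false")
```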
commit aba5784dab24c03ddad89f7a1b5d3d0dc8d109be
Author: Kousuke Saruta <[email protected]>
Date: 2015-07-09T04:28:17Z
[SPARK-8937] [TEST] A setting `spark.unsafe.exceptionOnMemoryLeak` is
missing in the ScalaTest config.
`spark.unsafe.exceptionOnMemoryLeak` is present in the Surefire config:
```
<!-- Surefire runs all Java tests -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <version>2.18.1</version>
  <!-- Note config is repeated in scalatest config -->
  ...
    <spark.unsafe.exceptionOnMemoryLeak>true</spark.unsafe.exceptionOnMemoryLeak>
  </systemProperties>
  ...
```
but is absent from the ScalaTest config.
Author: Kousuke Saruta <[email protected]>
Closes #7308 from sarutak/add-setting-for-memory-leak and squashes the
following commits:
95644e7 [Kousuke Saruta] Added a setting for memory leak
commit 768907eb7b0d3c11a420ef281454e36167011c89
Author: Michael Armbrust <[email protected]>
Date: 2015-07-09T05:05:58Z
[SPARK-8926][SQL] Good errors for ExpectsInputType expressions
For example: `cannot resolve 'testfunction(null)' due to data type
mismatch: argument 1 is expected to be of type int, however, null is of type
datetype.`
Author: Michael Armbrust <[email protected]>
Closes #7303 from marmbrus/expectsTypeErrors and squashes the following
commits:
c654a0e [Michael Armbrust] fix udts and make errors pretty
137160d [Michael Armbrust] style
5428fda [Michael Armbrust] style
10fac82 [Michael Armbrust] [SPARK-8926][SQL] Good errors for
ExpectsInputType expressions
commit a240bf3b44b15d0da5182d6ebec281dbdc5439e8
Author: Reynold Xin <[email protected]>
Date: 2015-07-09T05:08:50Z
Closes #7310.
commit 3dab0da42940a46f0c4aa4853bdb5c64c4cb2613
Author: Cheng Lian <[email protected]>
Date: 2015-07-09T05:09:12Z
[SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x- when
handling Parquet LISTs in compatible mode
This PR is based on #7209 authored by Sephiroth-Lin.
Author: Weizhong Lin <[email protected]>
Closes #7304 from liancheng/spark-8928 and squashes the following commits:
75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when
handling LISTs in compatible mode
commit c056484c0741e2a03d4a916538e1b9e3e65e71c3
Author: Cheng Lian <[email protected]>
Date: 2015-07-09T05:14:38Z
Revert "[SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x-
when handling Parquet LISTs in compatible mode"
This reverts commit 3dab0da42940a46f0c4aa4853bdb5c64c4cb2613.
commit 851e247caad0977cfd4998254d9602624e06539f
Author: Weizhong Lin <[email protected]>
Date: 2015-07-09T05:18:39Z
[SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x- when
handling Parquet LISTs in compatible mode
This PR is based on #7209 authored by Sephiroth-Lin.
Author: Weizhong Lin <[email protected]>
Closes #7314 from liancheng/spark-8928 and squashes the following commits:
75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when
handling LISTs in compatible mode
commit 09cb0d9c2dcb83818ced22ff9bd6a51688ea7ffe
Author: Wenchen Fan <[email protected]>
Date: 2015-07-09T07:26:25Z
[SPARK-8942][SQL] use double not decimal when cast double and float to
timestamp
Author: Wenchen Fan <[email protected]>
Closes #7312 from cloud-fan/minor and squashes the following commits:
a4589fa [Wenchen Fan] use double not decimal when cast double and float to
timestamp
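A sketch of the idea (illustrative, not the commit's exact code): the seconds value stays in double arithmetic all the way to microseconds, with no intermediate Decimal.
```scala
// Convert seconds (as Double) straight to epoch microseconds.
def doubleToTimestampMicros(seconds: Double): Long =
  (seconds * 1000000.0).toLong
```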
commit f88b12537ee81d914ef7c51a08f80cb28d93c8ed
Author: lewuathe <[email protected]>
Date: 2015-07-09T15:16:26Z
[SPARK-6266] [MLLIB] PySpark SparseVector missing doc for size, indices,
values
Writes missing pydocs for the `SparseVector` attributes.
Author: lewuathe <[email protected]>
Closes #7290 from Lewuathe/SPARK-6266 and squashes the following commits:
51d9895 [lewuathe] Update docs
0480d35 [lewuathe] Merge branch 'master' into SPARK-6266
ba42cf3 [lewuathe] [SPARK-6266] PySpark SparseVector missing doc for size,
indices, values
commit 23448a9e988a1b92bd05ee8c6c1a096c83375a12
Author: Davies Liu <[email protected]>
Date: 2015-07-09T16:20:16Z
[SPARK-8931] [SQL] Fallback to interpreted evaluation if failed to compile
in codegen
Exceptions will not be caught during tests.
cc marmbrus rxin
Author: Davies Liu <[email protected]>
Closes #7309 from davies/fallback and squashes the following commits:
969a612 [Davies Liu] throw exception during tests
f844f77 [Davies Liu] fallback
a3091bc [Davies Liu] Merge branch 'master' of github.com:apache/spark into
fallback
364a0d6 [Davies Liu] fallback to interpret mode if failed to compile
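A generic sketch of the fallback pattern (names are illustrative, not Spark's internals): try the compiled path, fall back to interpretation on failure, but rethrow under tests so compilation bugs surface.
```scala
import scala.util.control.NonFatal

trait Projection { def apply(x: Int): Int }

// Stand-ins for the codegen and interpreted paths.
def compileProjection(f: Int => Int): Projection =
  throw new RuntimeException("codegen failed") // simulate a compile failure
def interpretedProjection(f: Int => Int): Projection =
  new Projection { def apply(x: Int): Int = f(x) }

def makeProjection(f: Int => Int, testing: Boolean): Projection =
  try compileProjection(f)
  catch {
    // Swallow the failure and fall back only outside of tests.
    case NonFatal(e) if !testing => interpretedProjection(f)
  }
```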
commit a1964e9d902bb31f001893da8bc81f6dce08c908
Author: Tarek Auel <[email protected]>
Date: 2015-07-09T16:22:24Z
[SPARK-8830] [SQL] native levenshtein distance
Jira: https://issues.apache.org/jira/browse/SPARK-8830
rxin and HuJiayin, can you have a look at it?
Author: Tarek Auel <[email protected]>
Closes #7236 from tarekauel/native-levenshtein-distance and squashes the
following commits:
ee4c4de [Tarek Auel] [SPARK-8830] implemented improvement proposals
c252e71 [Tarek Auel] [SPARK-8830] removed chartAt; use unsafe method for
byte array comparison
ddf2222 [Tarek Auel] Merge branch 'master' into native-levenshtein-distance
179920a [Tarek Auel] [SPARK-8830] added description
5e9ed54 [Tarek Auel] [SPARK-8830] removed StringUtils import
dce4308 [Tarek Auel] [SPARK-8830] native levenshtein distance
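For reference, a compact two-row dynamic-programming formulation of Levenshtein distance (a sketch of the standard algorithm, not the exact code in this commit):
```scala
def levenshtein(a: String, b: String): Int = {
  // prev(j): distances for the previous row of the DP table.
  var prev = Array.tabulate(b.length + 1)(identity)
  var curr = new Array[Int](b.length + 1)
  for (i <- 1 to a.length) {
    curr(0) = i
    for (j <- 1 to b.length) {
      val subst = if (a.charAt(i - 1) == b.charAt(j - 1)) 0 else 1
      curr(j) = math.min(math.min(curr(j - 1) + 1,  // insertion
                                  prev(j) + 1),     // deletion
                         prev(j - 1) + subst)       // substitution
    }
    val tmp = prev; prev = curr; curr = tmp // reuse rows
  }
  prev(b.length)
}
```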
commit 59cc38944fe5c1dffc6551775bd939e2ac66c65e
Author: Liang-Chi Hsieh <[email protected]>
Date: 2015-07-09T16:57:12Z
[SPARK-8940] [SPARKR] Don't overwrite given schema in createDataFrame
JIRA: https://issues.apache.org/jira/browse/SPARK-8940
Currently, the given `schema` parameter is overwritten in `createDataFrame`.
If it is not null, we shouldn't overwrite it.
Author: Liang-Chi Hsieh <[email protected]>
Closes #7311 from viirya/df_not_overwrite_schema and squashes the following
commits:
2385139 [Liang-Chi Hsieh] Don't overwrite given schema if it is not null.
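The fix amounts to a null guard; roughly, in sketch form (the actual change lives in the SparkR code, not Scala):
```scala
// Only infer a schema when the caller did not supply one.
def resolveSchema[T](given: Option[T], inferred: => T): T =
  given.getOrElse(inferred)
```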
commit e204d22bb70f28b1cc090ab60f12078479be4ae0
Author: Reynold Xin <[email protected]>
Date: 2015-07-09T17:01:01Z
[SPARK-8948][SQL] Remove ExtractValueWithOrdinal abstract class
Also added more documentation for the file.
Author: Reynold Xin <[email protected]>
Closes #7316 from rxin/extract-value and squashes the following commits:
069cb7e [Reynold Xin] Removed ExtractValueWithOrdinal.
621b705 [Reynold Xin] Reverted a line.
11ebd6c [Reynold Xin] [Minor][SQL] Improve documentation for complex type
extractors.
commit a870a82fb6f57bb63bd6f1e95da944a30f67519a
Author: Reynold Xin <[email protected]>
Date: 2015-07-09T17:01:33Z
[SPARK-8926][SQL] Code review followup.
I merged https://github.com/apache/spark/pull/7303 so it unblocks another
PR. This addresses my own code review comment for that PR.
Author: Reynold Xin <[email protected]>
Closes #7313 from rxin/adt and squashes the following commits:
7ade82b [Reynold Xin] Fixed unit tests.
f8d5533 [Reynold Xin] [SPARK-8926][SQL] Code review followup.
commit f6c0bd5c3755b2f9bab633a5d478240fdaf1c593
Author: Wenchen Fan <[email protected]>
Date: 2015-07-09T17:04:42Z
[SPARK-8938][SQL] Implement toString for Interval data type
Author: Wenchen Fan <[email protected]>
Closes #7315 from cloud-fan/toString and squashes the following commits:
4fc8d80 [Wenchen Fan] Implement toString for Interval data type
commit c59e268d17cf10e46dbdbe760e2a7580a6364692
Author: JPark <[email protected]>
Date: 2015-07-09T17:23:36Z
[SPARK-8863] [EC2] Check aws access key from aws credentials if there is no
boto config
`spark_ec2.py` uses boto to control EC2. boto supports `~/.aws/credentials`,
which is the AWS CLI's default configuration file, as the boto reference
describes:
"A boto config file is a text file formatted like an .ini configuration
file that specifies values for options that control the behavior of the boto
library. In Unix/Linux systems, on startup, the boto library looks for
configuration files in the following locations and in the following order:
/etc/boto.cfg - for site-wide settings that all users on this machine will
use
(if profile is given) ~/.aws/credentials - for credentials shared between
SDKs
(if profile is given) ~/.boto - for user-specific settings
~/.aws/credentials - for credentials shared between SDKs
~/.boto - for user-specific settings"
* boto reference: http://boto.readthedocs.org/en/latest/boto_config_tut.html
* AWS CLI reference:
http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
However, `spark_ec2.py` only checks the boto config and environment
variables; even when `~/.aws/credentials` exists, `spark_ec2.py` terminates.
So I changed it to also check `~/.aws/credentials`.
cc rxin
Jira: https://issues.apache.org/jira/browse/SPARK-8863
Author: JPark <[email protected]>
Closes #7252 from JuhongPark/master and squashes the following commits:
23c5792 [JPark] Check aws access key from aws credentials if there is no
boto config
commit 0cd84c86cac68600a74d84e50ad40c0c8b84822a
Author: Yuhao Yang <[email protected]>
Date: 2015-07-09T17:26:38Z
[SPARK-8703] [ML] Add CountVectorizer as a ml transformer to convert
document to words count vector
jira: https://issues.apache.org/jira/browse/SPARK-8703
Converts a text document to a sparse vector of token counts.
I can further add an estimator to extract a vocabulary from the corpus if
that's appropriate.
Author: Yuhao Yang <[email protected]>
Closes #7084 from hhbyyh/countVectorization and squashes the following
commits:
5f3f655 [Yuhao Yang] text change
24728e4 [Yuhao Yang] style improvement
576728a [Yuhao Yang] rename to model and some fix
1deca28 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into
countVectorization
99b0c14 [Yuhao Yang] undo extension from HashingTF
12c2dc8 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into
countVectorization
7ee1c31 [Yuhao Yang] extends HashingTF
809fb59 [Yuhao Yang] minor fix for ut
7c61fb3 [Yuhao Yang] add countVectorizer
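A hypothetical usage sketch of such a transformer, assuming a model-style class built from a fixed vocabulary (names follow the PR's direction but may not match the final API):
```scala
import org.apache.spark.ml.feature.CountVectorizerModel

// Map a "words" array column to sparse token-count vectors over a
// fixed vocabulary (assumed constructor signature).
val cv = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")
// cv.transform(df) would append the "features" column.
```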
commit 0b0b9ceaf73de472198c9804fb7ae61fa2a2e097
Author: Cheng Hao <[email protected]>
Date: 2015-07-09T18:11:34Z
[SPARK-8247] [SPARK-8249] [SPARK-8252] [SPARK-8254] [SPARK-8257]
[SPARK-8258] [SPARK-8259] [SPARK-8261] [SPARK-8262] [SPARK-8253] [SPARK-8260]
[SPARK-8267] [SQL] Add String Expressions
Author: Cheng Hao <[email protected]>
Closes #6762 from chenghao-intel/str_funcs and squashes the following
commits:
b09a909 [Cheng Hao] update the code as feedback
7ebbf4c [Cheng Hao] Add more string expressions
commit 7ce3b818fb1ba3f291eda58988e4808e999cae3a
Author: Tathagata Das <[email protected]>
Date: 2015-07-09T20:19:36Z
[MINOR] [STREAMING] Fix log statements in ReceiverSupervisorImpl
Log statements incorrectly showed that the executor was being stopped when
the receiver was being stopped.
Author: Tathagata Das <[email protected]>
Closes #7328 from tdas/fix-log and squashes the following commits:
9cc6e99 [Tathagata Das] Fix log statements.
commit 930fe95350f8865e2af2d7afa5b717210933cd43
Author: xutingjun <[email protected]>
Date: 2015-07-09T20:21:10Z
[SPARK-8953] SPARK_EXECUTOR_CORES is not read in SparkSubmit
The `SPARK_EXECUTOR_CORES` configuration is not put into `SparkConf`, so it
has no effect on dynamic executor allocation.
Author: xutingjun <[email protected]>
Closes #7322 from XuTingjun/SPARK_EXECUTOR_CORES and squashes the following
commits:
2cafa89 [xutingjun] make SPARK_EXECUTOR_CORES has effect to
dynamicAllocation
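The intent of the fix, sketched (illustrative, not SparkSubmit's actual code): propagate the environment variable into `SparkConf` so dynamic allocation can see it.
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
// If the env var is set, surface it as spark.executor.cores.
sys.env.get("SPARK_EXECUTOR_CORES").foreach { cores =>
  conf.set("spark.executor.cores", cores)
}
```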
commit 88bf430331eef3c02438ca441616034486e15789
Author: zsxwing <[email protected]>
Date: 2015-07-09T20:22:17Z
[SPARK-7419] [STREAMING] [TESTS] Fix CheckpointSuite.recovery with file
input stream
Fix this failure:
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2886/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming/CheckpointSuite/recovery_with_file_input_stream/
To reproduce this failure, you can add `Thread.sleep(2000)` before this line
https://github.com/apache/spark/blob/a9c4e29950a14e32acaac547e9a0e8879fd37fc9/streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala#L477
Author: zsxwing <[email protected]>
Closes #7323 from zsxwing/SPARK-7419 and squashes the following commits:
b3caf58 [zsxwing] Fix CheckpointSuite.recovery with file input stream
commit ebdf58538058e57381c04b6725d4be0c37847ed3
Author: Andrew Or <[email protected]>
Date: 2015-07-09T20:25:11Z
[SPARK-2017] [UI] Stage page hangs with many tasks
(This reopens a patch that was closed in the past: #6248)
When you view the stage page while running the following:
```
sc.parallelize(1 to X, 10000).count()
```
The page never loads, the job is stalled, and you end up running into an
OOM:
```
HTTP ERROR 500
Problem accessing /stages/stage/. Reason:
Server Error
Caused by:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at
java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
```
This patch compresses Jetty responses with gzip. The correct long-term fix
is to add pagination.
Author: Andrew Or <[email protected]>
Closes #7296 from andrewor14/gzip-jetty and squashes the following commits:
a051c64 [Andrew Or] Use GZIP to compress Jetty responses
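Mechanically, this amounts to wrapping the existing Jetty handler chain in a gzip handler; a sketch (the `GzipHandler` package location varies across Jetty versions):
```scala
import org.eclipse.jetty.server.Handler
import org.eclipse.jetty.server.handler.GzipHandler

// Wrap an existing handler so its responses are gzip-compressed,
// shrinking the huge stage-page payloads.
def withGzip(inner: Handler): GzipHandler = {
  val gzip = new GzipHandler()
  gzip.setHandler(inner)
  gzip
}
```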
commit c4830598b271cc6390d127bd4cf8ab02b28792e0
Author: Iulian Dragos <[email protected]>
Date: 2015-07-09T20:26:46Z
[SPARK-6287] [MESOS] Add dynamic allocation to the coarse-grained Mesos
scheduler
This is largely based on extracting the dynamic allocation parts from
tnachen's #3861.
Author: Iulian Dragos <[email protected]>
Closes #4984 from dragos/issue/mesos-coarse-dynamicAllocation and squashes
the following commits:
39df8cd [Iulian Dragos] Update tests to latest changes in core.
9d2c9fa [Iulian Dragos] Remove adjustment of executorLimitOption in
doKillExecutors.
8b00f52 [Iulian Dragos] Latest round of reviews.
0cd00e0 [Iulian Dragos] Add persistent shuffle directory
15c45c1 [Iulian Dragos] Add dynamic allocation to the Spark coarse-grained
scheduler.
commit 1f6b0b1234cc03aa2e07aea7fec2de7563885238
Author: zsxwing <[email protected]>
Date: 2015-07-09T20:48:29Z
[SPARK-8701] [STREAMING] [WEBUI] Add input metadata in the batch page
This PR adds `metadata` to `InputInfo`. `InputDStream` can report its
metadata for a batch and it will be shown in the batch page.
For example,

FileInputDStream will display the new files for a batch, and
DirectKafkaInputDStream will display its offset ranges.
Author: zsxwing <[email protected]>
Closes #7081 from zsxwing/input-metadata and squashes the following commits:
f7abd9b [zsxwing] Revert the space changes in project/MimaExcludes.scala
d906209 [zsxwing] Merge branch 'master' into input-metadata
74762da [zsxwing] Fix MiMa tests
7903e33 [zsxwing] Merge branch 'master' into input-metadata
450a46c [zsxwing] Address comments
1d94582 [zsxwing] Raname InputInfo to StreamInputInfo and change "metadata"
to Map[String, Any]
d496ae9 [zsxwing] Add input metadata in the batch page
commit 3ccebf36c5abe04702d4cf223552a94034d980fb
Author: jerryshao <[email protected]>
Date: 2015-07-09T20:54:44Z
[SPARK-8389] [STREAMING] [PYSPARK] Expose KafkaRDDs offsetRange in Python
This PR proposes a simple way to expose OffsetRange in Python code; the
usage of offsetRanges is similar to the Scala/Java way. In Python we can get an
OffsetRange like:
```
dstream.foreachRDD(lambda r: KafkaUtils.offsetRanges(r))
```
The reason I didn't follow what SPARK-8389 suggested is that the Python
Kafka API has one extra step to decode the message compared to Scala/Java,
which makes the Python API return a transformed RDD/DStream rather than a
directly wrapped JavaKafkaRDD. That makes it hard to backtrack to the original
RDD to get the offsetRange.
Author: jerryshao <[email protected]>
Closes #7185 from jerryshao/SPARK-8389 and squashes the following commits:
4c6d320 [jerryshao] Another way to fix subclass deserialization issue
e6a8011 [jerryshao] Address the comments
fd13937 [jerryshao] Fix serialization bug
7debf1c [jerryshao] bug fix
cff3893 [jerryshao] refactor the code according to the comments
2aabf9e [jerryshao] Style fix
848c708 [jerryshao] Add HasOffsetRanges for Python
commit c9e2ef52bb54f35a904427389dc492d61f29b018
Author: Davies Liu <[email protected]>
Date: 2015-07-09T21:43:38Z
[SPARK-7902] [SPARK-6289] [SPARK-8685] [SQL] [PYSPARK] Refactor of
serialization for Python DataFrame
This PR fixes the long-standing serialization issue between Python RDDs and
DataFrames. It switches to a customized Pickler for InternalRow to enable
customized unpickling (type conversion, especially for UDTs); now we can
support UDTs in UDFs. cc mengxr
There is no generated `Row` anymore.
Author: Davies Liu <[email protected]>
Closes #7301 from davies/sql_ser and squashes the following commits:
81bef71 [Davies Liu] address comments
e9217bd [Davies Liu] add regression tests
db34167 [Davies Liu] Refactor of serialization for Python DataFrame
----