GitHub user feynmanliang reopened a pull request:
https://github.com/apache/spark/pull/7307
[SPARK-8536][MLlib] Generalize OnlineLDAOptimizer to asymmetric
document-topic Dirichlet priors
Modify `LDA` to take asymmetric document-topic prior distributions and
`OnlineLDAOptimizer` to use the asymmetric prior during variational inference.
This PR only generalizes `OnlineLDAOptimizer` and the associated
`LocalLDAModel`; `EMLDAOptimizer` and `DistributedLDAModel` still only support
symmetric `alpha` (checked during `EMLDAOptimizer.initialize`).
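For illustration, a minimal sketch of how the generalized optimizer might be configured, assuming this change lets `setDocConcentration` accept a per-topic `Vector` rather than a single symmetric value:
```scala
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vectors

// Sketch: an asymmetric document-topic prior for k = 3 topics.
// Assumes setDocConcentration takes a Vector after this PR.
val lda = new LDA()
  .setK(3)
  .setDocConcentration(Vectors.dense(0.1, 0.5, 0.4)) // asymmetric alpha
  .setOptimizer(new OnlineLDAOptimizer())
```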
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/feynmanliang/spark
SPARK-8536-LDA-asymmetric-priors
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7307.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7307
----
commit 3c90f3de7f0f18195fa62c49ff4052d304b2dc92
Author: Feynman Liang <[email protected]>
Date: 2015-07-09T01:31:24Z
Generalize OnlineLDA to asymmetric priors, no tests
commit 28fa01e2ba146e823489f6d81c5eb3a76b20c71f
Author: Jonathan Alter <[email protected]>
Date: 2015-07-09T02:28:51Z
[SPARK-8927] [DOCS] Format wrong for some config descriptions
A couple of descriptions were not inside `<td></td>` and were being
displayed immediately under the section title instead of in their rows.
Author: Jonathan Alter <[email protected]>
Closes #7292 from jonalter/docs-config and squashes the following commits:
5ce1570 [Jonathan Alter] [DOCS] Format wrong for some config descriptions
commit a290814877308c6fa9b0f78b1a81145db7651ca4
Author: Yijie Shen <[email protected]>
Date: 2015-07-09T03:20:17Z
[SPARK-8866][SQL] use 1us precision for timestamp type
JIRA: https://issues.apache.org/jira/browse/SPARK-8866
Author: Yijie Shen <[email protected]>
Closes #7283 from yijieshen/micro_timestamp and squashes the following
commits:
dc735df [Yijie Shen] update CastSuite to avoid round error
714eaea [Yijie Shen] add timestamp_udf into blacklist due to precision lose
c3ca2f4 [Yijie Shen] fix unhandled case in CurrentTimestamp
8d4aa6b [Yijie Shen] use 1us precision for timestamp type
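As a rough sketch of what 1us precision implies for the representation (illustrative; not necessarily the exact conversion in this commit), a timestamp fits in a `Long` as microseconds since the epoch:
```scala
// getTime already carries the millisecond part, so only the
// sub-millisecond remainder of getNanos needs to be added.
def toMicros(ts: java.sql.Timestamp): Long =
  ts.getTime * 1000L + (ts.getNanos.toLong / 1000) % 1000L
```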
commit b55499a44ab74e33378211fb0d6940905d7c6318
Author: Josh Rosen <[email protected]>
Date: 2015-07-09T03:28:05Z
[SPARK-8932] Support copy() for UnsafeRows that do not use ObjectPools
We call Row.copy() in many places throughout SQL but UnsafeRow currently
throws UnsupportedOperationException when copy() is called.
Supporting copying when ObjectPool is used may be difficult, since we may
need to handle deep-copying of objects in the pool. In addition, this copy()
method needs to produce a self-contained row object which may be passed around
/ buffered by downstream code which does not understand the UnsafeRow format.
In the long run, we'll need to figure out how to handle the ObjectPool
corner cases, but this may be unnecessary if other changes are made. Therefore,
in order to unblock my sort patch (#6444) I propose that we support copy() for
the cases where UnsafeRow does not use an ObjectPool and continue to throw
UnsupportedOperationException when an ObjectPool is used.
This patch accomplishes this by modifying UnsafeRow so that it knows the
size of the row's backing data in order to be able to copy it into a byte array.
Author: Josh Rosen <[email protected]>
Closes #7306 from JoshRosen/SPARK-8932 and squashes the following commits:
338e6bf [Josh Rosen] Support copy for UnsafeRows that do not use
ObjectPools.
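A minimal sketch of the copy strategy described above (names are illustrative, not UnsafeRow's actual fields): once the row knows the size of its backing data, a self-contained copy is a bounded byte-array copy.
```scala
// Copy the row's backing bytes into a freshly allocated, self-contained
// buffer that downstream code can hold onto safely.
def copyBackingData(base: Array[Byte], offset: Int, sizeInBytes: Int): Array[Byte] = {
  val buf = new Array[Byte](sizeInBytes)
  System.arraycopy(base, offset, buf, 0, sizeInBytes)
  buf
}
```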
commit 47ef423f860c3109d50c7e321616b267f4296e34
Author: Andrew Or <[email protected]>
Date: 2015-07-09T03:29:08Z
[SPARK-8910] Fix MiMa flaky due to port contention issue
Due to the way MiMa works, we currently start a `SQLContext` pretty early
on. This causes us to start a `SparkUI` that attempts to bind to port 4040.
Because many tests run in parallel on the Jenkins machines, this causes port
contention sometimes and fails the MiMa tests.
Note that we already disabled the SparkUI for scalatests. However, the MiMa
test is run before we even have a chance to load the default scalatest
settings, so we need to explicitly disable the UI ourselves.
Author: Andrew Or <[email protected]>
Closes #7300 from andrewor14/mima-flaky and squashes the following commits:
b55a547 [Andrew Or] Do not enable SparkUI during tests
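The setting involved is `spark.ui.enabled`; disabling it explicitly looks roughly like this:
```scala
import org.apache.spark.SparkConf

// With the UI disabled, no SparkUI server tries to bind port 4040,
// so parallel test runs cannot contend for it.
val conf = new SparkConf().set("spark.ui.enabled", "false")
```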
commit aba5784dab24c03ddad89f7a1b5d3d0dc8d109be
Author: Kousuke Saruta <[email protected]>
Date: 2015-07-09T04:28:17Z
[SPARK-8937] [TEST] A setting `spark.unsafe.exceptionOnMemoryLeak` is
missing in the ScalaTest config.
`spark.unsafe.exceptionOnMemoryLeak` is present in the Surefire config:
```
<!-- Surefire runs all Java tests -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <version>2.18.1</version>
  <!-- Note config is repeated in scalatest config -->
  ...
    <spark.unsafe.exceptionOnMemoryLeak>true</spark.unsafe.exceptionOnMemoryLeak>
  </systemProperties>
  ...
```
but is absent from the ScalaTest config.
Author: Kousuke Saruta <[email protected]>
Closes #7308 from sarutak/add-setting-for-memory-leak and squashes the
following commits:
95644e7 [Kousuke Saruta] Added a setting for memory leak
commit 768907eb7b0d3c11a420ef281454e36167011c89
Author: Michael Armbrust <[email protected]>
Date: 2015-07-09T05:05:58Z
[SPARK-8926][SQL] Good errors for ExpectsInputType expressions
For example: `cannot resolve 'testfunction(null)' due to data type
mismatch: argument 1 is expected to be of type int, however, null is of type
datetype.`
Author: Michael Armbrust <[email protected]>
Closes #7303 from marmbrus/expectsTypeErrors and squashes the following
commits:
c654a0e [Michael Armbrust] fix udts and make errors pretty
137160d [Michael Armbrust] style
5428fda [Michael Armbrust] style
10fac82 [Michael Armbrust] [SPARK-8926][SQL] Good errors for
ExpectsInputType expressions
commit a240bf3b44b15d0da5182d6ebec281dbdc5439e8
Author: Reynold Xin <[email protected]>
Date: 2015-07-09T05:08:50Z
Closes #7310.
commit 3dab0da42940a46f0c4aa4853bdb5c64c4cb2613
Author: Cheng Lian <[email protected]>
Date: 2015-07-09T05:09:12Z
[SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x- when
handling Parquet LISTs in compatible mode
This PR is based on #7209 authored by Sephiroth-Lin.
Author: Weizhong Lin <[email protected]>
Closes #7304 from liancheng/spark-8928 and squashes the following commits:
75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when
handling LISTs in compatible mode
commit c056484c0741e2a03d4a916538e1b9e3e65e71c3
Author: Cheng Lian <[email protected]>
Date: 2015-07-09T05:14:38Z
Revert "[SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x-
when handling Parquet LISTs in compatible mode"
This reverts commit 3dab0da42940a46f0c4aa4853bdb5c64c4cb2613.
commit 851e247caad0977cfd4998254d9602624e06539f
Author: Weizhong Lin <[email protected]>
Date: 2015-07-09T05:18:39Z
[SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x- when
handling Parquet LISTs in compatible mode
This PR is based on #7209 authored by Sephiroth-Lin.
Author: Weizhong Lin <[email protected]>
Closes #7314 from liancheng/spark-8928 and squashes the following commits:
75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when
handling LISTs in compatible mode
commit 09cb0d9c2dcb83818ced22ff9bd6a51688ea7ffe
Author: Wenchen Fan <[email protected]>
Date: 2015-07-09T07:26:25Z
[SPARK-8942][SQL] use double not decimal when cast double and float to
timestamp
Author: Wenchen Fan <[email protected]>
Closes #7312 from cloud-fan/minor and squashes the following commits:
a4589fa [Wenchen Fan] use double not decimal when cast double and float to
timestamp
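A sketch of the idea (illustrative, not the commit's exact code): the seconds value stays in double arithmetic all the way to microseconds, with no intermediate Decimal.
```scala
// Convert seconds (as Double) straight to epoch microseconds.
def doubleToTimestampMicros(seconds: Double): Long =
  (seconds * 1000000.0).toLong
```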
commit f88b12537ee81d914ef7c51a08f80cb28d93c8ed
Author: lewuathe <[email protected]>
Date: 2015-07-09T15:16:26Z
[SPARK-6266] [MLLIB] PySpark SparseVector missing doc for size, indices,
values
Writes missing pydocs for the `SparseVector` attributes.
Author: lewuathe <[email protected]>
Closes #7290 from Lewuathe/SPARK-6266 and squashes the following commits:
51d9895 [lewuathe] Update docs
0480d35 [lewuathe] Merge branch 'master' into SPARK-6266
ba42cf3 [lewuathe] [SPARK-6266] PySpark SparseVector missing doc for size,
indices, values
commit 23448a9e988a1b92bd05ee8c6c1a096c83375a12
Author: Davies Liu <[email protected]>
Date: 2015-07-09T16:20:16Z
[SPARK-8931] [SQL] Fallback to interpreted evaluation if failed to compile
in codegen
Exceptions will not be caught during tests.
cc marmbrus rxin
Author: Davies Liu <[email protected]>
Closes #7309 from davies/fallback and squashes the following commits:
969a612 [Davies Liu] throw exception during tests
f844f77 [Davies Liu] fallback
a3091bc [Davies Liu] Merge branch 'master' of github.com:apache/spark into
fallback
364a0d6 [Davies Liu] fallback to interpret mode if failed to compile
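A generic sketch of the fallback pattern (names are illustrative, not Spark's internals): try the compiled path, fall back to interpretation on failure, but rethrow under tests so compilation bugs surface.
```scala
import scala.util.control.NonFatal

trait Projection { def apply(x: Int): Int }

// Stand-ins for the codegen and interpreted paths.
def compileProjection(f: Int => Int): Projection =
  throw new RuntimeException("codegen failed") // simulate a compile failure
def interpretedProjection(f: Int => Int): Projection =
  new Projection { def apply(x: Int): Int = f(x) }

def makeProjection(f: Int => Int, testing: Boolean): Projection =
  try compileProjection(f)
  catch {
    // Swallow the failure and fall back only outside of tests.
    case NonFatal(e) if !testing => interpretedProjection(f)
  }
```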
commit a1964e9d902bb31f001893da8bc81f6dce08c908
Author: Tarek Auel <[email protected]>
Date: 2015-07-09T16:22:24Z
[SPARK-8830] [SQL] native levenshtein distance
Jira: https://issues.apache.org/jira/browse/SPARK-8830
rxin and HuJiayin, can you have a look at it?
Author: Tarek Auel <[email protected]>
Closes #7236 from tarekauel/native-levenshtein-distance and squashes the
following commits:
ee4c4de [Tarek Auel] [SPARK-8830] implemented improvement proposals
c252e71 [Tarek Auel] [SPARK-8830] removed chartAt; use unsafe method for
byte array comparison
ddf2222 [Tarek Auel] Merge branch 'master' into native-levenshtein-distance
179920a [Tarek Auel] [SPARK-8830] added description
5e9ed54 [Tarek Auel] [SPARK-8830] removed StringUtils import
dce4308 [Tarek Auel] [SPARK-8830] native levenshtein distance
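For reference, a compact two-row dynamic-programming formulation of Levenshtein distance (a sketch of the standard algorithm, not the exact code in this commit):
```scala
def levenshtein(a: String, b: String): Int = {
  // prev(j): distances for the previous row of the DP table.
  var prev = Array.tabulate(b.length + 1)(identity)
  var curr = new Array[Int](b.length + 1)
  for (i <- 1 to a.length) {
    curr(0) = i
    for (j <- 1 to b.length) {
      val subst = if (a.charAt(i - 1) == b.charAt(j - 1)) 0 else 1
      curr(j) = math.min(math.min(curr(j - 1) + 1,  // insertion
                                  prev(j) + 1),     // deletion
                         prev(j - 1) + subst)       // substitution
    }
    val tmp = prev; prev = curr; curr = tmp // reuse rows
  }
  prev(b.length)
}
```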
commit 59cc38944fe5c1dffc6551775bd939e2ac66c65e
Author: Liang-Chi Hsieh <[email protected]>
Date: 2015-07-09T16:57:12Z
[SPARK-8940] [SPARKR] Don't overwrite given schema in createDataFrame
JIRA: https://issues.apache.org/jira/browse/SPARK-8940
Currently, the given `schema` parameter is overwritten in `createDataFrame`.
If it is not null, we shouldn't overwrite it.
Author: Liang-Chi Hsieh <[email protected]>
Closes #7311 from viirya/df_not_overwrite_schema and squashes the following
commits:
2385139 [Liang-Chi Hsieh] Don't overwrite given schema if it is not null.
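The fix amounts to a null guard; roughly, in sketch form (the actual change lives in the SparkR code, not Scala):
```scala
// Only infer a schema when the caller did not supply one.
def resolveSchema[T](given: Option[T], inferred: => T): T =
  given.getOrElse(inferred)
```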
commit e204d22bb70f28b1cc090ab60f12078479be4ae0
Author: Reynold Xin <[email protected]>
Date: 2015-07-09T17:01:01Z
[SPARK-8948][SQL] Remove ExtractValueWithOrdinal abstract class
Also added more documentation for the file.
Author: Reynold Xin <[email protected]>
Closes #7316 from rxin/extract-value and squashes the following commits:
069cb7e [Reynold Xin] Removed ExtractValueWithOrdinal.
621b705 [Reynold Xin] Reverted a line.
11ebd6c [Reynold Xin] [Minor][SQL] Improve documentation for complex type
extractors.
commit a870a82fb6f57bb63bd6f1e95da944a30f67519a
Author: Reynold Xin <[email protected]>
Date: 2015-07-09T17:01:33Z
[SPARK-8926][SQL] Code review followup.
I merged https://github.com/apache/spark/pull/7303 so it unblocks another
PR. This addresses my own code review comment for that PR.
Author: Reynold Xin <[email protected]>
Closes #7313 from rxin/adt and squashes the following commits:
7ade82b [Reynold Xin] Fixed unit tests.
f8d5533 [Reynold Xin] [SPARK-8926][SQL] Code review followup.
commit f6c0bd5c3755b2f9bab633a5d478240fdaf1c593
Author: Wenchen Fan <[email protected]>
Date: 2015-07-09T17:04:42Z
[SPARK-8938][SQL] Implement toString for Interval data type
Author: Wenchen Fan <[email protected]>
Closes #7315 from cloud-fan/toString and squashes the following commits:
4fc8d80 [Wenchen Fan] Implement toString for Interval data type
commit c59e268d17cf10e46dbdbe760e2a7580a6364692
Author: JPark <[email protected]>
Date: 2015-07-09T17:23:36Z
[SPARK-8863] [EC2] Check aws access key from aws credentials if there is no
boto config
`spark_ec2.py` uses boto to control EC2. boto supports `~/.aws/credentials`,
which is the AWS CLI's default configuration file, as the boto reference
describes:
"A boto config file is a text file formatted like an .ini configuration
file that specifies values for options that control the behavior of the boto
library. In Unix/Linux systems, on startup, the boto library looks for
configuration files in the following locations and in the following order:
/etc/boto.cfg - for site-wide settings that all users on this machine will
use
(if profile is given) ~/.aws/credentials - for credentials shared between
SDKs
(if profile is given) ~/.boto - for user-specific settings
~/.aws/credentials - for credentials shared between SDKs
~/.boto - for user-specific settings"
* boto reference: http://boto.readthedocs.org/en/latest/boto_config_tut.html
* AWS CLI reference:
http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
However, `spark_ec2.py` only checks the boto config and environment
variables; even when `~/.aws/credentials` exists, `spark_ec2.py` terminates.
So I changed it to also check `~/.aws/credentials`.
cc rxin
Jira: https://issues.apache.org/jira/browse/SPARK-8863
Author: JPark <[email protected]>
Closes #7252 from JuhongPark/master and squashes the following commits:
23c5792 [JPark] Check aws access key from aws credentials if there is no
boto config
commit 0cd84c86cac68600a74d84e50ad40c0c8b84822a
Author: Yuhao Yang <[email protected]>
Date: 2015-07-09T17:26:38Z
[SPARK-8703] [ML] Add CountVectorizer as a ml transformer to convert
document to words count vector
jira: https://issues.apache.org/jira/browse/SPARK-8703
Converts a text document to a sparse vector of token counts.
I can further add an estimator to extract a vocabulary from the corpus if
that's appropriate.
Author: Yuhao Yang <[email protected]>
Closes #7084 from hhbyyh/countVectorization and squashes the following
commits:
5f3f655 [Yuhao Yang] text change
24728e4 [Yuhao Yang] style improvement
576728a [Yuhao Yang] rename to model and some fix
1deca28 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into
countVectorization
99b0c14 [Yuhao Yang] undo extension from HashingTF
12c2dc8 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into
countVectorization
7ee1c31 [Yuhao Yang] extends HashingTF
809fb59 [Yuhao Yang] minor fix for ut
7c61fb3 [Yuhao Yang] add countVectorizer
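A hypothetical usage sketch of such a transformer, assuming a model-style class built from a fixed vocabulary (names follow the PR's direction but may not match the final API):
```scala
import org.apache.spark.ml.feature.CountVectorizerModel

// Map a "words" array column to sparse token-count vectors over a
// fixed vocabulary (assumed constructor signature).
val cv = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")
// cv.transform(df) would append the "features" column.
```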
commit 0b0b9ceaf73de472198c9804fb7ae61fa2a2e097
Author: Cheng Hao <[email protected]>
Date: 2015-07-09T18:11:34Z
[SPARK-8247] [SPARK-8249] [SPARK-8252] [SPARK-8254] [SPARK-8257]
[SPARK-8258] [SPARK-8259] [SPARK-8261] [SPARK-8262] [SPARK-8253] [SPARK-8260]
[SPARK-8267] [SQL] Add String Expressions
Author: Cheng Hao <[email protected]>
Closes #6762 from chenghao-intel/str_funcs and squashes the following
commits:
b09a909 [Cheng Hao] update the code as feedback
7ebbf4c [Cheng Hao] Add more string expressions
commit 7ce3b818fb1ba3f291eda58988e4808e999cae3a
Author: Tathagata Das <[email protected]>
Date: 2015-07-09T20:19:36Z
[MINOR] [STREAMING] Fix log statements in ReceiverSupervisorImpl
Log statements incorrectly showed that the executor was being stopped when
the receiver was being stopped.
Author: Tathagata Das <[email protected]>
Closes #7328 from tdas/fix-log and squashes the following commits:
9cc6e99 [Tathagata Das] Fix log statements.
commit 930fe95350f8865e2af2d7afa5b717210933cd43
Author: xutingjun <[email protected]>
Date: 2015-07-09T20:21:10Z
[SPARK-8953] SPARK_EXECUTOR_CORES is not read in SparkSubmit
The `SPARK_EXECUTOR_CORES` configuration is not put into `SparkConf`, so it
has no effect on dynamic executor allocation.
Author: xutingjun <[email protected]>
Closes #7322 from XuTingjun/SPARK_EXECUTOR_CORES and squashes the following
commits:
2cafa89 [xutingjun] make SPARK_EXECUTOR_CORES has effect to
dynamicAllocation
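The intent of the fix, sketched (illustrative, not SparkSubmit's actual code): propagate the environment variable into `SparkConf` so dynamic allocation can see it.
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
// If the env var is set, surface it as spark.executor.cores.
sys.env.get("SPARK_EXECUTOR_CORES").foreach { cores =>
  conf.set("spark.executor.cores", cores)
}
```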
commit 88bf430331eef3c02438ca441616034486e15789
Author: zsxwing <[email protected]>
Date: 2015-07-09T20:22:17Z
[SPARK-7419] [STREAMING] [TESTS] Fix CheckpointSuite.recovery with file
input stream
Fix this failure:
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2886/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming/CheckpointSuite/recovery_with_file_input_stream/
To reproduce this failure, you can add `Thread.sleep(2000)` before this line
https://github.com/apache/spark/blob/a9c4e29950a14e32acaac547e9a0e8879fd37fc9/streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala#L477
Author: zsxwing <[email protected]>
Closes #7323 from zsxwing/SPARK-7419 and squashes the following commits:
b3caf58 [zsxwing] Fix CheckpointSuite.recovery with file input stream
commit ebdf58538058e57381c04b6725d4be0c37847ed3
Author: Andrew Or <[email protected]>
Date: 2015-07-09T20:25:11Z
[SPARK-2017] [UI] Stage page hangs with many tasks
(This reopens a patch that was closed in the past: #6248)
When you view the stage page while running the following:
```
sc.parallelize(1 to X, 10000).count()
```
The page never loads, the job is stalled, and you end up running into an
OOM:
```
HTTP ERROR 500
Problem accessing /stages/stage/. Reason:
Server Error
Caused by:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at
java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
```
This patch compresses Jetty responses with gzip. The correct long-term fix
is to add pagination.
Author: Andrew Or <[email protected]>
Closes #7296 from andrewor14/gzip-jetty and squashes the following commits:
a051c64 [Andrew Or] Use GZIP to compress Jetty responses
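Mechanically, this amounts to wrapping the existing Jetty handler chain in a gzip handler; a sketch (the `GzipHandler` package location varies across Jetty versions):
```scala
import org.eclipse.jetty.server.Handler
import org.eclipse.jetty.server.handler.GzipHandler

// Wrap an existing handler so its responses are gzip-compressed,
// shrinking the huge stage-page payloads.
def withGzip(inner: Handler): GzipHandler = {
  val gzip = new GzipHandler()
  gzip.setHandler(inner)
  gzip
}
```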
commit c4830598b271cc6390d127bd4cf8ab02b28792e0
Author: Iulian Dragos <[email protected]>
Date: 2015-07-09T20:26:46Z
[SPARK-6287] [MESOS] Add dynamic allocation to the coarse-grained Mesos
scheduler
This is largely based on extracting the dynamic allocation parts from
tnachen's #3861.
Author: Iulian Dragos <[email protected]>
Closes #4984 from dragos/issue/mesos-coarse-dynamicAllocation and squashes
the following commits:
39df8cd [Iulian Dragos] Update tests to latest changes in core.
9d2c9fa [Iulian Dragos] Remove adjustment of executorLimitOption in
doKillExecutors.
8b00f52 [Iulian Dragos] Latest round of reviews.
0cd00e0 [Iulian Dragos] Add persistent shuffle directory
15c45c1 [Iulian Dragos] Add dynamic allocation to the Spark coarse-grained
scheduler.
commit 1f6b0b1234cc03aa2e07aea7fec2de7563885238
Author: zsxwing <[email protected]>
Date: 2015-07-09T20:48:29Z
[SPARK-8701] [STREAMING] [WEBUI] Add input metadata in the batch page
This PR adds `metadata` to `InputInfo`. `InputDStream` can report its
metadata for a batch and it will be shown in the batch page.
For example,

FileInputDStream will display the new files for a batch, and
DirectKafkaInputDStream will display its offset ranges.
Author: zsxwing <[email protected]>
Closes #7081 from zsxwing/input-metadata and squashes the following commits:
f7abd9b [zsxwing] Revert the space changes in project/MimaExcludes.scala
d906209 [zsxwing] Merge branch 'master' into input-metadata
74762da [zsxwing] Fix MiMa tests
7903e33 [zsxwing] Merge branch 'master' into input-metadata
450a46c [zsxwing] Address comments
1d94582 [zsxwing] Raname InputInfo to StreamInputInfo and change "metadata"
to Map[String, Any]
d496ae9 [zsxwing] Add input metadata in the batch page
commit 3ccebf36c5abe04702d4cf223552a94034d980fb
Author: jerryshao <[email protected]>
Date: 2015-07-09T20:54:44Z
[SPARK-8389] [STREAMING] [PYSPARK] Expose KafkaRDDs offsetRange in Python
This PR proposes a simple way to expose OffsetRange in Python code; the
usage of offsetRanges is similar to the Scala/Java way. In Python we can get an
OffsetRange like:
```
dstream.foreachRDD(lambda r: KafkaUtils.offsetRanges(r))
```
The reason I didn't follow what SPARK-8389 suggested is that the Python
Kafka API has one extra step to decode the message compared to Scala/Java,
which makes the Python API return a transformed RDD/DStream rather than a
directly wrapped JavaKafkaRDD. That makes it hard to backtrack to the original
RDD to get the offsetRange.
Author: jerryshao <[email protected]>
Closes #7185 from jerryshao/SPARK-8389 and squashes the following commits:
4c6d320 [jerryshao] Another way to fix subclass deserialization issue
e6a8011 [jerryshao] Address the comments
fd13937 [jerryshao] Fix serialization bug
7debf1c [jerryshao] bug fix
cff3893 [jerryshao] refactor the code according to the comments
2aabf9e [jerryshao] Style fix
848c708 [jerryshao] Add HasOffsetRanges for Python
commit c9e2ef52bb54f35a904427389dc492d61f29b018
Author: Davies Liu <[email protected]>
Date: 2015-07-09T21:43:38Z
[SPARK-7902] [SPARK-6289] [SPARK-8685] [SQL] [PYSPARK] Refactor of
serialization for Python DataFrame
This PR fixes the long-standing serialization issue between Python RDDs and
DataFrames. It switches to a customized Pickler for InternalRow to enable
customized unpickling (type conversion, especially for UDTs); now we can
support UDTs in UDFs. cc mengxr
There is no generated `Row` anymore.
Author: Davies Liu <[email protected]>
Closes #7301 from davies/sql_ser and squashes the following commits:
81bef71 [Davies Liu] address comments
e9217bd [Davies Liu] add regression tests
db34167 [Davies Liu] Refactor of serialization for Python DataFrame
----