GitHub user thunterdb opened a pull request:
https://github.com/apache/spark/pull/17419
[SPARK-19634][ML][WIP] Multivariate summarizer - dataframes API
## What changes were proposed in this pull request?
This patch adds the DataFrames API to the multivariate summarizer (mean,
variance, etc.). In addition to all the features of
`MultivariateOnlineSummarizer`, it also allows the user to select a subset of
the metrics. This should resolve some performance issues related to computing
unrequested metrics.
Furthermore, it uses the BLAS API to the extent possible, so the code should
be efficient in the dense case.
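To illustrate the idea of computing only the requested metrics, here is a minimal, self-contained sketch in plain Python. It is not the code in this patch and not Spark's API: the class name `SelectiveSummarizer`, the metric names, and the use of Welford's single-pass update are all assumptions made for illustration.

```python
import math
from typing import Dict, Iterable, List, Sequence

# Hypothetical metric names, chosen for this sketch only.
SUPPORTED = {"count", "mean", "variance", "min", "max"}


class SelectiveSummarizer:
    """Single-pass column statistics that only maintains requested metrics."""

    def __init__(self, metrics: Iterable[str]) -> None:
        self.metrics = set(metrics)
        unknown = self.metrics - SUPPORTED
        if unknown:
            raise ValueError(f"unsupported metrics: {sorted(unknown)}")
        # Variance depends on the running mean, so both share the same state.
        self._need_moments = bool(self.metrics & {"mean", "variance"})
        self._need_min = "min" in self.metrics
        self._need_max = "max" in self.metrics
        self.n = 0
        self._mean: List[float] = []
        self._m2: List[float] = []
        self._min: List[float] = []
        self._max: List[float] = []

    def add(self, row: Sequence[float]) -> "SelectiveSummarizer":
        if self.n == 0:
            d = len(row)
            self._mean = [0.0] * d
            self._m2 = [0.0] * d
            self._min = [math.inf] * d
            self._max = [-math.inf] * d
        elif len(row) != len(self._mean):
            raise ValueError("dimension mismatch")
        self.n += 1
        for j, x in enumerate(row):
            if self._need_moments:
                # Welford's update for the running mean and squared deviations.
                delta = x - self._mean[j]
                self._mean[j] += delta / self.n
                self._m2[j] += delta * (x - self._mean[j])
            if self._need_min and x < self._min[j]:
                self._min[j] = x
            if self._need_max and x > self._max[j]:
                self._max[j] = x
        return self

    def result(self) -> Dict[str, object]:
        out: Dict[str, object] = {}
        if "count" in self.metrics:
            out["count"] = self.n
        if "mean" in self.metrics:
            out["mean"] = list(self._mean)
        if "variance" in self.metrics:
            # Unbiased sample variance; defined as 0.0 for fewer than two rows.
            denom = self.n - 1
            out["variance"] = [m2 / denom if denom > 0 else 0.0 for m2 in self._m2]
        if "min" in self.metrics:
            out["min"] = list(self._min)
        if "max" in self.metrics:
            out["max"] = list(self._max)
        return out


summarizer = SelectiveSummarizer(["mean", "variance"])
for row in [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]:
    summarizer.add(row)
summary = summarizer.result()
```

Because only `mean` and `variance` were requested, the min/max comparisons are skipped on every row and do not appear in `summary`.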
## How was this patch tested?
This patch includes most of the tests of the RDD-based implementation. It
compares results against the existing `MultivariateOnlineSummarizer` and adds
more tests.
This patch also includes some documentation for some low-level constructs
such as `TypedImperativeAggregate`.
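For readers unfamiliar with this kind of aggregate, the following plain-Python sketch mirrors the lifecycle that `TypedImperativeAggregate` defines (create a buffer, update it per row, serialize it, merge partial buffers, evaluate). The Python function names, the three-element buffer layout, and the `struct`-based encoding are assumptions for illustration; they are not taken from this patch or from Spark.

```python
import struct
from typing import Iterable, List, Tuple

Buffer = List[float]  # [count, mean, m2] for a single numeric column


def create_buffer() -> Buffer:
    """Analogue of createAggregationBuffer: an empty partial aggregate."""
    return [0.0, 0.0, 0.0]


def update(buf: Buffer, x: float) -> Buffer:
    """Analogue of update: fold one input value into the buffer in place."""
    n = buf[0] + 1.0
    delta = x - buf[1]
    mean = buf[1] + delta / n
    buf[0], buf[1], buf[2] = n, mean, buf[2] + delta * (x - mean)
    return buf


def merge(buf: Buffer, other: Buffer) -> Buffer:
    """Analogue of merge: combine another partial aggregate into buf."""
    n_a, n_b = buf[0], other[0]
    if n_b == 0:
        return buf
    n = n_a + n_b
    delta = other[1] - buf[1]
    buf[2] = buf[2] + other[2] + delta * delta * n_a * n_b / n
    buf[1] = buf[1] + delta * n_b / n
    buf[0] = n
    return buf


def serialize(buf: Buffer) -> bytes:
    """Analogue of serialize: encode the buffer so it can be shipped."""
    return struct.pack("<3d", *buf)


def deserialize(data: bytes) -> Buffer:
    """Analogue of deserialize: decode a shipped buffer."""
    return list(struct.unpack("<3d", data))


def evaluate(buf: Buffer) -> Tuple[float, float]:
    """Analogue of eval: produce (mean, unbiased sample variance)."""
    n, mean, m2 = buf
    variance = m2 / (n - 1) if n > 1 else 0.0
    return mean, variance


def aggregate_partition(values: Iterable[float]) -> bytes:
    """Aggregate one partition locally and return its serialized buffer."""
    buf = create_buffer()
    for x in values:
        update(buf, x)
    return serialize(buf)


# Two partitions are aggregated independently, then merged.
partials = [aggregate_partition(p) for p in ([1.0, 2.0], [3.0, 4.0, 5.0])]
final = create_buffer()
for data in partials:
    merge(final, deserialize(data))
mean, variance = evaluate(final)
```

The serialize/deserialize round trip is where an object-based buffer crosses process boundaries, so it is a natural place to look when debugging problems with partial aggregates.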
## Performance
I have not benchmarked this patch against the existing implementation.
However, it uses the recommended low-level SQL APIs, so it should be
interesting to compare both implementations in that respect.
## WIP
Marked as WIP because some debugging comments are still present in the code.
Thanks to @hvanhovell and Cheng Liang for suggestions on SparkSQL.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/thunterdb/spark 19634
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17419.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17419
----
commit f3fa6580bca70f3307d70e938ef8531c928d958b
Author: Timothy Hunter <[email protected]>
Date: 2017-03-03T18:36:02Z
work
commit 7539835dad863a6b73d88d79983342f9ddb7fb9d
Author: Timothy Hunter <[email protected]>
Date: 2017-03-06T22:38:41Z
work on the test suite
commit 673943f334b94e5d1ecd8874cb82bbc875d739c6
Author: Timothy Hunter <[email protected]>
Date: 2017-03-07T00:01:30Z
last work
commit 202b672afec127f4e0885cf3a58f4dfc97031fc6
Author: Timothy Hunter <[email protected]>
Date: 2017-03-13T22:48:47Z
work on using imperative aggregators
commit be019813f241d0ad3559b4d84339f1bb1055cbc4
Author: Timothy Hunter <[email protected]>
Date: 2017-03-17T21:44:40Z
Merge remote-tracking branch 'upstream/master' into 19634
commit a983284cfeddabd017792e3991cf99a7d3ab1e16
Author: Timothy Hunter <[email protected]>
Date: 2017-03-18T00:14:40Z
more work on summarizer
commit 647a4fecb17d478c3c8cd68d40f2a9456eb10c66
Author: Timothy Hunter <[email protected]>
Date: 2017-03-21T17:47:30Z
work
commit 3c4bef772a3cbc759e43223af658a357c5ca6bc2
Author: Timothy Hunter <[email protected]>
Date: 2017-03-21T18:54:16Z
changes
commit 56390ccc456c67b2f7a08c1271fa50408518da0f
Author: Timothy Hunter <[email protected]>
Date: 2017-03-21T18:54:19Z
Merge remote-tracking branch 'upstream/master' into 19634
commit c3f236c4422031ae818cb6bbec2415b3f1bf7b70
Author: Timothy Hunter <[email protected]>
Date: 2017-03-21T19:03:07Z
cleanup
commit ef955c00275705f14342f3e4ed970a78f0f3c141
Author: Timothy Hunter <[email protected]>
Date: 2017-03-21T22:42:42Z
debugging
commit a04f923913ca1118a61d66bd53b8514af62594d7
Author: Timothy Hunter <[email protected]>
Date: 2017-03-21T23:14:23Z
work
commit 946d490c8b29e55ec0e6d40785122269063894ad
Author: Timothy Hunter <[email protected]>
Date: 2017-03-22T21:14:29Z
Merge remote-tracking branch 'upstream/master' into 19634
commit 201eb7712054967cd5093d3a908f4ebbd73f30a8
Author: Timothy Hunter <[email protected]>
Date: 2017-03-22T21:19:57Z
debug
commit f4dec88a49d0a20e1b328617fd721633fd8c201a
Author: Timothy Hunter <[email protected]>
Date: 2017-03-23T18:27:19Z
trying to debug serialization issue
commit 4af0f47d326ef91d7cf9ccaf6a45ee3f904b191f
Author: Timothy Hunter <[email protected]>
Date: 2017-03-23T23:16:10Z
better tests
commit 9f29030f75089884156bdc4ee634857b3730114d
Author: Timothy Hunter <[email protected]>
Date: 2017-03-24T00:12:28Z
changes
commit e9877dc2f08d393f079bdf6fbbf1b9b9abaa21da
Author: Timothy Hunter <[email protected]>
Date: 2017-03-24T21:04:32Z
debugging
commit 3a11d0265ef665a63cd070eeb1ae4ac25bc89908
Author: Timothy Hunter <[email protected]>
Date: 2017-03-24T22:14:06Z
more tests and debugging
commit 6d26c17d0bd4ab18d564ee7f37916780211702d5
Author: Timothy Hunter <[email protected]>
Date: 2017-03-24T23:12:19Z
fixed tests
commit 35eaeb0d02ae9cc29ae559231fe4858935315477
Author: Timothy Hunter <[email protected]>
Date: 2017-03-24T23:23:15Z
doc
----