GitHub user hhbyyh opened a pull request:
https://github.com/apache/spark/pull/11601
[SPARK-13568] [ML] Create feature transformer to impute missing values
## What changes were proposed in this pull request?
It is quite common to encounter missing values in data sets. It would be
useful to implement a Transformer that can impute missing data points, similar
to e.g. Imputer in scikit-learn.
Initially, options for imputation could include mean, median and most
frequent, but we could add various other approaches. Where possible, existing
DataFrame code can be used (e.g. for approximate quantiles).
Currently this PR supports imputation for Double and Vector (null and NaN
in Vector).
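
As a rough illustration of the three strategies named above, here is a minimal pure-Python sketch. It is not the code in this PR and does not use Spark: the function names (`is_missing`, `fit_surrogate`, `impute`) and the strategy strings are hypothetical, chosen only for this example. It assumes both `None` (null) and NaN count as missing and that the replacement value is computed from the observed entries only.

```python
import math
from collections import Counter
from statistics import mean, median


def is_missing(value):
    """Treat None (null) and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_surrogate(values, strategy="mean"):
    """Compute the replacement value from the non-missing entries only."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        return mean(present)
    if strategy == "median":
        return median(present)
    if strategy == "most_frequent":
        # most_common(1) returns a list holding one (value, count) pair.
        return Counter(present).most_common(1)[0][0]
    raise ValueError(f"unknown strategy: {strategy}")


def impute(values, strategy="mean"):
    """Return a copy of values with every missing entry replaced."""
    surrogate = fit_surrogate(values, strategy)
    return [surrogate if is_missing(v) else v for v in values]


column = [1.0, None, 2.0, float("nan"), 3.0, 3.0]
print(impute(column, "mean"))           # missing entries become 2.25
print(impute(column, "median"))         # missing entries become 2.5
print(impute(column, "most_frequent"))  # missing entries become 3.0
```

Splitting the work into a fit step (`fit_surrogate`) and an apply step (`impute`) loosely mirrors the estimator/model split used by ML pipelines, so the value learned on training data could be reused on new data.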
## How was this patch tested?
New unit tests and manual testing.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hhbyyh/spark imputer
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11601.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11601
----
commit 2999b268192e244bd7a520d62a0914e4742ee45d
Author: Yuhao Yang <[email protected]>
Date: 2016-02-29T17:46:04Z
initial commit for Imputer
commit 8335cf21ebde164a22f3447000a1c468a69f39fc
Author: Yuhao Yang <[email protected]>
Date: 2016-02-29T18:27:40Z
adjust mean and most
commit 7be5e9bcb2c9cd7671d128b01f5090ee737d207a
Author: Yuhao Yang <[email protected]>
Date: 2016-03-02T17:44:50Z
Merge remote-tracking branch 'upstream/master' into imputer
commit 131f7d5b061a75242e7c305ba14c8c759d09c532
Author: Yuhao Yang <[email protected]>
Date: 2016-03-03T03:07:21Z
Merge remote-tracking branch 'upstream/master' into imputer
commit a72a3ea81f6f76439068650cf47e4f784e0c4b7c
Author: Yuhao Yang <[email protected]>
Date: 2016-03-05T19:00:37Z
Merge remote-tracking branch 'upstream/master' into imputer
commit 78df589e488bbec963b3969012cf9266fe4895cb
Author: Yuhao Yang <[email protected]>
Date: 2016-03-07T20:26:00Z
Merge remote-tracking branch 'upstream/master' into imputer
commit b949be5746608ca3861df672ccd76d9af4257ae2
Author: Yuhao Yang <[email protected]>
Date: 2016-03-09T02:19:32Z
refine code and add ut
commit 79b1c62b644aa05f07a33f13cc78f47a99d7e861
Author: Yuhao Yang <[email protected]>
Date: 2016-03-09T02:19:39Z
Merge remote-tracking branch 'upstream/master' into imputer
commit c3d5d554f5ee90a18d96ff043f03f51f49d2ca7f
Author: Yuhao Yang <[email protected]>
Date: 2016-03-09T03:52:04Z
minor change
commit 1b3966800982fa980307d1b6ded6e28e5f5985e8
Author: Yuhao Yang <[email protected]>
Date: 2016-03-09T07:57:38Z
add object Imputer and ut refine
----