GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/12611
[SPARK-14843][ML] Fix encoding error in LibSVMRelation
## What changes were proposed in this pull request?
The libsvm data source uses `RowEncoder` to serialize the label and features
read from libsvm files. However, the schema passed to this encoder is not
correct: it is `requiredSchema` rather than the full data schema. As a result,
we can't correctly select the `features` column from the DataFrame. We should
use the full data schema to serialize the data read in, and then apply a
projection to select the required columns.
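
To illustrate the idea outside of Spark, here is a minimal, dependency-free sketch. It is a toy model under stated assumptions, not the code in this patch and not Spark's API: `FULL_SCHEMA`, `encode`, `project`, and `read_row` are hypothetical names, and `encode` only stands in for a positional encoder that checks each value against the schema it was given.

```python
# Toy model of "encode with the full schema, then project".
# All names are hypothetical; this is not Spark's RowEncoder.

# Every parsed libsvm record carries both columns, in this order.
FULL_SCHEMA = [("label", float), ("features", list)]


def encode(row, schema):
    """Check a row against a schema by position, as a positional encoder would."""
    for value, (name, expected) in zip(row, schema):
        if not isinstance(value, expected):
            raise TypeError(
                f"column {name!r}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return tuple(row)


def project(encoded, full_schema, required):
    """Select the required columns by name from a fully encoded row."""
    index = {name: i for i, (name, _) in enumerate(full_schema)}
    return tuple(encoded[index[name]] for name in required)


def read_row(parsed, required):
    """Encode with the full schema first, then project the required columns."""
    return project(encode(parsed, FULL_SCHEMA), FULL_SCHEMA, required)


parsed = (1.0, [0.0, 2.5, 0.0])  # (label, features) from one libsvm line

# Encoding with only the required schema misreads position 0 (the label)
# as `features` and fails in this toy model.
try:
    encode(parsed, [("features", list)])
    mismatch_detected = False
except TypeError:
    mismatch_detected = True

features_only = read_row(parsed, ["features"])
```

In this sketch the parsed record always has the full shape, so only the full schema lines up with it by position; the projection step is what narrows the row to the requested columns.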
## How was this patch tested?
Covered by `LibSVMRelationSuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 fix-libsvm
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12611.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12611
----
commit 1ceed49861e992693f5812cc1f14270a17a9694e
Author: Liang-Chi Hsieh
Date: 2016-04-22T13:38:44Z
Use correct schema for RowEncoder.
commit 5777ee5b6bd1016d652e55394a387fc728accba0
Author: Liang-Chi Hsieh
Date: 2016-04-22T13:48:45Z
Add test.
----