Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/245#issuecomment-39141551
@yinxusen Yes, feature transformation should be done before learning
algorithms. This gives a better separation. It also allows us to plug in more
powerful tools for feature transformation in the future. I'm thinking about
PMML at this time but there might be other options. Users should decide whether
to cache the data before or after transformation. Caching the transformed data
is sometimes expensive because the transformation can densify the vectors or
blow up the feature space. But IMHO this shouldn't be handled by learning
algorithms. Ideally, feature transformation includes adding the intercept. But
since it is used very commonly, I left the option there and set the default to
false. Prepending the intercept requires re-allocating the vectors. You can see
the difference easily.
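
To illustrate the re-allocation point, here is a minimal sketch in plain Python.
It is not MLlib code: `prepend_intercept` is a hypothetical helper and the
feature vector is assumed to be a dense array of doubles. The input buffer has
no spare slot at the front, so a new buffer of length n + 1 has to be allocated
and every feature copied into it.

```python
from array import array


def prepend_intercept(features: array) -> array:
    """Return a new vector [1.0, x_0, ..., x_{n-1}].

    Hypothetical helper for illustration only. The input is left untouched;
    a fresh buffer is allocated and all n features are copied into it.
    """
    out = array("d", [1.0])  # new buffer, starts with the intercept term
    out.extend(features)     # copies every original feature value
    return out


if __name__ == "__main__":
    x = array("d", [2.0, 3.0])
    y = prepend_intercept(x)
    print(list(x))  # original vector, unchanged
    print(list(y))  # new vector with the intercept in front
```

Doing this once per example in a separate transformation step, rather than
inside each learning algorithm, is the kind of separation I have in mind.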