GitHub user wangmiao1981 commented on the issue:
https://github.com/apache/spark/pull/15770
Joseph K. Bradley added a comment - 31/Oct/16 18:14
Miao Wang: Sorry for the slow response here. I do want us to add PIC (Power
Iteration Clustering) to spark.ml, but we should discuss the design before the
PR. Could you please close the PR for now but save the branch to re-open after
discussion?
Let's have a design discussion first.
I agree that the big issue is that there isn't a clear way to make
predictions on new data points. In fact, I've never heard of people trying to
do so. Has anyone else?
Assuming that prediction is not meaningful for PIC, then I don't think the
algorithm fits within the Pipeline framework, though it's debatable. I see a
few options:
* Put PIC in Pipelines as a Transformer, not an Estimator. We would just
  need to document that it is a very expensive Transformer.
* Put PIC in spark.ml as a static method. We may have to do this anyways
  to support all of spark.mllib's Statistics.
* Put PIC in GraphFrames (and push harder for GraphFrames to be merged
  back into Spark, which will include a much longer set of improvements).
My top choice is PIC as a Transformer. What do you think?
CC Yanbo Liang, Seth Hendrickson, Nick Pentreath: opinions?
Seth Hendrickson (sethah) added a comment - 31/Oct/16 22:40
This seems like it fits the framework of a feature transformer. We could
generate a real-valued feature column using the PIC algorithm, where the values
are just the components of the pseudo-eigenvector. Alternatively, we could
pipeline a KMeans clustering on the end, but I think it makes more sense to let
users do that themselves. That's up for debate, though.
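
To make the feature-transformer idea concrete, here is a minimal pure-Python
sketch of the truncated power iteration that produces the pseudo-eigenvector.
It is only an illustration under stated assumptions, not Spark's
implementation: the function name pic_embedding, its parameters, the dense
list-of-lists affinity matrix, and the toy data are all hypothetical, and the
affinities are assumed to be symmetric and non-negative.

```python
def pic_embedding(affinity, max_iter=50, tol=1e-5):
    """Return one real-valued feature per point (a pseudo-eigenvector).

    Hypothetical helper for illustration only. `affinity` is assumed to be a
    dense, symmetric, non-negative similarity matrix (list of lists).
    """
    degrees = [sum(row) for row in affinity]
    if any(d <= 0 for d in degrees):
        raise ValueError("every point needs a positive total affinity")

    # Start from the degree vector, scaled to sum to 1.
    total = sum(degrees)
    v = [d / total for d in degrees]

    prev_delta = float("inf")
    for _ in range(max_iter):
        # Multiply by the row-normalized affinity matrix W = D^-1 A.
        wv = [
            sum(a * x for a, x in zip(row, v)) / d
            for row, d in zip(affinity, degrees)
        ]
        # Rescale so the entries sum to 1 (L1 normalization).
        norm = sum(abs(x) for x in wv)
        new_v = [x / norm for x in wv]

        # Stop early, once the per-step change itself stops changing,
        # instead of running to full convergence (a constant vector).
        delta = max(abs(a - b) for a, b in zip(new_v, v))
        v = new_v
        if abs(prev_delta - delta) < tol:
            break
        prev_delta = delta
    return v


# Toy data (made up): points 0-2 form one tight group, points 3-4 another,
# with only weak affinity (0.01) between the two groups.
affinity = [
    [0.0, 1.0, 1.0, 0.01, 0.01],
    [1.0, 0.0, 1.0, 0.01, 0.01],
    [1.0, 1.0, 0.0, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.0, 1.0],
    [0.01, 0.01, 0.01, 1.0, 0.0],
]
features = pic_embedding(affinity)

# Shape of the output a feature transformer would append as a new column.
rows = [{"id": i, "pic_feature": f} for i, f in enumerate(features)]
```

Points in the same group end up with nearly identical feature values, so a
user could run KMeans (or even a simple threshold) on that single column as a
separate downstream step, which is the split of responsibilities suggested
above.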