Joseph K. Bradley created SPARK-9704:
----------------------------------------
Summary: Make some ML APIs public: VectorUDT, Identifiable,
ProbabilisticClassifier
Key: SPARK-9704
URL: https://issues.apache.org/jira/browse/SPARK-9704
Project: Spark
Issue Type: Improvement
Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
This JIRA proposes making several ML APIs public so that users can more easily
write their own Pipeline stages.
Issue brought up by [~eronwright]. Descriptions below copied from
[http://apache-spark-developers-list.1001551.n3.nabble.com/Make-ML-Developer-APIs-public-post-1-4-td13583.html].
We plan to make these APIs public in Spark 1.5. However, they will be marked
DeveloperApi and are *very likely* to change in incompatible ways in future
releases.
* VectorUDT: To define a relation with a vector field, VectorUDT must be
instantiated.
* Identifiable trait: The trait generates a unique identifier for the
associated pipeline component. Reusing the trait gives third-party components
identifiers in a consistent format.
* ProbabilisticClassifier: Third-party components should leverage its complex
logic for computing only the selected columns.
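
As a rough illustration of the last two items, the sketch below uses plain
Python rather than Spark's API. All names (`random_uid`,
`ToyProbabilisticModel`, the column-name parameters) are hypothetical, and the
convention that an empty column name disables an output is an assumption made
for this example. It shows a component that carries a generated identifier and
computes only the columns that were requested, reusing intermediate results
where it can.

```python
import math
import uuid


def random_uid(prefix):
    # Hypothetical helper: prefix plus the last 12 characters of a random UUID.
    return prefix + "_" + str(uuid.uuid4())[-12:]


class ToyProbabilisticModel:
    """Toy linear classifier; an empty column name means 'skip that output'."""

    def __init__(self, weights, raw_col="rawPrediction",
                 prob_col="probability", pred_col="prediction"):
        self.uid = random_uid("toyModel")
        self.weights = weights  # one weight vector per class
        self.raw_col = raw_col
        self.prob_col = prob_col
        self.pred_col = pred_col

    def predict_raw(self, features):
        # One raw score per class: dot product of class weights and features.
        return [sum(w * x for w, x in zip(ws, features)) for ws in self.weights]

    @staticmethod
    def raw_to_probability(raw):
        # Softmax, shifted by the maximum score for numerical stability.
        top = max(raw)
        exps = [math.exp(r - top) for r in raw]
        total = sum(exps)
        return [e / total for e in exps]

    def transform(self, row):
        out = dict(row)  # copy so the input row is left untouched
        raw = None
        prob = None
        if self.raw_col:
            raw = self.predict_raw(row["features"])
            out[self.raw_col] = raw
        if self.prob_col:
            if raw is None:
                raw = self.predict_raw(row["features"])
            prob = self.raw_to_probability(raw)
            out[self.prob_col] = prob
        if self.pred_col:
            # Reuse scores computed above; softmax preserves the argmax.
            if prob is not None:
                scores = prob
            elif raw is not None:
                scores = raw
            else:
                scores = self.predict_raw(row["features"])
            out[self.pred_col] = scores.index(max(scores))
        return out
```

With `raw_col=""` and `prob_col=""`, `transform` adds only the prediction
column and never runs the softmax step.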
We will not yet make these public:
* SchemaUtils: Third-party pipeline components need to check column types and
append columns.
** This will probably be moved into Spark SQL. Users can copy the methods into
their own code as needed.
* Shared Params (HasLabel, HasFeatures): This is covered in [SPARK-7146] but is
reiterated here.
** We need to discuss whether these should be standardized public APIs. Users
can copy the traits into their own code as needed.
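
For the column checks and appends mentioned under SchemaUtils, a component can
carry its own small helpers in the meantime. The following is a minimal sketch
in plain Python, not a copy of Spark's SchemaUtils: the schema is modeled as a
list of (name, type) pairs, and the function names, type strings, and error
behavior are assumptions chosen for illustration.

```python
def check_column_type(schema, col_name, expected_type):
    """Raise ValueError unless col_name exists and has expected_type."""
    fields = dict(schema)
    if col_name not in fields:
        raise ValueError("Column %s does not exist." % col_name)
    actual = fields[col_name]
    if actual != expected_type:
        raise ValueError("Column %s must be of type %s but was %s."
                         % (col_name, expected_type, actual))


def append_column(schema, col_name, col_type):
    """Return a new schema with the column appended; the input is not modified."""
    if not col_name:
        # Assumption: an empty name means the caller wants no such column.
        return list(schema)
    if any(name == col_name for name, _ in schema):
        raise ValueError("Column %s already exists." % col_name)
    return list(schema) + [(col_name, col_type)]


# Usage: validate the input column, then declare the output column.
schema = [("label", "double"), ("features", "vector")]
check_column_type(schema, "features", "vector")
new_schema = append_column(schema, "prediction", "double")
```

Returning a new list instead of mutating the input keeps the helper safe to
call from a schema-validation step that may run more than once.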