Re: Make ML Developer APIs public (post-1.4)

2015-08-06 Thread Joseph Bradley
Eron,

Thanks for sending out this list!  We can make some of the critical ones
public for 1.5, but they will be marked DeveloperApi since they may require
changes in the future.  I just created the JIRA
(https://issues.apache.org/jira/browse/SPARK-9704) and I'll send a PR soon.

Joseph

On Mon, Aug 3, 2015 at 4:51 PM, Eron Wright ewri...@live.com wrote:


 Hello,

 In developing new *third-party* *pipeline components* for Spark ML 1.4
 (see dl4j-spark-ml), I encountered a few gaps in the earlier effort to make
 the ML Developer APIs public (SPARK-5995).  I plan to file issues after
 we discuss on this thread.  Below is a list of types that are presently
 private but might best be made public.

 1. *VectorUDT*.  To define a relation with a vector field, VectorUDT
 must be instantiated.
 2. *SchemaUtils*.  Third-party pipeline components need to check column
 types and append columns.
 3. *Identifiable trait*.  The trait generates a unique identifier for
 the associated pipeline component.  It would be nice to have a consistent
 format by reusing the trait.
 4. *ProbabilisticClassifier*.  Third-party components should leverage
 the complex logic around computing only selected columns.
 5. *Shared Params* (HasLabel, HasFeatures).  This is covered in
 SPARK-7146 but I am reiterating it here.

 Thanks,
 Eron Wright



Make ML Developer APIs public (post-1.4)

2015-08-03 Thread Eron Wright

Hello,

In developing new third-party pipeline components for Spark ML 1.4 (see
dl4j-spark-ml), I encountered a few gaps in the earlier effort to make the ML
Developer APIs public (SPARK-5995).  I plan to file issues after we discuss
on this thread.  Below is a list of types that are presently private but
might best be made public.

1. VectorUDT.  To define a relation with a vector field, VectorUDT must be
instantiated.
2. SchemaUtils.  Third-party pipeline components have a need for checking
column types and appending columns.
3. Identifiable trait.  The trait generates a unique identifier for the
associated pipeline component.  Nice to have a consistent format by reusing
the trait.
4. ProbabilisticClassifier.  Third-party components should leverage the
complex logic around computing only selected columns.
5. Shared Params (HasLabel, HasFeatures).  This is covered in SPARK-7146 but
reiterating it here.

Thanks,
Eron Wright
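
To illustrate the SchemaUtils item in the list above, here is a minimal, self-contained sketch of the two operations the message mentions: checking a column's type and appending a column. It is not Spark's implementation. The schema is modelled as a plain list of (name, type) pairs, and the names `check_column_type` and `append_column` are hypothetical, chosen for this example only.

```python
from typing import List, Tuple

# Assumption: a schema is an ordered list of (column name, type name) pairs.
# This stands in for a real schema type purely for illustration.
Schema = List[Tuple[str, str]]


def check_column_type(schema: Schema, col_name: str, expected_type: str) -> None:
    """Raise ValueError unless col_name exists and has expected_type."""
    types = dict(schema)
    if col_name not in types:
        raise ValueError(f"Column {col_name!r} does not exist.")
    actual = types[col_name]
    if actual != expected_type:
        raise ValueError(
            f"Column {col_name!r} must be of type {expected_type} but was {actual}."
        )


def append_column(schema: Schema, col_name: str, col_type: str) -> Schema:
    """Return a new schema with the column appended; reject duplicate names."""
    if col_name in dict(schema):
        raise ValueError(f"Column {col_name!r} already exists.")
    # Build a new list so the caller's schema is left unmodified.
    return schema + [(col_name, col_type)]


if __name__ == "__main__":
    input_schema = [("label", "double"), ("features", "vector")]
    check_column_type(input_schema, "features", "vector")
    output_schema = append_column(input_schema, "prediction", "double")
    print(output_schema)
```

A pipeline component would typically run checks like these when validating its input schema and declaring its output columns, which is why a shared, public helper avoids each third-party component reimplementing them.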