[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305973#comment-14305973
 ] 

Sean Owen commented on SPARK-4587:
----------------------------------

Coming late to the discussion, with a few comments on the design doc:

- PMML is not a Zementis-led format BTW
- I don't find the verbosity of XML to be a big problem, with the possible 
exception of decision forests. It compresses well.
- Export to PMML is much easier than import, to the extent that I am not sure I 
would even bother with import in the medium term. So much would be unsupported 
that it would probably cause more confusion than help.
- jpmml-evaluator is not needed for model import/export; the part that is 
needed, jpmml-model, is BSD 3-clause licensed.
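
To illustrate the compression point above, here is a minimal sketch that builds 
a synthetic, PMML-like list of tree nodes and gzips it. The element and 
attribute names imitate PMML's TreeModel but the document is not 
schema-validated, and the node count and field names are made up for the 
example; real models may compress better or worse.

```python
import gzip
import xml.etree.ElementTree as ET


def synthetic_tree_xml(num_nodes: int) -> bytes:
    """Build a flat run of PMML-like <Node> elements (illustrative only)."""
    root = ET.Element("TreeModel", functionName="classification")
    for i in range(num_nodes):
        node = ET.SubElement(root, "Node", id=str(i), score=str(i % 2))
        # Hypothetical split: feature name and threshold are arbitrary.
        ET.SubElement(
            node,
            "SimplePredicate",
            field=f"feature_{i % 10}",
            operator="lessOrEqual",
            value=str((i * 37) % 100 / 10.0),
        )
    return ET.tostring(root, encoding="utf-8")


raw = synthetic_tree_xml(5000)
packed = gzip.compress(raw)
ratio = len(raw) / len(packed)
print(f"raw={len(raw)} bytes, gzip={len(packed)} bytes, ratio={ratio:.1f}x")
```

The repeated tag and attribute names are what gzip removes, which is why the 
verbosity matters less on disk than it appears to in an editor.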

Is this Parquet-based format internal to Spark? Then why not just use the 
model's serialized form (modulo the issue of distributed models, of course)? If 
it's not internal, it looks like yet another format that is a subset of PMML, 
written differently, that nothing else will read. What does it add? Is its role 
really for serializing pipelines?

The unsupported model types are really an issue, though. You can make up your 
own serialization in an Extension element, which is perhaps better than 
conceiving a wholly separate format. I still imagine a huge factored matrix 
model can't reasonably be contained in any file format; I have resorted to just 
recording pointers to the location of the distributed data in a model file.
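
As a rough sketch of that pointer approach, the snippet below writes a small 
PMML-shaped document whose top-level Extension elements record where the 
distributed factor matrices live, then reads the pointers back. The extension 
names (rank, userFactors, itemFactors), the extender value, and the HDFS paths 
are all hypothetical, and the document is not validated against the PMML 
schema (for example, the DataDictionary is left empty).

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_2"
ET.register_namespace("", PMML_NS)  # serialize with a default namespace


def tag(name: str) -> str:
    """Qualify an element name with the PMML namespace."""
    return f"{{{PMML_NS}}}{name}"


def factor_model_pmml(rank: int, user_path: str, item_path: str) -> str:
    """Write a model file that points at distributed data instead of embedding it."""
    root = ET.Element(tag("PMML"), version="4.2")
    ET.SubElement(root, tag("Header"), description="factored matrix model (pointers only)")
    ET.SubElement(root, tag("DataDictionary"))  # left empty in this sketch
    # Hypothetical extension names; Extension carries extender/name/value attributes.
    for name, value in (("rank", str(rank)),
                        ("userFactors", user_path),
                        ("itemFactors", item_path)):
        ET.SubElement(root, tag("Extension"), extender="example", name=name, value=value)
    return ET.tostring(root, encoding="unicode")


def read_extensions(pmml_text: str) -> dict:
    """Collect name -> value for every Extension element in the document."""
    root = ET.fromstring(pmml_text)
    return {e.get("name"): e.get("value") for e in root.iter(tag("Extension"))}


doc = factor_model_pmml(10, "hdfs:///models/als/userFactors", "hdfs:///models/als/itemFactors")
pointers = read_extensions(doc)
print(pointers)
```

A loader would then open the referenced locations with whatever distributed 
reader it already uses; the model file itself stays small regardless of the 
size of the factor matrices.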

I suppose I'm worried about ending up with a bunch of half-finished 
imports/exports rather than focusing. IMHO the top priority should be PMML 
export; then we can see how it goes.

> Model export/import
> -------------------
>
>                 Key: SPARK-4587
>                 URL: https://issues.apache.org/jira/browse/SPARK-4587
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Joseph K. Bradley
>            Priority: Critical
>
> This is an umbrella JIRA for one of the most requested features on the user 
> mailing list. Model export/import can be done via Java serialization. But it 
> doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we 
> should provide save/load methods to every model. PMML is an option but it has 
> its limitations. There are a couple of things we need to discuss: 1) data format, 
> 2) how to preserve partitioning, 3) data compatibility between versions and 
> language APIs, etc.
> UPDATE: [Design doc for model import/export | 
> https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing]
> This document sketches machine learning model import/export plans, including 
> goals, an API, and development plans.
> The design doc proposes:
> * Support our own Spark-specific format.
> ** This is needed to (a) support distributed models and (b) get model 
> import/export support into Spark quickly (while avoiding new dependencies).
> * Also support PMML
> ** This is needed since it is the only thing approaching an industry standard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
