[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305973#comment-14305973 ]
Sean Owen commented on SPARK-4587:
----------------------------------

Coming late to the discussion, with a few comments on the design doc:

- PMML is not a Zementis-led format, BTW.
- I don't find the verbosity of XML to be a big problem, with the possible exception of decision forests. It compresses well.
- Export to PMML is much easier than import, to the extent that I am not sure I would even bother with import in the medium term. So much would be unsupported that it would probably cause more confusion than help.
- jpmml-evaluator is not needed for model import/export; the part that's needed is BSD 3-clause.

This Parquet-based format is internal to Spark? Then why not just use the model's serialized form (modulo the issue of distributed models, of course)? If it's not internal, it looks like yet another format that is a subset of PMML, written differently, that nothing else will read. What does it add? Is its role really for serializing pipelines?

The unsupported model types are really an issue, though. You can make up your own serialization in an Extension element, which is perhaps better than conceiving a wholly separate format. I still imagine a huge factored matrix model can't reasonably be contained in any file format; I have resorted to just recording pointers to the location of the distributed data in a model file.

I suppose I'm worried about ending up with a bunch of half-finished imports/exports rather than focusing. IMHO top priority should be PMML export and then see how it goes.

> Model export/import
> -------------------
>
>                 Key: SPARK-4587
>                 URL: https://issues.apache.org/jira/browse/SPARK-4587
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Joseph K. Bradley
>            Priority: Critical
>
> This is an umbrella JIRA for one of the most requested features on the user
> mailing list. Model export/import can be done via Java serialization. But it
> doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we
> should provide save/load methods for every model. PMML is an option but it has
> its limitations. There are a couple of things we need to discuss: 1) data format,
> 2) how to preserve partitioning, 3) data compatibility between versions and
> language APIs, etc.
>
> UPDATE: [Design doc for model import/export | https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing]
> This document sketches machine learning model import/export plans, including
> goals, an API, and development plans.
>
> The design doc proposes:
> * Support our own Spark-specific format.
> ** This is needed to (a) support distributed models and (b) get model
> import/export support into Spark quickly (while avoiding new dependencies).
> * Also support PMML
> ** This is needed since it is the only thing approaching an industry standard.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
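
As an illustration of the Extension-element idea raised in the comment (recording pointers to distributed data in a model file instead of the data itself), here is a minimal Python sketch. It is not Spark's or JPMML's API. The function names, the `extender` value, the extension names (`rank`, `userFactorsPath`, `itemFactorsPath`), the example paths, and the choice of the PMML 4.2 namespace are all assumptions made for illustration, and the output has not been validated against the PMML schema.

```python
import xml.etree.ElementTree as ET

# Assumption: PMML 4.2 namespace; adjust for the PMML version actually targeted.
PMML_NS = "http://www.dmg.org/PMML-4_2"


def build_pointer_pmml(user_factors_path, item_factors_path, rank):
    """Build a PMML-like document whose Extension elements only point at
    the distributed factor matrices (hypothetical layout, not a standard)."""
    root = ET.Element("PMML", {"version": "4.2", "xmlns": PMML_NS})
    header = ET.SubElement(
        root, "Header", {"description": "Factored matrix model (pointers only)"}
    )
    # Each Extension carries one name/value pair; the names are made up here.
    for name, value in (
        ("rank", str(rank)),
        ("userFactorsPath", user_factors_path),
        ("itemFactorsPath", item_factors_path),
    ):
        ET.SubElement(
            header, "Extension",
            {"extender": "example", "name": name, "value": value},
        )
    # The model has no PMML-native fields, so the data dictionary is empty.
    ET.SubElement(root, "DataDictionary", {"numberOfFields": "0"})
    return root


def read_pointers(xml_bytes):
    """Parse the document back and return the Extension name/value pairs."""
    root = ET.fromstring(xml_bytes)
    tag = "{%s}Extension" % PMML_NS  # parsed tags are namespace-qualified
    return {ext.get("name"): ext.get("value") for ext in root.iter(tag)}


if __name__ == "__main__":
    doc = build_pointer_pmml(
        "hdfs:///models/als/userFactors",  # hypothetical location
        "hdfs:///models/als/itemFactors",  # hypothetical location
        rank=10,
    )
    xml_bytes = ET.tostring(doc)
    print(xml_bytes.decode("utf-8"))
    print(read_pointers(xml_bytes))
```

The model file stays small no matter how large the factor matrices are, at the cost that the file is only usable by a reader that understands these particular extensions and can reach the referenced locations.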