Felix Cheung created SPARK-22925:
------------------------------------
Summary: ml model persistence creates a lot of small files
Key: SPARK-22925
URL: https://issues.apache.org/jira/browse/SPARK-22925
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 2.2.1, 2.1.2, 2.3.0
Reporter: Felix Cheung
Today, when model.save() is called, some ML models use makeRDD(data, 1) or
repartition(1) before writing, while other models do not.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60
In the former case, issues such as SPARK-19294 have been reported about the
single output file being very large.
In the latter case, a model such as RandomForestModel can create hundreds or
thousands of files, which is also unmanageable. Looking into this, there is no
simple way to set or change spark.default.parallelism while the app is running,
since SparkConf seems to be copied/cached by the backend with no way to update
it.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135
It seems we need a way to make the save parallelism settable on a per-use
basis.
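
To make the idea concrete, here is a rough sketch of what a per-call setting
could look like. It is not an existing Spark API: ModelSaveSketch,
withNumFiles, writeModelData and the numFiles parameter are hypothetical names
made up for illustration. Only parallelize, toDF, repartition and write.parquet
are real Spark calls. The 200 slices stand in for a large
spark.default.parallelism.

```scala
import java.io.File
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object ModelSaveSketch {
  // Hypothetical helper (not part of Spark): repartition only when the caller
  // asks for a specific number of output files.
  def withNumFiles(data: DataFrame, numFiles: Option[Int]): DataFrame =
    numFiles match {
      case Some(n) =>
        require(n > 0, "numFiles must be positive")
        data.repartition(n)
      case None => data // keep whatever partitioning the data already has
    }

  // Hypothetical helper: write model data as Parquet with an optional
  // per-call file count.
  def writeModelData(data: DataFrame, path: String, numFiles: Option[Int] = None): Unit =
    withNumFiles(data, numFiles).write.parquet(path)

  // Count the part files Spark wrote into an output directory.
  def countPartFiles(path: String): Int =
    Option(new File(path).listFiles()).map(_.count(_.getName.startsWith("part-"))).getOrElse(0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("model-save-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for per-node tree data, spread over many partitions.
    val nodes = spark.sparkContext
      .parallelize(1 to 1000, 200)
      .map(i => (i % 10, i))
      .toDF("treeId", "nodeId")

    val base = Files.createTempDirectory("model-sketch").toString

    writeModelData(nodes, s"$base/default")                   // one task per partition
    writeModelData(nodes, s"$base/four", numFiles = Some(4))  // caller-chosen count

    println(s"default: ${countPartFiles(s"$base/default")} part files")
    println(s"numFiles=4: ${countPartFiles(s"$base/four")} part files")

    spark.stop()
  }
}
```

Passing Some(1) would reproduce the makeRDD(data, 1) / repartition(1)
behaviour, and None would keep the current behaviour of the models that do not
repartition, so one optional parameter could cover both cases.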