[jira] [Updated] (SPARK-22925) ml model persistence creates a lot of small files

Felix Cheung (JIRA) Fri, 29 Dec 2017 12:33:47 -0800

     [ 
https://issues.apache.org/jira/browse/SPARK-22925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Felix Cheung updated SPARK-22925:
---------------------------------
    Description: 
Today, when model.save() is called, some ML models do makeRDD(data, 1) or 
repartition(1), but other models don't.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60
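
To make the contrast concrete, here is a minimal sketch of how the partition count chosen at save time drives the number of part files written. It is not the MLlib code linked above: the object name, the Node case class, the run helper and the output directory are all hypothetical, and it assumes Spark 2.x on the classpath with a local master.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for one row of serialized model data.
case class Node(treeId: Int, nodeId: Int, prediction: Double)

object SavePartitioningSketch {
  // Writes the same rows twice and returns
  // (partition count with numSlices = 1, partition count with the default).
  def run(spark: SparkSession, outDir: String): (Int, Int) = {
    val sc = spark.sparkContext
    val data = (0 until 1000).map(i => Node(i / 10, i % 10, i.toDouble))

    // Pattern 1: force a single partition, as makeRDD(data, 1) does.
    // The writer then has one task, so all rows end up in one part file.
    val single = sc.parallelize(data, 1)
    spark.createDataFrame(single)
      .write.mode("overwrite").parquet(s"$outDir/single")

    // Pattern 2: no numSlices argument, so parallelize falls back to
    // sc.defaultParallelism and the writer runs one task per partition.
    val defaulted = sc.parallelize(data)
    spark.createDataFrame(defaulted)
      .write.mode("overwrite").parquet(s"$outDir/default")

    (single.getNumPartitions, defaulted.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("save-partitioning-sketch")
      .getOrCreate()
    // Hypothetical output location: first argument, else a temp directory.
    val outDir = args.headOption
      .getOrElse(Files.createTempDirectory("save-sketch").toString)
    val (one, dflt) = run(spark, outDir)
    println(s"numSlices = 1: $one partition(s); default: $dflt partition(s)")
    spark.stop()
  }
}
```

On a large cluster the default can be in the hundreds or thousands, which is where the many-small-files behavior comes from.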

In the former case, issues such as SPARK-19294 have been reported because a 
single very large file is created.

Whereas in the latter case, models such as RandomForestModel could create 
hundreds or thousands of files, which is also unmanageable. Looking into this, 
there is no simple way to set/change spark.default.parallelism (which would be 
picked up by sc.parallelize) while the app is running, since SparkConf seems to 
be copied/cached by the backend without a way to update it.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135

It seems we need a way to make this parallelism settable on a per-use basis.
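
One possible shape for a per-use setting is sketched below, purely as an illustration and not as an existing or proposed Spark API. The saveRows helper, its numPartitions parameter and the ClusterCenter case class are hypothetical; the sketch only relies on sc.parallelize accepting an explicit numSlices and assumes Spark 2.x on the classpath.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical row type standing in for serialized model data.
case class ClusterCenter(id: Int, weight: Double)

object PerUsePartitionsSketch {
  // Hypothetical helper: the caller picks the partition count for this one
  // save; when none is given it falls back to sc.defaultParallelism.
  // Returns the number of partitions actually used.
  def saveRows(
      spark: SparkSession,
      rows: Seq[ClusterCenter],
      path: String,
      numPartitions: Option[Int] = None): Int = {
    val sc = spark.sparkContext
    val slices = numPartitions.getOrElse(sc.defaultParallelism)
    require(slices > 0, "numPartitions must be positive")
    val rdd = sc.parallelize(rows, slices)
    spark.createDataFrame(rdd).write.mode("overwrite").parquet(path)
    rdd.getNumPartitions
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("per-use-partitions-sketch")
      .getOrCreate()
    // Hypothetical output location: a fresh temp directory.
    val outDir = Files.createTempDirectory("per-use-sketch").toString
    val rows = (0 until 100).map(i => ClusterCenter(i, i * 0.5))

    // Small model: ask for two partitions for this save only.
    val used = saveRows(spark, rows, s"$outDir/small", Some(2))
    println(s"explicit request: $used partition(s)")

    // No request: behaves like the models that omit numSlices today.
    val fallback = saveRows(spark, rows, s"$outDir/fallback")
    println(s"fallback: $fallback partition(s)")
    spark.stop()
  }
}
```

Passing the count as an argument avoids touching SparkConf at runtime, so one application can save a small model to a few files and a large model to many.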


  was:
Today in when calling model.save(), some ML models we do makeRDD(data, 1) or 
repartition(1) but in some other models we don't.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60

In the former case issue such as SPARK-19294 has been reported for having very 
large single file.

Whereas in the latter case, model such as RandomForestModel could create 
hundreds or thousands of file which is also unmanageable. Looking into this, 
there is no simple way to set/change spark.default.parallelism while the app is 
running since SparkConf seems to be copied/cached by the backend without a way 
to update them.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135

It seems we need to have a way to make it settable on a per-use basis.



> ml model persistence creates a lot of small files
> -------------------------------------------------
>
>                 Key: SPARK-22925
>                 URL: https://issues.apache.org/jira/browse/SPARK-22925
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.1.2, 2.2.1, 2.3.0
>            Reporter: Felix Cheung


