[ 
https://issues.apache.org/jira/browse/SPARK-43790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17727052#comment-17727052
 ] 

ASF GitHub Bot commented on SPARK-43790:
----------------------------------------

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/41357

> Add API `copyLocalFileToHadoopFS`
> ---------------------------------
>
>                 Key: SPARK-43790
>                 URL: https://issues.apache.org/jira/browse/SPARK-43790
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, ML, PySpark
>    Affects Versions: 3.5.0
>            Reporter: Weichen Xu
>            Assignee: Weichen Xu
>            Priority: Major
>
> In the new distributed Spark ML module (designed to support Spark Connect 
> and local inference), we need to save ML models to the Hadoop file system 
> using a custom binary file format, for two reasons:
>  * We often submit a Spark application to a Spark cluster to run the model 
> training job, and we need to save the trained model to the Hadoop file 
> system before the Spark application completes.
>  * We also want to support local model inference. If we save the model with 
> the current Spark DataFrame writer (e.g. Parquet format), loading the model 
> requires a running Spark service. Since we want to be able to load the model 
> without a Spark service, the model must be saved in the original binary 
> format that our ML code can handle directly.
>  
> So we need to add an API like `copyLocalFileToHadoopFS`.
> The implementation of `copyLocalFileToHadoopFS` could be:
>  
> (1) Call the `add_artifact` API to upload the local file to the Spark driver 
> (Spark Connect already supports this).
> (2) Implement a PySpark (Spark Connect client) API, 
> `copy_artifact_to_hadoop_fs`, which sends a command to the Spark driver 
> requesting that the artifact file be uploaded to the Hadoop FS. *We need to 
> design a Spark Connect protobuf command message for this part.* On the 
> driver side, when the Spark Connect server receives the request, it gets 
> `sparkContext.hadoopConf` and then uses the Hadoop FileSystem API to upload 
> the file to the Hadoop FS.
> (3) Call the `copy_artifact_to_hadoop_fs` API to upload the artifact file to 
> the Hadoop FS.
>  
>  
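The three steps quoted above can be sketched end to end in Python. This is an illustration only, not the PR's implementation: `FakeConnectSession` is a hypothetical stand-in for a Spark Connect client session, the Hadoop FS is simulated with a dict, and `add_artifact` / `copy_artifact_to_hadoop_fs` here are stubs modeling the proposed APIs, not real pyspark calls.

```python
# Sketch of the proposed copyLocalFileToHadoopFS flow. All names are
# hypothetical stand-ins; the real implementation would go through a new
# Spark Connect protobuf command and the Hadoop FileSystem API.
import os
import shutil
import tempfile


class FakeConnectSession:
    """Stand-in for a Spark Connect client session (illustration only)."""

    def __init__(self):
        self.driver_artifacts = {}           # artifact name -> staged file path
        self.hadoop_fs = {}                  # simulated Hadoop FS: path -> bytes
        self._staging = tempfile.mkdtemp()   # simulated driver artifact dir

    def add_artifact(self, local_path):
        # Step (1): upload the local file to the driver's artifact staging
        # area (Spark Connect already supports artifact upload).
        name = os.path.basename(local_path)
        staged = os.path.join(self._staging, name)
        shutil.copy(local_path, staged)
        self.driver_artifacts[name] = staged
        return name

    def copy_artifact_to_hadoop_fs(self, artifact_name, dest_path):
        # Step (2): in a real implementation this sends a (new) protobuf
        # command to the driver; the Spark Connect server would then read
        # sparkContext.hadoopConf and use the Hadoop FileSystem API to write
        # dest_path. Here the Hadoop FS is simulated with a dict.
        with open(self.driver_artifacts[artifact_name], "rb") as f:
            self.hadoop_fs[dest_path] = f.read()


def copy_local_file_to_hadoop_fs(session, local_path, dest_path):
    # Step (3): the user-facing API chains the two calls above.
    artifact = session.add_artifact(local_path)
    session.copy_artifact_to_hadoop_fs(artifact, dest_path)
```

For example, saving a locally written model file to a (simulated) HDFS path would look like `copy_local_file_to_hadoop_fs(session, "/tmp/model.bin", "hdfs://namenode/models/model.bin")`.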



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
