[ 
https://issues.apache.org/jira/browse/SPARK-43715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-43715:
-------------------------------
    Description: 
In the new distributed Spark ML module (designed to support Spark Connect and 
local inference), we need to save ML models to the Hadoop file system in a 
custom binary file format. The reasons are:
 * We often submit a Spark application to a Spark cluster to run the model 
training job, so the trained model must be saved to the Hadoop file system 
before the Spark application completes.
 * We also want to support local model inference. If the model is saved with 
the current Spark DataFrame writer (e.g. in Parquet format), loading it 
requires a Spark service. We want to load the model without one, so the model 
should be saved in its original binary format, which our ML code can handle.

The "binaryFile" format already has a reader API; we need to add a writer API:

{*}Writer API{*}:

Suppose we have a DataFrame with the schema

[file_path: String, content: binary].

We can save the DataFrame to a Hadoop path, writing each row as a separate 
file under that path. The saved file path is \{hadoop path}/\{file_path}, 
where "file_path" may be a multi-part path.

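As an illustration of the row-to-file mapping described above, here is a 
minimal sketch in plain Python. It is not the proposed Spark API: 
`save_binary_rows` is a hypothetical helper, the local file system stands in 
for a Hadoop path, and rejecting absolute or ".." paths is an added assumption 
rather than something the proposal specifies.

```python
import tempfile
from pathlib import Path, PurePosixPath


def save_binary_rows(rows, base_dir):
    """Write each (file_path, content) row to base_dir/file_path.

    Hypothetical helper: mimics the proposed writer semantics on a local
    directory instead of a Hadoop path.
    """
    base = Path(base_dir)
    written = []
    for file_path, content in rows:
        rel = PurePosixPath(file_path)
        # Assumption: refuse paths that could escape the target directory.
        if rel.is_absolute() or ".." in rel.parts:
            raise ValueError(f"unsafe file_path: {file_path!r}")
        target = base.joinpath(*rel.parts)
        # A multi-part file_path needs its intermediate directories.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        written.append(target)
    return written


if __name__ == "__main__":
    # Rows shaped like the [file_path: String, content: binary] schema.
    rows = [
        ("metadata.json", b"{}"),
        ("weights/layer0.bin", b"\x00\x01"),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        for path in save_binary_rows(rows, tmp):
            print(path.relative_to(tmp).as_posix(), path.stat().st_size)
```

Because each row becomes an ordinary file, the saved model can later be read 
back with plain file I/O, which is what makes loading without a Spark service 
possible.
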
  was:
In the new distributed Spark ML module (designed to support Spark Connect and 
local inference), we need to save ML models to the Hadoop file system in a 
custom binary file format. The reasons are:
 * We often submit a Spark application to a Spark cluster to run the model 
training job, so the trained model must be saved to the Hadoop file system 
before the Spark application completes.
 * We also want to support local model inference. If the model is saved with 
the current Spark DataFrame writer (e.g. in Parquet format), loading it 
requires a Spark service. We want to load the model without one, so the model 
should be saved in its original binary format, which our ML code can handle.

So we need to add a DataFrame reader/writer format that can load and save 
binary files. The API would look like this:

 

{*}Writer API{*}:

Suppose we have a DataFrame with the schema

[file_path: String, content: binary].

We can save the DataFrame to a Hadoop path, writing each row as a separate 
file under that path. The saved file path is \{hadoop path}/\{file_path}, 
where "file_path" may be a multi-part path.

 

{*}Reader API{*}:

`spark.read.format("binaryFileV2").load(...)`

 

It returns a Spark DataFrame in which each row contains the file path and the 
file content as a binary string.

 


> Add spark DataFrame binary file format writer
> ---------------------------------------------
>
>                 Key: SPARK-43715
>                 URL: https://issues.apache.org/jira/browse/SPARK-43715
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 3.5.0
>            Reporter: Weichen Xu
>            Assignee: Weichen Xu
>            Priority: Major


