[ 
https://issues.apache.org/jira/browse/SPARK-43715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-43715:
-------------------------------
    Description: 
In the new distributed Spark ML module (designed to support Spark Connect and local inference),
we need to save ML models to the Hadoop file system using a custom binary file format,
for two reasons:
 * The model training job is a Spark job, so we need to save the trained model to the
Hadoop file system after the job completes.
 * We also want to support local model inference. If we save the model with the current
Spark DataFrame writers (e.g. the Parquet format), then loading the model requires a
Spark service. We want to be able to load the model without a Spark service, so the
model should be saved in the original binary format that our ML code can handle.

So we need to add a DataFrame reader / writer format that can load / save binary
files. The API is like:

 

{*}Writer API{*}:

Supposing we have a dataframe with schema:

[file_path: String, content: binary]

we can save the dataframe to a Hadoop path. Each row is saved as one file under that
path; the saved file path is \{hadoop path}/\{file_path}, and "file_path" may be a
multi-part (nested) path.

 

{*}Reader API{*}:

`spark.read.format("binaryFileV2").load(...)`

 

It will return a Spark dataframe in which each row contains the file path and the
file content as binary.
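The reader side is the inverse: enumerate the files under the load path and produce one (file_path, content) row per file. Again a stdlib sketch of the intended semantics, not the Spark implementation; `read_binary_files` is a hypothetical helper name:

```python
import os
import tempfile

def read_binary_files(base_path):
    # Mimics the proposed reader semantics: one (file_path, content) row
    # per file under base_path, with file_path relative to base_path.
    rows = []
    for root, _, files in os.walk(base_path):
        for name in files:
            full_path = os.path.join(root, name)
            with open(full_path, "rb") as f:
                rows.append((os.path.relpath(full_path, base_path), f.read()))
    return sorted(rows)

# Demo: lay out two files and read them back as rows.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "model/data"))
with open(os.path.join(base, "model/data/part-0.bin"), "wb") as f:
    f.write(b"\x00\x01")
with open(os.path.join(base, "model/metadata.json"), "wb") as f:
    f.write(b"{}")

rows = read_binary_files(base)
```
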

 

> Add spark DataFrame binary file reader / writer
> -----------------------------------------------
>
>                 Key: SPARK-43715
>                 URL: https://issues.apache.org/jira/browse/SPARK-43715
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 3.5.0
>            Reporter: Weichen Xu
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
