[
https://issues.apache.org/jira/browse/SPARK-43715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weichen Xu reassigned SPARK-43715:
----------------------------------
Assignee: Weichen Xu
> Add spark DataFrame binary file reader / writer
> -----------------------------------------------
>
> Key: SPARK-43715
> URL: https://issues.apache.org/jira/browse/SPARK-43715
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 3.5.0
> Reporter: Weichen Xu
> Assignee: Weichen Xu
> Priority: Major
>
> In the new distributed Spark ML module (designed to support Spark Connect
> and local inference), we need to save ML models to the Hadoop file system
> in a custom binary file format, for the following reasons:
> * We usually submit a Spark application to a Spark cluster to run the model
> training job, and the trained model must be saved to the Hadoop file system
> before the Spark application completes.
> * We also want to support local model inference. If the model is saved with
> the current Spark DataFrame writers (e.g. Parquet format), loading it
> requires a Spark service, but we want to be able to load the model without
> one. So the model should be saved in its original binary format, which our
> ML code can handle directly.
> So we need to add a DataFrame reader / writer format that can load / save
> binary files. The API would look like:
>
> {*}Writer API{*}:
> Suppose we have a DataFrame with schema
> [file_path: String, content: binary].
> We can save the DataFrame to a Hadoop path; each row is saved as one file
> under that path, at {hadoop path}/{file_path}. "file_path" can be a
> multi-part path (i.e. it may contain subdirectories).
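>
> A minimal sketch of how the writer could be used from PySpark (the
> "binaryFileV2" format name is taken from the reader example below; the
> exact format and API names here are assumptions, not a finalized design):
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import BinaryType, StringType, StructField, StructType
>
> spark = SparkSession.builder.getOrCreate()
>
> # Schema expected by the writer: a relative file path plus raw file bytes.
> schema = StructType([
>     StructField("file_path", StringType()),  # may contain "/" (multi-part path)
>     StructField("content", BinaryType()),    # raw binary content of the file
> ])
>
> df = spark.createDataFrame(
>     [
>         ("model/metadata.json", bytearray(b'{"name": "my_model"}')),
>         ("model/data/weights.bin", bytearray(b"\x00\x01\x02\x03")),
>     ],
>     schema,
> )
>
> # Each row is written as one file at {hadoop path}/{file_path},
> # e.g. hdfs:///tmp/models/my_model/model/data/weights.bin
> df.write.format("binaryFileV2").save("hdfs:///tmp/models/my_model")
> {code}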
>
> {*}Reader API{*}:
> `spark.read.format("binaryFileV2").load(...)`
>
> It returns a Spark DataFrame in which each row contains the file path and
> the file content as binary.
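>
> A corresponding sketch for reading the files back (again assuming the
> "binaryFileV2" data source and the [file_path, content] schema described
> above):
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
>
> # Load every file under the directory; each row holds the file path and
> # the raw file bytes.
> df = spark.read.format("binaryFileV2").load("hdfs:///tmp/models/my_model")
>
> for row in df.collect():
>     print(row.file_path, len(row.content))
> {code}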
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]