[jira] [Updated] (SPARK-39994) How to write (save) PySpark dataframe containing vector column?

Hyukjin Kwon (Jira) Sun, 07 Aug 2022 21:54:33 -0700


     [ https://issues.apache.org/jira/browse/SPARK-39994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hyukjin Kwon updated SPARK-39994:
---------------------------------
    Target Version/s:   (was: 3.3.0)

> How to write (save) PySpark dataframe containing vector column?
> ---------------------------------------------------------------
>
>                 Key: SPARK-39994
>                 URL: https://issues.apache.org/jira/browse/SPARK-39994
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Muhammad Kaleem Ullah
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: df.PNG, error.PNG
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I'm trying to save a PySpark dataframe after transforming it with an ML 
> Pipeline, but an error is triggered every time I save it. Here are the 
> columns of this dataframe:
> |-- label: integer (nullable = true)
> |-- dest_index: double (nullable = false)
> |-- dest_fact: vector (nullable = true)
> |-- carrier_index: double (nullable = false)
> |-- carrier_fact: vector (nullable = true)
> |-- features: vector (nullable = true)
> And the following error occurs when trying to save this dataframe that 
> contains vector data:
> {code:python}
> training.write.parquet("training_files.parquet", mode="overwrite")
> {code}
> {noformat}
> Py4JJavaError: An error occurred while calling o440.parquet.
> : org.apache.spark.SparkException: Job aborted.
>     at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:278)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
>     at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
>     at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
>     at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
>     at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
>     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
>     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
>     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
>     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
>     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>     at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
> ...
> {noformat}
>  
> I tried several of the {{winutils}} builds for Hadoop available from [this 
> GitHub repository|https://github.com/cdarlint/winutils], but without much 
> luck. Please help me with this. How can I save this dataframe so that I can 
> read it from another Jupyter notebook? Feel free to ask any questions. 
> Thanks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)