[ 
https://issues.apache.org/jira/browse/SPARK-50476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903772#comment-17903772
 ] 

Gengliang Wang edited comment on SPARK-50476 at 12/7/24 12:10 AM:
------------------------------------------------------------------

Hi [~mmelnyk] 

Thanks for reporting this issue!

Since this pertains to Databricks Runtime rather than the Apache Spark™ 
community version, it would be best to create a support ticket with Databricks 
Support for such cases. This ensures the right team can assist you promptly.

Regarding the error, the {{PartitionedFile}} class you are using is an internal 
Spark class, which means it isn't guaranteed to remain backward compatible 
between Spark releases or between Spark and Databricks Runtime. That said, 
we've looked into this issue and made a fix. The fix will be available in 
Databricks Runtime (version 15.4 and above) around {*}January 20, 2025{*}.
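For library authors hitting this class of error in general, one defensive pattern (a rough sketch only, not the actual fix shipped in Databricks Runtime) is to resolve the constructor reflectively at runtime rather than linking against one fixed signature at compile time. {{FileSlice}} below is a hypothetical stand-in for an internal class such as {{PartitionedFile}}:

```scala
// Sketch: resolving a constructor reflectively so calling code can
// survive a constructor-signature change between runtimes.
// `FileSlice` is a hypothetical stand-in for an internal class like
// PartitionedFile; the real fix in Databricks Runtime may differ.
case class FileSlice(path: String, start: Long, length: Long)

object ReflectiveCtor {
  // Pick a public constructor by arity at runtime instead of binding to
  // one fixed signature at compile time -- a NoSuchMethodError on <init>
  // means the compile-time signature no longer exists at runtime.
  def newInstance[T](cls: Class[T], args: AnyRef*): T = {
    val ctor = cls.getConstructors
      .find(_.getParameterCount == args.length)
      .getOrElse(throw new NoSuchMethodException(
        s"${cls.getName}: no ${args.length}-arg public constructor"))
    // Boxed primitives (java.lang.Long etc.) are auto-unboxed here.
    ctor.newInstance(args: _*).asInstanceOf[T]
  }
}
```

A call site would then pass boxed arguments, e.g. {{ReflectiveCtor.newInstance(classOf[FileSlice], "part-0000.pdf", java.lang.Long.valueOf(0L), java.lang.Long.valueOf(1024L))}}. This trades compile-time safety for runtime flexibility, so it is only worth it for internal classes that are known to change between releases.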

Please give it a try around that time and let us know if the issue is resolved.

Thank you!


was (Author: gengliang.wang):
Hi [~mmelnyk] 

Thanks for reporting this issue!

Since this pertains to Databricks Runtime rather than the Apache Spark™ 
community version, it would be best to create a support ticket with Databricks 
Support for such cases. This ensures the right team can assist you promptly.

Regarding the error, the {{PartitionedFile}} class you are using is an internal 
Spark class, which means it isn't guaranteed to remain backward compatible 
between Spark releases or between Spark and Databricks Runtime. That said, 
we've looked into this issue and made a fix. The fix will be available in 
Databricks Runtime around {*}January 20, 2025{*}.

Please give it a try around that time and let us know if the issue is resolved.

Thank you!

> Unable to run custom PDF Data Source on Databricks
> --------------------------------------------------
>
>                 Key: SPARK-50476
>                 URL: https://issues.apache.org/jira/browse/SPARK-50476
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>         Environment: Databricks Runtime: 14.3 LTS (includes Apache Spark 
> 3.5.0, Scala 2.12)
>            Reporter: Mykola Melnyk
>            Priority: Minor
>         Attachments: PdfDataSourceDatabricks.ipynb, traceback.txt
>
>
> Experienced an error when running a custom PDF DataSource on Databricks 
> Runtime 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12).
> The PDF DataSource works fine on the community version of Spark 3.5.0, but 
> fails on Databricks:
> Log:
> Py4JJavaError: An error occurred while calling o428.showString.
> : java.lang.NoSuchMethodError: 
> org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/paths/SparkPath;JJ[Ljava/lang/String;JJLscala/collection/immutable/Map;)V
> at 
> com.stabrise.sparkpdf.datasources.PdfPartitionedFileUtil$.$anonfun$splitFiles$1(PdfPartitionedFileUtil.scala:32)
> at 
> com.stabrise.sparkpdf.datasources.PdfPartitionedFileUtil$.$anonfun$splitFiles$1$adapted(PdfPartitionedFileUtil.scala:28)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
> [Source Code of the PDF 
> DataSource|https://github.com/StabRise/spark-pdf/tree/spark_3_5]
> [Source code of PdfPartitionedFileUtil.scala 
> |https://github.com/StabRise/spark-pdf/blob/spark_3_5/src/main/scala/datasources/PdfPartitionedFileUtil.scala#L32]
> PDF DataSource jar file:  
> [https://github.com/StabRise/spark-pdf/releases/download/0.1.12_spark_3_5/spark-pdf-0.1.12.jar]
> Notebook with full example and traceback: 
> [https://github.com/StabRise/spark-pdf/blob/spark_3_5/examples/PdfDataSourceDatabricks.ipynb]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
