[
https://issues.apache.org/jira/browse/SPARK-39910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572454#comment-17572454
]
Christophe Préaud commented on SPARK-39910:
-------------------------------------------
This is just a straightforward example to demonstrate the issue; the main
problem is that none of the file formats natively supported by DataFrameReader
(Parquet, JSON, CSV, ...) can be read when they are contained in a Hadoop
archive (.har).
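For example, assuming a CSV file and a Parquet file stored inside a Hadoop archive (the paths below are hypothetical, not taken from the original report), the equivalent DataFrameReader calls would be expected to come back empty as well:
{code:java}
// Hypothetical .har paths, shown only to illustrate the scope of the problem
val csvDf = spark.read.option("header", "true").csv("har:///user/preaudc/data/example.har/data.csv")
csvDf.count       // expected: the row count of data.csv; per this report it comes back as 0

val parquetDf = spark.read.parquet("har:///user/preaudc/data/example.har/data.parquet")
parquetDf.count   // likewise empty instead of the actual contents
{code}
For line-based text formats such as CSV and JSON, a possible workaround (a sketch only, using the same hypothetical path) is to read the raw lines through the RDD API, which resolves har:// correctly, and then parse them with the Dataset[String] overloads of DataFrameReader:
{code:java}
// Workaround sketch for line-based formats only (CSV, JSON):
// read the raw lines via the RDD API, then hand them to DataFrameReader.csv
import spark.implicits._

val lines = sc.textFile("har:///user/preaudc/data/example.har/data.csv").toDS()
val csvDf = spark.read.option("header", "true").csv(lines)
{code}
This does not help for binary formats such as Parquet, however.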
> DataFrameReader API cannot read files from hadoop archives (.har)
> -----------------------------------------------------------------
>
> Key: SPARK-39910
> URL: https://issues.apache.org/jira/browse/SPARK-39910
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2
> Reporter: Christophe Préaud
> Priority: Minor
> Labels: DataFrameReader
>
> Reading a file from a Hadoop archive using the DataFrameReader API returns
> an empty Dataset:
> {code:java}
> scala> val df = spark.read.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719")
> df: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> df.count
> res7: Long = 0
> {code}
>
> On the other hand, reading the same file from the same Hadoop archive using
> the RDD API yields the correct result:
> {code:java}
> scala> val df = sc.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719").toDF("value")
> df: org.apache.spark.sql.DataFrame = [value: string]
> scala> df.count
> res8: Long = 5589
> {code}