[
https://issues.apache.org/jira/browse/SPARK-39910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572454#comment-17572454
]
Christophe Préaud commented on SPARK-39910:
-------------------------------------------
This is just a straightforward example to demonstrate the issue; the main
problem is that none of the file formats natively supported by DataFrameReader
(Parquet, JSON, CSV, ...) can be read when they are contained in a Hadoop
archive (.har).
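For example, assuming a CSV file and a Parquet file stored inside a Hadoop archive (the paths below are hypothetical, not taken from the original report), the equivalent DataFrameReader calls would be expected to come back empty as well:
{code:java}
// Hypothetical .har paths, shown only to illustrate the scope of the problem
val csvDf = spark.read.option("header", "true").csv("har:///user/preaudc/data/example.har/data.csv")
csvDf.count       // expected: the row count of data.csv; per this report it comes back as 0

val parquetDf = spark.read.parquet("har:///user/preaudc/data/example.har/data.parquet")
parquetDf.count   // likewise empty instead of the actual contents
{code}
For line-based text formats such as CSV and JSON, a possible workaround (a sketch only, using the same hypothetical path) is to read the raw lines through the RDD API, which resolves har:// correctly, and then parse them with the Dataset[String] overloads of DataFrameReader:
{code:java}
// Workaround sketch for line-based formats only (CSV, JSON):
// read the raw lines via the RDD API, then hand them to DataFrameReader.csv
import spark.implicits._

val lines = sc.textFile("har:///user/preaudc/data/example.har/data.csv").toDS()
val csvDf = spark.read.option("header", "true").csv(lines)
{code}
This does not help for binary formats such as Parquet, however.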
> DataFrameReader API cannot read files from hadoop archives (.har)
> -----------------------------------------------------------------
>
> Key: SPARK-39910
> URL: https://issues.apache.org/jira/browse/SPARK-39910
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2
> Reporter: Christophe Préaud
> Priority: Minor
> Labels: DataFrameReader
>
> Reading a file from a Hadoop archive using the DataFrameReader API returns
> an empty Dataset:
> {code:java}
> scala> val df = spark.read.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719")
> df: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> df.count
> res7: Long = 0
> {code}
>
> On the other hand, reading the same file from the same Hadoop archive using
> the RDD API yields the correct result:
> {code:java}
> scala> val df = sc.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719").toDF("value")
> df: org.apache.spark.sql.DataFrame = [value: string]
> scala> df.count
> res8: Long = 5589
> {code}