[
https://issues.apache.org/jira/browse/SPARK-31177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17765139#comment-17765139
]
Avi minsky commented on SPARK-31177:
------------------------------------
[~markwaddle], [~maropu] how was this resolved?
> DataFrameReader.csv incorrectly reads gzip encoded CSV from S3 when it has
> non-".gz" extension
> ----------------------------------------------------------------------------------------------
>
> Key: SPARK-31177
> URL: https://issues.apache.org/jira/browse/SPARK-31177
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 2.4.4
> Reporter: Mark Waddle
> Priority: Major
> Labels: bulk-closed
>
> I have large CSV files that are gzipped and uploaded to S3 with
> Content-Encoding=gzip. The files have the extension ".csv", because most web
> clients automatically decompress the file based on the Content-Encoding
> header. Using pyspark to read these CSV files does not mimic this behavior.
> Works as expected:
> {code:java}
> df = spark.read.csv('s3://bucket/large.csv.gz', header=True)
> {code}
> Does not decompress, and tries to load the entire contents of the file as the
> first row:
> {code:java}
> df = spark.read.csv('s3://bucket/large.csv', header=True)
> {code}
> It looks like Spark relies on the file extension to determine whether the
> file is gzip compressed. It would be great if S3 resources, and any other
> HTTP-based resources, could consult the Content-Encoding response header as
> well. I tried to find the code that determines this, but I'm not familiar
> with the code base. Any pointers would be helpful, and I can look into
> fixing it.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]