[ https://issues.apache.org/jira/browse/SPARK-31177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17765139#comment-17765139 ]

Avi minsky commented on SPARK-31177:
------------------------------------

[~markwaddle], [~maropu] how was this resolved? 

> DataFrameReader.csv incorrectly reads gzip encoded CSV from S3 when it has 
> non-".gz" extension
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31177
>                 URL: https://issues.apache.org/jira/browse/SPARK-31177
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.4.4
>            Reporter: Mark Waddle
>            Priority: Major
>              Labels: bulk-closed
>
> I have large CSV files that are gzipped and uploaded to S3 with 
> Content-Encoding=gzip. The files have the file extension ".csv", since most web 
> clients will automatically decompress the file based on the Content-Encoding 
> header. Using pyspark to read these CSV files does not mimic this behavior.
> Works as expected:
> {code:python}
> df = spark.read.csv('s3://bucket/large.csv.gz', header=True)
> {code}
> Does not decompress, and tries to load the entire contents of the file as the 
> first row:
> {code:python}
> df = spark.read.csv('s3://bucket/large.csv', header=True)
> {code}
> It looks like Spark relies on the file extension to determine whether the file 
> is gzip compressed. It would be great if S3 resources, and any other HTTP-based 
> resources, could consult the Content-Encoding response header as well.
> I tried to find the code that determines this, but I'm not familiar with the 
> code base. Any pointers would be helpful, and I can look into fixing it.
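
For anyone hitting this: Spark's text-based readers pick a decompression codec from the file name suffix (via Hadoop's CompressionCodecFactory), so a gzipped object without a ".gz" extension is read as plain bytes. One workaround outside Spark is to sniff the gzip magic bytes and decompress before parsing. Below is a minimal plain-Python sketch, assuming the object bytes have already been fetched (e.g. with boto3's get_object); `decode_csv_bytes` and `GZIP_MAGIC` are illustrative names, not Spark or Hadoop APIs:

```python
import csv
import gzip
import io

# Every gzip stream starts with these two magic bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"

def decode_csv_bytes(raw: bytes):
    """Parse CSV bytes, decompressing first if the payload is gzip.

    Detection uses the payload's magic bytes, not the file extension,
    mirroring what a Content-Encoding-aware client would do.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return list(csv.reader(io.StringIO(raw.decode("utf-8"))))

# Example: a gzipped CSV payload whose "filename" lacks a .gz extension.
payload = gzip.compress(b"a,b\n1,2\n")
rows = decode_csv_bytes(payload)
```

The decompressed rows (or a temp file written from them) can then be handed to spark.read.csv as usual. This is only a client-side workaround; it does not change how Spark itself selects codecs.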



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
