Mark Waddle created SPARK-31177:
-----------------------------------

             Summary: DataFrameReader.csv incorrectly reads gzip encoded CSV 
from S3 when it has non-".gz" extension
                 Key: SPARK-31177
                 URL: https://issues.apache.org/jira/browse/SPARK-31177
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.4.4
            Reporter: Mark Waddle


I have large CSV files that are gzipped and uploaded to S3 with 
Content-Encoding=gzip. The files have the extension ".csv", since most web 
clients will automatically decompress them based on the Content-Encoding 
header. Reading these CSV files with pyspark does not mimic that behavior.

Works as expected:
{code:python}
df = spark.read.csv('s3://bucket/large.csv.gz', header=True)
{code}
Does not decompress, and tries to load the entire contents of the file as the first row:
{code:python}
df = spark.read.csv('s3://bucket/large.csv', header=True)
{code}
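In the meantime, a workaround that seems to work is to read the raw bytes and gunzip them by hand before handing the lines to the CSV reader. This is only a minimal sketch, assuming Python 3, Spark 2.2+ (where spark.read.csv accepts an RDD of strings), and objects small enough to decompress in memory on one executor; the bucket and key names are placeholders:
{code:python}
import gzip

# binaryFiles yields one (path, bytes) pair per object; gunzip each object
# manually, split it into lines, and let the CSV reader parse the RDD of
# strings (header=True still treats the first line as the header).
raw = spark.sparkContext.binaryFiles('s3://bucket/large.csv')
lines = raw.flatMap(lambda kv: gzip.decompress(kv[1]).decode('utf-8').splitlines())
df = spark.read.csv(lines, header=True)
{code}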

It looks like Spark is relying on the file extension to determine whether the 
file is gzip compressed. It would be great if S3 resources, and any other 
HTTP-based resources, could also consult the Content-Encoding response header.
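If the extension really is what drives codec selection, another stopgap is to copy the object to a key ending in ".gz" so the existing detection applies. A minimal sketch, assuming boto3 is available and using placeholder bucket/key names:
{code:python}
import boto3

# Copy the object to a ".gz" key so Spark's extension-based codec
# detection treats it as gzip, then read it through the new key.
s3 = boto3.client('s3')
s3.copy_object(Bucket='bucket',
               CopySource={'Bucket': 'bucket', 'Key': 'large.csv'},
               Key='large.csv.gz')

df = spark.read.csv('s3://bucket/large.csv.gz', header=True)
{code}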

I tried to find the code that determines this, but I'm not familiar with the 
code base. Any pointers would be helpful, and I can look into fixing it.


