t oo created SPARK-30251:
----------------------------

             Summary: faster way to read csv.gz?
                 Key: SPARK-30251
                 URL: https://issues.apache.org/jira/browse/SPARK-30251
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
    Affects Versions: 2.4.4
            Reporter: t oo


Some data providers deliver files as csv.gz (e.g. 1 GB compressed / 25 GB uncompressed; 5 GB compressed / 130 GB uncompressed; 0.1 GB compressed / 2.5 GB uncompressed). When I tell my boss that Spark, the famous big data tool, takes 16 hours to convert the 1 GB compressed file to Parquet, I get a look of shock. This is batch data we receive daily: roughly 80 GB compressed (2 TB uncompressed) per day, spread across ~300 files.

I know gzip is not splittable, so each file is currently read by a single worker. We don't have the space (or the patience) to pre-convert to bz2 or to uncompressed CSV. Could Spark support a better, splittable gzip codec? I have seen posts claiming that even plain Python is faster than Spark at this; see the links below and the sketch after them.

[https://stackoverflow.com/questions/40492967/dealing-with-a-large-gzipped-file-in-spark]

[https://github.com/nielsbasjes/splittablegzip]
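For illustration, here is a minimal sketch of how the splittablegzip codec linked above could be wired into a Spark job. This is not something Spark ships: the Maven coordinates (nl.basjes.hadoop:splittablegzip), the codec class name, and the paths are taken from that project's documentation or are placeholders.

{code:scala}
// Sketch only: relies on the third-party codec from
// https://github.com/nielsbasjes/splittablegzip being on the classpath,
// e.g. spark-submit --packages nl.basjes.hadoop:splittablegzip:<version>
import org.apache.spark.sql.SparkSession

object CsvGzToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv.gz to parquet")
      // Register the splittable gzip codec with the underlying Hadoop conf,
      // so one .gz file can be read by many tasks instead of a single worker.
      .config("spark.hadoop.io.compression.codecs",
              "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
      .getOrCreate()

    // Input/output paths are placeholders.
    val df = spark.read
      .option("header", "true")
      .csv("s3://my-bucket/daily/*.csv.gz")

    df.write.mode("overwrite").parquet("s3://my-bucket/daily-parquet/")
  }
}
{code}

Note that, as I understand that project's README, each split still decompresses the stream from the start of the file up to its own offset and discards it, so total CPU goes up; the win is wall-clock time through parallelism.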
