t oo created SPARK-30251:
----------------------------
Summary: faster way to read csv.gz?
Key: SPARK-30251
URL: https://issues.apache.org/jira/browse/SPARK-30251
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 2.4.4
Reporter: t oo
Some data providers deliver files as csv.gz (e.g. 1 GB compressed that is 25 GB
uncompressed; 5 GB compressed that is 130 GB uncompressed; 0.1 GB compressed
that is 2.5 GB uncompressed). When I tell my boss that the famous big data tool
Spark takes 16 hours to convert the 1 GB compressed file into Parquet, I get a look
of shock. This is batch data we receive daily: roughly 80 GB compressed (2 TB
uncompressed) every day, spread across ~300 files.
I know gzip is not splittable, so each file is currently read by a single worker.
But we don't have the disk space (or the patience) to pre-convert everything to
bz2 or to uncompressed files. Could Spark ship a better codec? I have seen posts
saying even plain Python is faster than Spark here, and there is an existing
splittable-gzip codec for Hadoop (a usage sketch follows the links):
[https://stackoverflow.com/questions/40492967/dealing-with-a-large-gzipped-file-in-spark]
[https://github.com/nielsbasjes/splittablegzip]
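
As an illustration, a minimal sketch of how the codec from the second link could
be wired into a Spark job, assuming the nl.basjes.hadoop:splittablegzip artifact
is on the classpath and its codec class is
nl.basjes.hadoop.io.compress.SplittableGzipCodec (per that project's README);
the version number below is illustrative and I have not verified this against
Spark 2.4.4:

  // Launch with the codec on the classpath, e.g.:
  //   spark-shell --packages nl.basjes.hadoop:splittablegzip:1.2
  // (coordinates per https://github.com/nielsbasjes/splittablegzip)

  // Register the codec with the underlying Hadoop input format so a
  // single .gz file can be read by several tasks in parallel.
  spark.sparkContext.hadoopConfiguration.set(
    "io.compression.codecs",
    "nl.basjes.hadoop.io.compress.SplittableGzipCodec")

  // Read the gzipped CSV and write it back out as Parquet. Per the codec's
  // README, each task still decompresses from the start of the file up to
  // its own split, so the speedup is sub-linear, but the single-task
  // bottleneck goes away.
  val df = spark.read
    .option("header", "true")   // adjust to the actual file layout
    .csv("s3a://bucket/daily/*.csv.gz")

  df.write.parquet("s3a://bucket/daily-parquet/")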
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]