Github user steveloughran commented on the pull request:
https://github.com/apache/spark/pull/11806#issuecomment-198322447
Summary: use an optimised storage format and dataframes, worry about
compression afterwards
1. you need to use a compression codec that lets you seek into part of the
file without reading everything before it; this is what's needed for parallel
access to data in different blocks. I don't remember which codecs are best for
this or the specific switch to turn it on. LZO? Snappy? (There's a small sketch
of the splittability effect after this list.)
1. Compression performance? With the native libs, decompression can be fast;
snappy has a good reputation there. Without the native libs things will be way
slower. With the right settings, the performance cost of decompression is less
than the wait time for the uncompressed data (that's on HDD; SSD changes the
rules again & someone needs to do the experiments).
1. Except for the specific case of the ingest phase, start by converting
your data into an optimised format: Parquet or ORC, which do things like
columnar storage and compression. This massively minimises IO when working with
a few columns (provided the format/API supports predicate pushdown...use the
dataframes API & not RDDs directly; there's a Parquet sketch below). I *assume*
you can compress these too, though with compression already happening on
columns, the gains would be less.
1. Ingest is a good streaming use case, if you haven't played with it
already; a minimal sketch of that is below too ...
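
On the splittability point in (1): a minimal sketch, assuming a spark-shell-style setup and illustrative HDFS paths, showing how a non-splittable codec (gzip) collapses a text file into a single partition while a splittable one (bzip2) lets tasks read different blocks in parallel.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative paths; the codec behaviour is the point, not the data.
val sc = new SparkContext(new SparkConf().setAppName("splittable-check"))

// gzip is not splittable: the whole file has to be read by a single task.
val gz = sc.textFile("hdfs:///data/events.log.gz")
println(s"gzip partitions:  ${gz.getNumPartitions}")   // typically 1

// bzip2 is splittable: separate tasks can decompress different blocks.
val bz2 = sc.textFile("hdfs:///data/events.log.bz2")
println(s"bzip2 partitions: ${bz2.getNumPartitions}")  // roughly file size / HDFS block size
```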
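And for (3): a sketch of the one-off conversion into Parquet, followed by a DataFrame query so column pruning and predicate pushdown can kick in. It reuses the `sc` above; the paths, column names and the JSON source are assumptions for illustration, and the codec is set through the standard `spark.sql.parquet.compression.codec` property.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = SQLContext.getOrCreate(sc)
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

// One-off conversion of the raw ingest data into a columnar, compressed format.
val raw = sqlContext.read.json("hdfs:///data/ingest/")
raw.write.parquet("hdfs:///data/events.parquet")

// Later jobs only touch the columns they ask for, and the filter can be
// pushed down into the Parquet reader instead of scanning everything.
val events = sqlContext.read.parquet("hdfs:///data/events.parquet")
events.filter(events("eventTime") > "2016-03-01")
      .select("userId", "eventTime")
      .show()
```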
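And for the streaming point in (4): a minimal DStream sketch, again reusing the assumed `sc`/`sqlContext`, that picks up new files from a landing directory and appends them to the same Parquet dataset. The batch interval, paths and JSON parsing are all illustrative.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60))

// Each micro-batch: parse whatever new files landed and append them to the dataset.
ssc.textFileStream("hdfs:///data/landing/").foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    sqlContext.read.json(rdd)
      .write.mode("append").parquet("hdfs:///data/events.parquet")
  }
}

ssc.start()
ssc.awaitTermination()
```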