Hi,

From my experience, and from all the benchmarks I have run and read, Snappy
produces much larger files than zstd, while CPU usage is similar for both -
in most cases not really noticeable.

We switched to ZSTD and our CPU usage did not increase noticeably (maybe an
increase of 1-2%, if at all), while file sizes dropped by ~35%.
It depends on the data you compress and the hardware you use, so there is no
real alternative to trial and error, but for us I can say ZSTD saved a lot
of money...
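
The switch itself is just the standard Spark/Parquet codec setting (a minimal
sketch; the write path below is a placeholder, and any zstd-level tuning
beyond the defaults is not shown):

    // Session-wide default for all Parquet writes:
    spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

    // Or per write, overriding the session default:
    df.write.option("compression", "zstd").parquet("s3://my-bucket/events/")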

Most of the benefit in terms of speed will indeed come from skipping data you
don't need to read - and the best way to achieve that is not by using Parquet
directly, but by using open table formats such as Iceberg and Delta. For
instance, in Delta you can collect statistics on the columns you filter on
most, and that way you get file skipping for files that are not relevant to
your specific query, on top of partition pruning, and read much less data.
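
As a rough illustration (the table name and filter column below are
placeholders; delta.dataSkippingNumIndexedCols is a standard Delta table
property, but the value shown is just an example):

    // Delta collects per-file min/max statistics on the first N columns
    // (32 by default); keep the columns you filter on most within that range,
    // or adjust N to cover them.
    spark.sql("""
      ALTER TABLE events
      SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '8')
    """)

    // A query filtering on a covered column can then skip files whose
    // min/max range excludes the predicate, on top of partition pruning.
    spark.sql("SELECT count(*) FROM events WHERE event_date = '2025-08-26'").show()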

HTH,
Nimrod

On Tue, Aug 26, 2025, 22:38 Nikolas Vanderhoof <
nikolasrvanderh...@gmail.com> wrote:

> Thank you for the detailed response. This is helpful. I’ll read your
> article, and test my data as you’ve described.
>
> On Tue, Aug 26, 2025 at 3:05 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Hi Nikolas,
>>
>> *Why Spark defaults to Snappy for Parquet.* In analytics scans the
>> bottleneck is usually *CPU to decompress Parquet pages*, not raw I/O.
>> Snappy gives *very fast decode* at a decent ratio, so end-to-end query
>> latency is typically better than heavier codecs like GZIP. For colder data,
>> GZIP (or ZSTD) can make sense if you’re chasing storage savings and can
>> afford slower reads.
>>
>> Two different codec decisions to make:
>>
>>    1. Intermediates (shuffle/spill/broadcast): speed > ratio.
>>       I keep fast codecs here; changing them rarely helps unless the
>>       network/disk is the bottleneck and I have spare CPU:
>>
>>       spark.conf.set("spark.shuffle.compress", "true")
>>       spark.conf.set("spark.shuffle.spill.compress", "true")
>>       spark.conf.set("spark.io.compression.codec", "lz4")   // snappy or zstd are also viable
>>
>>    2. Storage at rest (final Parquet in the lake/lakehouse): pick by hot vs cold.
>>
>>       - *Hot / frequently scanned:* *Snappy* for fastest reads.
>>       - *Cold / archival:* *GZIP* (or try *ZSTD*) for much smaller files;
>>         accept slower scans.
>>
>>       spark.conf.set("spark.sql.parquet.compression.codec", "snappy")   // or "gzip" or "zstd"
>>
>>
>> This mirrors what I wrote up for *BigQuery external Parquet on object
>> storage* (attached; different engine, same storage trade-off): I used
>> *Parquet + GZIP* when exporting to Cloud Storage (great size reduction) and
>> noted that *external tables read slower than native*, so I keep hot data
>> “native” and push colder tiers to cheaper storage with heavier compression.
>> In that piece, a toy query ran ~*190 ms* on native vs ~*296 ms* on the
>> external table (≈56% slower), which is the kind of latency gap you trade
>> for cost/footprint savings on colder data.
>>
>> *Bigger levers than the codec*
>> The codec choice matters, but *reading fewer bytes* matters more! In my
>> article I lean heavily on *Hive-style partition layouts* for external
>> Parquet (multiple partition keys, strict directory order), and call out
>> gotchas like keeping *non-Parquet junk out of leaf directories* (external
>> table creation/reads can fail or slow down if the layout is messy).
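>>
>> A minimal sketch of that kind of layout (the DataFrame, bucket path, and
>> partition columns below are placeholders, not from my article):
>>
>>    // Hive-style layout: .../event_date=2025-08-26/country=GB/part-*.parquet
>>    // lets readers prune whole directories before opening any Parquet footer.
>>    df.write
>>      .mode("overwrite")
>>      .partitionBy("event_date", "country")   // keep a strict, consistent key order
>>      .option("compression", "zstd")          // codec choice is orthogonal to the layout
>>      .parquet("gs://my-bucket/curated/events/")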
>>
>> How I would benchmark on your data
>> Write the same dataset three ways (snappy, gzip, zstd), then measure:
>>
>>    - total bytes on storage,
>>    - Spark SQL *scan time* and *CPU time* in the UI,
>>    - the effect of *partition pruning* with realistic filters.
>>
>> Keep the shuffle settings fast (above) so you’re testing scan costs, not an
>> artificially slow shuffle; a short sketch of the three-way write follows
>> below.
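>>
>> A minimal sketch of that three-way write, assuming df is the dataset under
>> test and the output prefix below is a placeholder, not a real path:
>>
>>    for (codec <- Seq("snappy", "gzip", "zstd")) {
>>      df.write
>>        .mode("overwrite")
>>        .option("compression", codec)
>>        .parquet(s"s3://my-bucket/bench/parquet_$codec/")
>>    }
>>    // Compare total bytes under each prefix, then run the same realistic
>>    // queries against each copy and read scan/CPU time off the Spark UI.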
>>
>> My rules of thumb
>>
>>    - If *latency* and interactive work matter → *Snappy* Parquet.
>>    - If *storage $$* dominates and reads are rare → *GZIP* (or *ZSTD* as a
>>      middle ground).
>>    - Regardless of codec, *partition pruning + sane file sizes* move the
>>      needle the most (that’s the core of my “Hybrid Curated Storage” approach).
>>
>> HTH
>>
>> Regards
>> Dr Mich Talebzadeh,
>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>
>> (P.S. The background and examples I referenced are from my article on
>> using *GCS external Parquet* with *Snappy/GZIP/ZSTD* and Hive
>> partitioning for cost/perf balance—feel free to skim the compression/export
>> and partitioning sections.)
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> On Tue, 26 Aug 2025 at 17:59, Nikolas Vanderhoof <
>> nikolasrvanderh...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Why does Spark use Snappy by default when compressing data within
>>> Parquet? I’ve read that when shuffling, speed is prioritized above
>>> compression ratio. Is that true, and are there other things to consider?
>>>
>>> Also, are there any recent benchmarks that the community has performed
>>> that evaluate the performance of Spark when using Snappy compared to other
>>> codecs? I’d be interested not only in the impact when using other codecs
>>> for the intermediate and shuffle files, but also for the storage at rest.
>>> For example, I know there are different configuration options that allow me
>>> to set the codec for these internal files, or for the final parquet files
>>> stored in the lakehouse.
>>>
>>> Before I decide to use a codec other than the default in my work, I want
>>> to understand any tradeoffs better.
>>>
>>> Thanks,
>>> Nik
>>>
>>
