YARN preemption and partial Spark S3 writes

Jonathan Bender Wed, 19 Oct 2022 10:12:36 -0700

Hello,

We're increasing our use of YARN preemption on our Hadoop clusters, and
we've noticed a significant uptick in orphaned data (i.e., data files that
aren't referenced by an Iceberg table but were written by Spark executors
in the same app). We suspect the cause is partially written but uncommitted
data: files whose write never gets reported to the driver before the
container is preempted.


We'll continue to investigate on our side, but I wanted to confirm what the
best option is here, and whether this is expected behavior at all. Should we
be using the S3A magic committer to stage data as multipart uploads before
the commit? Or is this handled by some other Iceberg construct?

We're on a slight fork of Iceberg 0.12, with Spark 2.4 and 3.1. Thanks in advance!

Jon
