Jon, Iceberg doesn't use Hadoop committers, so those settings affect only your non-Iceberg tables. The increase in orphan data files in this case is actually caused by Spark: Spark has no way to register files you are writing, only files that have already been written. As a result, when a container is preempted after completing a file, the driver is never notified about that file.
Hadoop committers would avoid this by keeping data in a directory that the driver knows about, but it ends up being far worse to copy data around (twice!) than to keep orphan files cleaned up on a regular basis; a sketch of such a cleanup follows below the quoted message.

Ryan

On Wed, Oct 19, 2022 at 10:12 AM Jonathan Bender <[email protected]> wrote:

> Hello,
>
> We're increasing our use of YARN preemption on our Hadoop clusters, and
> we've noticed a significant uptick in orphaned data (i.e. data that isn't
> associated with an Iceberg table but was written from Spark executors in
> the same app). We suspect it could be due to partially written but
> uncommitted data which doesn't get propagated to the driver before the
> container is preempted.
>
> We'll continue to investigate on our side, but I wanted to confirm what
> the best option is here, or whether that's expected behavior at all.
> Should we be using magic committers in S3A to stage data with multipart
> uploads before the commit? Or is this handled by some other Iceberg
> construct?
>
> We're on a slight fork of Iceberg 0.12, Spark 2.4/3.1. Thanks in advance!
>
> Jon

--
Ryan Blue
Tabular
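[For the regular cleanup mentioned above, Iceberg ships a remove-orphan-files action. A minimal sketch, assuming the Spark 3 actions API that came with Iceberg 0.12, an active SparkSession, and an already-loaded Table; the OrphanFileCleanup class name and the 3-day retention window are illustrative, not a prescribed setup:

    import java.util.concurrent.TimeUnit;

    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.DeleteOrphanFiles;
    import org.apache.iceberg.spark.actions.SparkActions;

    public class OrphanFileCleanup {
      public static void run(Table table) {
        // Only remove files older than 3 days (an assumed window) so that
        // files still being written by in-flight jobs, which the driver may
        // not know about yet, are not mistaken for orphans.
        long olderThan = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3);

        // Scan the table location for files not referenced by any snapshot
        // and delete those older than the cutoff.
        DeleteOrphanFiles.Result result = SparkActions.get()
            .deleteOrphanFiles(table)
            .olderThan(olderThan)
            .execute();

        // Report what was removed; orphanFileLocations() lists deleted paths.
        result.orphanFileLocations()
            .forEach(path -> System.out.println("removed orphan file: " + path));
      }
    }

Running something like this on a schedule (rather than per job) keeps the preemption-related orphans from accumulating; the olderThan cutoff is what keeps the action safe to run while other writes are in flight.]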
