Hey Sreeram,

Some of your points about why row groups exist are valid, but there are
still some really good reasons why you might want multiple row groups in
a file (and I can share some situations where it can be valuable).

If you think historically about how distributed processing worked, it
achieved parallelism by breaking files up rather than processing one big
file.  However, even in that case, you may want to either combine or
split files to achieve some desired amount of parallelism.  A lot of
file formats are splittable either arbitrarily or at rather granular
levels (e.g., text files or SequenceFiles).  Columnar formats don't have
that same luxury: because of the physical layout, the boundaries become
the product of a batch (or group) of records.  This is where the row
group concept comes from.  While it does help in many ways to align with
HDFS blocks, a row group can be considered the smallest indivisible
(non-splittable) unit within a Parquet file.  You cannot efficiently
split a row group, but if you have multiple row groups per file, you can
keep large files while retaining flexibility in how you combine or split
them to achieve the desired parallelism.

This has a number of applications.  For example, if you want fewer files
overall (e.g., due to the size of a dataset), you could have 128MB row
groups but 1GB files.  This would allow you to split that file so that
individual tasks only need to process the size of a row group.
Alternatively, if you made the entire file a single row group, a task
would be required to process the full 1GB.  This has an impact on how
you can leverage parallelism to improve overall wall-clock performance.
Iceberg even improves upon traditional split planning because it tracks
the row group boundaries in Iceberg metadata for more efficient
planning.
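
To make that concrete, here's a minimal sketch with pyarrow (the file
name, sizes, and data are illustrative; note that pyarrow sizes row
groups by row count, while parquet-mr uses a byte target):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Illustrative data; a real writer closes a row group when it hits
    # a size target (e.g., 128MB row groups within a 1GB file).
    table = pa.table({"id": list(range(1_000_000))})

    # Four row groups in one file; each is an independent split.
    pq.write_table(table, "data.parquet", row_group_size=250_000)

    md = pq.ParquetFile("data.parquet").metadata
    print(md.num_row_groups)  # 4

Each of those row groups can then be handed to a separate task, even
though there is only one file.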

As for the two advantages you propose, I'm not sure the first holds
true, since the Parquet footer holds all of the offsets needed to locate
the correct row groups and column chunks.  I don't feel the complexity
changes significantly, and to be efficient you want to selectively load
only the byte ranges you need.
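
As a rough illustration (pyarrow exposes the same footer metadata that
parquet-mr reads; the file name carries over from the sketch above):

    import pyarrow.parquet as pq

    md = pq.ParquetFile("data.parquet").metadata  # one footer read
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            chunk = md.row_group(rg).column(col)
            # Every column chunk's byte range is known up front, so a
            # reader can issue ranged reads for just the row groups
            # and columns it needs.
            print(rg, chunk.path_in_schema, chunk.file_offset,
                  chunk.total_compressed_size)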

The second point about pruning also favors multiple row groups, because
the stats are stored in the footer as well and can be evaluated across
all the row groups, letting you prune specific row groups within the
file (with one row group per file, those stats would actually be
redundant with the stats that Iceberg keeps).
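
Sketching that pruning with pyarrow (the predicate id = 42 and the
column index are illustrative):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("data.parquet")
    md = pf.metadata
    wanted = []
    for rg in range(md.num_row_groups):
        stats = md.row_group(rg).column(0).statistics  # column 0 = "id"
        # Keep only row groups whose min/max range could contain 42;
        # a chunk without stats can't be pruned, so keep it too.
        if (stats is None or not stats.has_min_max
                or stats.min <= 42 <= stats.max):
            wanted.append(rg)
    parts = [pf.read_row_group(rg) for rg in wanted]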

There are actually a number of real-world cases where we have reduced
the row group size down to 16MB so that we can increase parallelism and
take more advantage of pruning at multiple levels to get answers faster.
At the same time, you wouldn't necessarily want to reduce the file size
to achieve the same result, because it would produce significantly more
files.
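
For what it's worth, in Iceberg that knob is just a table property
(sketched here via PySpark; the table name is illustrative and 'spark'
is assumed to be an existing SparkSession):

    # 16MB row groups within normally sized files.
    spark.sql(
        "ALTER TABLE db.events SET TBLPROPERTIES "
        "('write.parquet.row-group-size-bytes'='16777216')"
    )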

Just some things to consider before you head down that path,
-Dan



On Thu, Jul 1, 2021 at 2:38 PM Sreeram Garlapati <gsreeramku...@gmail.com>
wrote:

> Hello Iceberg devs,
>
> We are leaning towards having 1 RowGroup per File. We would love to know
> if there are any additional considerations that we may have missed.
>
> *Here's my understanding of how/why Parquet historically needed to hold
> multiple Row Groups - more like the major reason:*
>
>    1. HDFS had a single NameNode. This created a bottleneck for the
>    operation the NameNode handles (i.e., maintaining that file address
>    table) w.r.t. resolving file names to locations. So, naturally, the
>    HDFS world created very large file sizes - in GBs.
>    2. So, in that world, to keep file scans efficient, RowGroups were
>    introduced so that stats can be maintained within a given file, which
>    helps push predicates down inside a given file to optimize/avoid a
>    full file scan, where applicable. RowGroups are also configured to the
>    HDFS block size to keep reads/seeks efficient.
>
> *Here's why I feel this additional RowGroup concept is redundant:*
> In the new world, where the storage layer is housed in cloud blob
> stores, this bottleneck on file address tables is no longer present, as
> behind the scenes it is typically a distributed hash table.
> ==> So, modelling a very large file is NOT a requirement anymore.
> ==> This concept of a file having multiple RowGroups is not really useful.
> ==> We might very well simply create 1 RowGroup per File.
> ==> & of course, we will still need to create reasonably big file sizes
> (for ex: 256MB), depending on the overall data in a given table, to let
> columnar/RLE goodness kick in.
>
> Added advantages of this are:
>
>    1. Breaking down a very large file into pieces to upload to and
>    download from file stores needs state maintenance at the client and
>    service, which makes it complex & error-prone.
>    2. Having only file-level stats also puts the Iceberg metadata layer
>    to very good use w.r.t. file pruning.
>
> Due to the above reasons, we are leaning towards creating 1 RowGroup
> per File when we are creating the Iceberg table.
>
> Would love to know your thoughts!
> Sreeram
>
