Hey Sreeram,

I feel like some of your points about why row groups exist are valid, but there are some really good reasons why you might want multiple row groups in a file (and I can share some situations where they can be valuable).
If you think historically about how distributed processing worked, it achieved parallelism by breaking files up rather than processing one big file. Even in that case, you may want to either combine or split files to achieve some desired amount of parallelism. A lot of file formats are splittable either arbitrarily or at rather granular levels (e.g., text files or Sequence files). However, columnar formats don't have that same luxury: because of the physical layout, boundaries become the product of a batch (or group) of records. This is where the row group concept comes from.

While it does help in many ways to align with HDFS blocks, a row group can be considered the smallest indivisible (non-splittable) unit within a Parquet file. You cannot efficiently split a row group, but if you have multiple row groups per file, you can keep large files while retaining flexibility in how you combine or split them to achieve the desired parallelism. This has a number of applications. For example, if you want fewer files overall (e.g., due to the size of a dataset), you could have 128MB row groups but 1GB files. That allows you to split the file so that individual tasks only process the size of a row group. Alternatively, if you made the entire file a single row group, a task would be required to process the full 1GB. This has a direct impact on how you can leverage parallelism to improve overall wall-clock performance. Iceberg even improves upon traditional split planning because it tracks row group boundaries in Iceberg metadata for more efficient planning.

As for the two advantages you propose, I'm not really sure the first holds true, since the Parquet footer holds all of the necessary offsets to load the correct row groups and column chunks. I don't feel like the complexity really changes significantly, and to be efficient you want to selectively load only the byte ranges you need.
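To make the split-planning math above concrete, here's a minimal sketch (illustrative only; this is not Iceberg's actual planner, and the sizes are made up) of how row group boundaries let a planner fan one large file out to many tasks, while a single-row-group file cannot be split at all:

```python
def plan_splits(row_group_sizes, target_split_size):
    """Greedily pack row groups (the smallest indivisible unit of a
    Parquet file) into splits, so no task reads more than roughly one
    target_split_size worth of data."""
    splits, current, current_size = [], [], 0
    for i, size in enumerate(row_group_sizes):
        # Row groups cannot be divided, so a split boundary can only
        # fall between row groups, never inside one.
        if current and current_size + size > target_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        splits.append(current)
    return splits

MB = 1024 * 1024

# A 1GB file written as 8 x 128MB row groups fans out to 8 tasks:
print(plan_splits([128 * MB] * 8, 128 * MB))  # [[0], [1], [2], [3], [4], [5], [6], [7]]

# The same 1GB file written as one row group is a single 1GB task:
print(plan_splits([1024 * MB], 128 * MB))  # [[0]]
```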
The second point about pruning also favors multiple row groups, because the stats are stored in the footer as well and can be evaluated across all the row groups, so you can prune out specific row groups within the file (with one row group per file, those stats would actually be redundant with the stats that Iceberg keeps). There are a number of real-world cases where we have reduced the row group size down to 16MB so that we can increase parallelism and take more advantage of pruning at multiple levels to get answers faster. At the same time, you wouldn't necessarily want to reduce the file size to achieve the same result, because it would produce significantly more files.

Just some things to consider before you head down that path,
-Dan

On Thu, Jul 1, 2021 at 2:38 PM Sreeram Garlapati <gsreeramku...@gmail.com> wrote:

> Hello Iceberg devs,
>
> We are leaning towards having 1 RowGroup Per File. We would love to know
> if there are any additional considerations - that we potentially would have
> missed.
>
> *Here's my understanding on How/Why Parquet historically needed to hold
> multiple Row Groups - more like the major reason:*
>
> 1. HDFS had a single name node. This created a bottleneck - for the
> operation handled by name node (i.e., maintaining that file address table)
> w.r.to resolving file name to location. So, naturally HDFS world
> created very large file sizes - in GBs.
> 2. So, in that world, to keep the file scan efficient - RowGroups were
> introduced - so that Stats can be maintained within a given file - which
> can help push the predicates down inside a given file to optimize/avoid
> full file scan, where applicable. Rowgroups are also configured to the size
> of HDFS block size to keep the reads/seeks efficient.
>
> *Here's why I feel this additional RowGroup concept is redundant:*
> In the new world where storage layer is housed in Cloud Blob stores - this
> bottleneck on file address tables is no longer present - as - behind the
> scenes it is typically a distributed hash table.
> ==> So, modelling a very large file is NOT a requirement anymore.
> ==> This concept of File having Multiple RowGroups - is not really useful.
> ==> we might very well simply create 1 Rowgroup per File
> ==> & ofcourse, we will still need to create reasonably big file sizes
> (for ex: 256mb) depending on the overall data on a given table - to let
> columnar/rle goodness kick-in.
>
> Added advantages of this are:
>
> 1. breaking down a v.large file into pieces to upload and download
> from filestores needs state maintenance at client and service which makes
> it complex & errorprone.
> 2. having only file level stats also puts the Iceberg metadata layer
> into very good use w.r.to file pruning.
>
> Due to the above reasons - we are leaning towards creating 1 RowGroup per
> File - when we are creating the iceberg table.
>
> Would love to know your thoughts!
> Sreeram
>
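P.S. One more illustration of the pruning point from above, since it's easy to show. This is a minimal, illustrative sketch (the min/max values are made up and stand in for the per-row-group column statistics a reader would find in the Parquet footer): with several row groups per file, a predicate can skip most of the file's data pages without reading them, whereas with one row group per file the stats collapse to a single entry and just duplicate what Iceberg's file-level stats already provide.

```python
def prune_row_groups(stats, lo, hi):
    """Keep only the row groups whose [min, max] range could contain a
    value in [lo, hi]; everything else can be skipped without any I/O."""
    return [i for i, (mn, mx) in enumerate(stats) if mx >= lo and mn <= hi]

# One file, four row groups, hypothetical min/max for some roughly
# sorted column, one (min, max) pair per row group:
stats = [(0, 99), (100, 199), (200, 299), (300, 399)]

# Predicate: column BETWEEN 150 AND 250 -> only groups 1 and 2 survive,
# so half of the file's data pages are never read.
print(prune_row_groups(stats, 150, 250))  # [1, 2]

# With a single row group per file, there is nothing to prune inside
# the file - the decision is all-or-nothing:
print(prune_row_groups([(0, 399)], 150, 250))  # [0]
```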