Question on record_count field in the data-file entry of a manifest file

Vivekanand Vellanki Wed, 07 Apr 2021 01:41:45 -0700

Hi,

We are in the process of converting Hive datasets to Iceberg datasets.


In this process, we noticed that each data-file entry in the manifest file
has a required record_count field.

Populating this accurately would require reading the footer of every
Parquet file (or the file tail of every ORC file). For Avro files, it
requires reading the header of every block and summing the per-block
counts to determine the number of records in the file.
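
To make the Avro case concrete, here is a rough sketch of what that scan
could look like. It is not taken from Iceberg or from the Avro library; it
only assumes the standard Avro object container file layout (4-byte magic,
a metadata map, a 16-byte sync marker, then blocks that each begin with an
object count and a byte size). The names count_avro_records, read_long and
skip_bytes are made up for illustration.

```python
import io

MAGIC = b"Obj\x01"   # Avro object container file magic
SYNC_SIZE = 16       # length of the sync marker in bytes


def read_long(f):
    """Decode one Avro long (zigzag varint). Returns None at a clean EOF."""
    shift = 0
    acc = 0
    while True:
        b = f.read(1)
        if not b:
            if shift == 0:
                return None
            raise EOFError("truncated varint")
        acc |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)  # undo zigzag encoding


def skip_bytes(f):
    """Skip one length-prefixed Avro bytes/string value."""
    f.seek(read_long(f), io.SEEK_CUR)


def count_avro_records(f):
    """Sum the per-block object counts without decoding any records."""
    if f.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")

    # Skip the file metadata map (schema, codec, ...).
    while True:
        n = read_long(f)
        if n is None:
            raise EOFError("truncated header")
        if n == 0:
            break
        if n < 0:            # negative count is followed by a byte size
            n = -n
            read_long(f)
        for _ in range(n):
            skip_bytes(f)    # key
            skip_bytes(f)    # value

    sync = f.read(SYNC_SIZE)

    total = 0
    while True:
        count = read_long(f)           # objects in this block
        if count is None:
            break                      # clean end of file
        size = read_long(f)            # serialized size of the block
        f.seek(size, io.SEEK_CUR)      # skip the (possibly compressed) data
        if f.read(SYNC_SIZE) != sync:
            raise ValueError("sync marker mismatch")
        total += count
    return total
```

With a local file this would be called as count_avro_records(open(path,
"rb")). The block data is skipped rather than decoded, so the cost is one
seek per block, but on remote storage that is still one read per block
header, which is the overhead we are asking about.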

Is the record_count in the data-file entry expected to be exact, or can
we estimate it from the file size and an estimated row size?

Thanks
Vivek
