That helps a lot! Thank you, Szehon, for the detailed response!

ggg

On Fri, Jan 7, 2022 at 1:54 PM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Sure, I guessed you were asking about the number of manifest files rather
> than entries. There's always a tradeoff, some aspects being:
>
>    - More manifest files => better predicate pushdown (more manifest
>    files can be skipped during a query), and less chance of a concurrency
>    conflict (two transactions trying to modify the same manifest file,
>    which leads to a retry).
>    - Fewer manifest files => metadata queries (like show partitions) can
>    be faster.
>
> Each of these is a large topic in itself that might be too big to go into
> here :)
>
> For us, the benefit of more manifest files is not as important as
> making metadata queries fast for our users. So we have tuned
> commit.manifest.target-size-bytes to be a few times larger than the
> default. We try to keep the manifest file count in the tens or hundreds
> for any table; we find that if there are thousands, a 'show partition'
> query takes a long time.
>
> We do need to run a periodic RewriteManifest to keep the table in this
> shape (as we have too many commits), and we also use
> 'commit.manifest.min-count-to-merge' and 'commit.manifest-merge.enabled'
> to merge on commit.
>
> Hope that helps,
> Szehon
>
> On Fri, Jan 7, 2022 at 1:10 PM g. g. grey <g.g.g...@gmail.com> wrote:
>
>> Hi Szehon,
>>
>> Thanks. My apologies; I was too loose in my wording. I'll try to use the
>> terms from the spec.
>>
>> I was asking about the total number of manifest files, specifically the
>> number of `manifest_file` structs that are found in the manifest-list file.
>>
>> It sounds like "commit.manifest.target-size-bytes" controls the
>> target size when small manifest files are merged. That is great to know
>> we can configure, as it will clearly have an impact on the number of
>> `manifest_file` structs.
>>
>> Is there a general order-of-magnitude target number of `manifest_file`
>> structs? Presumably that would dictate when one would want to merge
>> manifest files and/or data files.
>>
>> Thanks again!
>> ggg
>>
>>
>> On Fri, Jan 7, 2022 at 11:41 AM Szehon Ho <szehon.apa...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> There is one manifest entry per data file or delete file, so it depends
>>> on how many data files/delete files your table has. The number of files
>>> is controlled mostly by the parallelism of the job that writes the table,
>>> though there are Iceberg RewriteDataFile utilities that can compact as
>>> well (as in your link).
>>>
>>> The number of manifest files is another topic, controlled by
>>> "commit.manifest.target-size-bytes"
>>> (but it should not affect the total number of manifest entries).
>>>
>>> Hope that helps,
>>> Szehon
>>>
>>> On Fri, Jan 7, 2022 at 9:39 AM g. g. grey <g.g.g...@gmail.com> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> I am just getting started with Iceberg and I'm trying to build up some
>>>> intuition for how large the metadata will become for large, active tables.
>>>> Specifically, what is the order of magnitude of manifest entries that I
>>>> should reasonably expect in a manifest-list file? Is there a particular
>>>> range that is ideal and aimed for when cleaning up/maintaining a table?
>>>>
>>>> I found the maintenance page <https://iceberg.apache.org/#maintenance/>,
>>>> but I'm hoping to find rules of thumb based on people's experience with
>>>> using Iceberg.
>>>>
>>>> Thanks! If I've missed the info somewhere, a simple pointer would be
>>>> great.
>>>> ggg
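
For anyone finding this thread later, here is a rough Python sketch of the tuning discussed above. It is only an illustration, not how Iceberg itself plans merges. The helper names (`estimate_manifest_count`, `set_properties_sql`) and the table name `db.events` are made up for the example. The 8 MB default for commit.manifest.target-size-bytes and the default of 100 for commit.manifest.min-count-to-merge are what I understand the documented defaults to be, so please check them against your Iceberg version. The estimate assumes manifests are fully merged up to the target size, which a busy table will not always achieve between rewrites.

```python
import math

# Assumed default for commit.manifest.target-size-bytes (8 MB); verify
# against the table-properties documentation for your Iceberg version.
DEFAULT_TARGET_SIZE_BYTES = 8 * 1024 * 1024


def estimate_manifest_count(total_manifest_bytes,
                            target_size_bytes=DEFAULT_TARGET_SIZE_BYTES):
    """Rough count of manifest files if all manifests were merged to the
    target size. Hypothetical helper: real tables usually have more,
    because new commits keep adding small manifests."""
    if total_manifest_bytes <= 0:
        return 0
    return math.ceil(total_manifest_bytes / target_size_bytes)


def set_properties_sql(table, properties):
    """Build a Spark SQL ALTER TABLE ... SET TBLPROPERTIES statement.
    Hypothetical helper: it only formats a string and runs nothing."""
    pairs = ", ".join(f"'{key}' = '{value}'"
                      for key, value in properties.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({pairs})"


# "A few times larger than the default", as described in the thread.
# The factor of 4 is an arbitrary choice for illustration.
tuned_properties = {
    "commit.manifest.target-size-bytes": 4 * DEFAULT_TARGET_SIZE_BYTES,
    "commit.manifest-merge.enabled": "true",
    "commit.manifest.min-count-to-merge": 100,  # assumed default value
}

if __name__ == "__main__":
    ten_gib = 10 * 1024 ** 3  # made-up total size of all manifests
    print(estimate_manifest_count(ten_gib))
    print(estimate_manifest_count(
        ten_gib, tuned_properties["commit.manifest.target-size-bytes"]))
    print(set_properties_sql("db.events", tuned_properties))
```

With these made-up numbers, quadrupling the target size moves the estimate from the thousands into the hundreds, which is the range Szehon says keeps 'show partition' queries fast. The periodic manifest rewrite mentioned above is a separate maintenance step and is not shown here.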