Thanks for responding, Anton! Do we think the delay is mainly due to lower/upper bound filtering? Have you faced this? I haven't pinpointed exactly where the slowness is yet; it's generally in the stats filtering, but I'm not sure which part of it is causing this much network traffic. There's a CloseableIterable that takes a ton of time in its next() and hasNext() calls. My guess is that the expression evaluation on each manifest entry is what's doing it.
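To narrow that down outside of Spark, split planning can be timed directly against the table. Below is a minimal sketch using the incubator Java API (HadoopTables and TableScan.planFiles()); the table location is a placeholder and the filter mirrors the batchId predicate from the queries further down in this thread:

```
// Sketch: time Iceberg split planning directly, outside of Spark.
// Assumes the incubator Java API; the table location below is a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public class PlanFilesTiming {
  public static void main(String[] args) throws Exception {
    Table table = new HadoopTables(new Configuration())
        .load("adl://<store>/path/to/table");  // placeholder table location

    long start = System.nanoTime();
    int fileCount = 0;
    // planFiles() evaluates partition data and column stats for every manifest
    // entry; each hasNext()/next() call on the CloseableIterable pulls that
    // evaluation through, so the loop below isolates the planning cost.
    try (CloseableIterable<FileScanTask> tasks = table.newScan()
        .filter(Expressions.equal("batchId", "6d50eeb3e7d74b4f99eea91a27fc8f15"))
        .planFiles()) {
      for (FileScanTask task : tasks) {
        fileCount++;
      }
    }
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("Planned " + fileCount + " file(s) in " + elapsedMs + " ms");
  }
}
```

If most of the wall-clock time lands inside that loop, the cost is in reading the manifest and evaluating stats for each entry rather than in the Parquet reads themselves.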
On Fri, Apr 19, 2019 at 1:41 PM Anton Okolnychyi <aokolnyc...@apple.com> wrote:

> I think we need to have a list of columns for which we want to collect
> stats, and that should be configurable by the user. Maybe this config
> should be applicable only to lower/upper bounds. As we now collect stats
> even for nested struct fields, this might generate a lot of data. In most
> cases, users cluster/sort their data by a subset of data columns to have
> fast queries with predicates on those columns. So being able to configure
> the columns for which to collect lower/upper bounds seems reasonable.
>
> On 19 Apr 2019, at 08:03, Gautam <gautamkows...@gmail.com> wrote:
>
>>> The length in bytes of the schema is 109M as compared to 687K of the
>>> non-stats dataset.
>>
>> Typo: length in bytes of *manifest*. The schema is the same.
>>
>> On Fri, Apr 19, 2019 at 12:16 PM Gautam <gautamkows...@gmail.com> wrote:
>>
>> Correction, partition count = 4308.
>>
>>> Re: Changing the way we keep stats. Avro is a block-splittable format
>>> and is friendly with parallel compute frameworks like Spark.
>>
>> Here I am trying to say that we don't need to change the format to
>> columnar, right? The current format is already friendly for
>> parallelization.
>>
>> thanks.
>>
>> On Fri, Apr 19, 2019 at 12:12 PM Gautam <gautamkows...@gmail.com> wrote:
>>
>>> Ah, my bad. I missed adding the schema details. Here are some details on
>>> the dataset with stats:
>>>
>>> Iceberg Schema Columns: 20
>>> Spark Schema fields: 20
>>> Snapshot Summary: {added-data-files=4308, added-records=11494037, changed-partition-count=4308, total-records=11494037, total-data-files=4308}
>>> Manifest files: 1
>>> Manifest details:
>>>   => manifest file path: adl://[dataset_base_path]/metadata/4bcda033-9df5-4c84-8eef-9d6ef93e4347-m0.avro
>>>   => manifest file length: 109,028,885
>>>   => existing files count: 0
>>>   => added files count: 4308
>>>   => deleted files count: 0
>>>   => partitions count: 4
>>>   => partition fields count: 4
>>>
>>> Re: Num data files. The table has a single manifest keeping track of
>>> 4308 files. Total record count is 11.4 million.
>>>
>>> Re: Columns. You are right that this table has many columns: although it
>>> has only 20 top-level columns, the number of leaf columns is in the
>>> thousands. The schema is heavy on structs (in the thousands) and has
>>> deep levels of nesting. I know Iceberg keeps *column_sizes,
>>> value_counts, null_value_counts* for all leaf fields, and additionally
>>> *lower-bounds, upper-bounds* for native and struct types (not yet for
>>> map KVs and arrays). The length in bytes of the schema is 109M as
>>> compared to 687K for the non-stats dataset.
>>>
>>> Re: Turning off stats. I am looking to leverage stats because, for our
>>> datasets with a much larger number of data files, we want to use
>>> Iceberg's ability to skip entire files based on these stats. This is one
>>> of the big incentives for us to use Iceberg.
>>>
>>> Re: Changing the way we keep stats. Avro is a block-splittable format
>>> and is friendly with parallel compute frameworks like Spark. So would it
>>> make sense, for instance, to add an option to have a Spark job / Futures
>>> handle split planning? In a larger context, 109M is not that much
>>> metadata, given that Iceberg is meant for datasets where the metadata
>>> itself is big-data scale. I'm curious how folks with larger metadata (in
>>> the GBs) are optimizing this today.
>>>
>>> Cheers,
>>> -Gautam.
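(A side note on sizing: to see which of the thousands of leaf fields actually account for the 109 MB of manifest, something like the rough sketch below could sum the serialized lower/upper bound bytes per field id across the data files returned by planning. This is only a sketch against the incubator Java API; the table location is a placeholder, and field ids can be mapped back to column names with the table schema's findColumnName(int).)

```
// Sketch: roughly attribute lower/upper bound bytes to leaf field ids.
// Assumes the incubator Java API; the table location below is a placeholder.
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public class BoundsSizeByField {
  public static void main(String[] args) throws Exception {
    Table table = new HadoopTables(new Configuration())
        .load("adl://<store>/path/to/table");  // placeholder table location

    Map<Integer, Long> bytesPerField = new HashMap<>();
    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        DataFile file = task.file();
        addBounds(bytesPerField, file.lowerBounds());
        addBounds(bytesPerField, file.upperBounds());
      }
    }

    // Print the 20 heaviest fields by serialized bound size.
    bytesPerField.entrySet().stream()
        .sorted(Map.Entry.<Integer, Long>comparingByValue().reversed())
        .limit(20)
        .forEach(e ->
            System.out.println("field id " + e.getKey() + ": " + e.getValue() + " bytes"));
  }

  private static void addBounds(Map<Integer, Long> acc, Map<Integer, ByteBuffer> bounds) {
    if (bounds == null) {
      return;  // bounds may be absent if stats were not collected
    }
    for (Map.Entry<Integer, ByteBuffer> e : bounds.entrySet()) {
      acc.merge(e.getKey(), (long) e.getValue().remaining(), Long::sum);
    }
  }
}
```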
>>>
>>> On Fri, Apr 19, 2019 at 12:40 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>
>>>> Thanks for bringing this up! My initial theory is that this table has a
>>>> ton of stats data that you have to read. That could happen in a couple
>>>> of cases.
>>>>
>>>> First, you might have large values in some columns. Parquet will
>>>> suppress its stats if values are larger than 4k, and those are what
>>>> Iceberg uses. But that could still cause you to store two 1k+ objects
>>>> for each large column (lower and upper bounds). With a lot of data
>>>> files, that could add up quickly. The solution here is to implement #113
>>>> <https://github.com/apache/incubator-iceberg/issues/113> so that we
>>>> don't store the actual min and max for string or binary columns, but
>>>> instead a truncated value that is just above or just below.
>>>>
>>>> The second case is when you have a lot of columns. Each column stores
>>>> both a lower and an upper bound, so 1,000 columns could easily take 8k
>>>> per file. If this is the problem, then maybe we want to have a way to
>>>> turn off column stats. We could also think of ways to change how stats
>>>> are stored in the manifest files, but that only helps if we move to a
>>>> columnar format for manifests, so this is probably not a short-term fix.
>>>>
>>>> If you can share a bit more information about this table, we can
>>>> probably tell which one is the problem. I'm guessing it is the
>>>> large-values problem.
>>>>
>>>> On Thu, Apr 18, 2019 at 11:52 AM Gautam <gautamkows...@gmail.com> wrote:
>>>>
>>>>> Hello folks,
>>>>>
>>>>> I have been testing Iceberg reads with and without stats built into the
>>>>> Iceberg dataset manifest and found that there's a huge jump in network
>>>>> traffic with the latter.
>>>>>
>>>>> In my test I am comparing two datasets, both written in Iceberg format:
>>>>> one with and the other without stats collected in the Iceberg
>>>>> manifests. In particular, the difference between the writers used for
>>>>> the two datasets is this PR:
>>>>> https://github.com/apache/incubator-iceberg/pull/63/files, which uses
>>>>> Iceberg's writers for writing Parquet data. I captured tcpdump from
>>>>> query scans run on these two datasets. The partition being scanned
>>>>> contains 1 manifest, 1 Parquet data file, and ~3700 rows in both
>>>>> datasets. There's a 30x jump in network traffic to the remote
>>>>> filesystem (ADLS) when I switch to the stats-based Iceberg dataset.
>>>>> Both queries used the same Iceberg reader code to access the two
>>>>> datasets.
>>>>>
>>>>> ```
>>>>> root@d69e104e7d40:/usr/local/spark# tcpdump -r iceberg_geo1_metrixx_qc_postvalues_batch_query.pcap | grep perfanalysis.adlus15.projectcabostore.net | grep ">" | wc -l
>>>>> reading from file iceberg_geo1_metrixx_qc_postvalues_batch_query.pcap, link-type EN10MB (Ethernet)
>>>>> 8844
>>>>>
>>>>> root@d69e104e7d40:/usr/local/spark# tcpdump -r iceberg_scratch_pad_demo_11_batch_query.pcap | grep perfanalysis.adlus15.projectcabostore.net | grep ">" | wc -l
>>>>> reading from file iceberg_scratch_pad_demo_11_batch_query.pcap, link-type EN10MB (Ethernet)
>>>>> 269708
>>>>> ```
>>>>>
>>>>> As a consequence, the query response times are affected drastically
>>>>> (illustrated below). I must confess that I am on a slow internet
>>>>> connection via VPN connecting to the remote FS.
>>>>> But the dataset without stats took just 1m 49s, while the dataset with
>>>>> stats took 26m 48s to read the same amount of data. Most of that time
>>>>> for the stats dataset was spent in split planning, i.e. manifest
>>>>> reading and stats evaluation.
>>>>>
>>>>> ```
>>>>> all=> select count(*) from iceberg_geo1_metrixx_qc_postvalues where batchId = '4a6f95abac924159bb3d7075373395c9';
>>>>>  count(1)
>>>>> ----------
>>>>>      3627
>>>>> (1 row)
>>>>> Time: 109673.202 ms (01:49.673)
>>>>>
>>>>> all=> select count(*) from iceberg_scratch_pad_demo_11 where _ACP_YEAR=2018 and _ACP_MONTH=01 and _ACP_DAY=01 and batchId = '6d50eeb3e7d74b4f99eea91a27fc8f15';
>>>>>  count(1)
>>>>> ----------
>>>>>      3808
>>>>> (1 row)
>>>>> Time: 1608058.616 ms (26:48.059)
>>>>> ```
>>>>>
>>>>> Has anyone faced this? I'm wondering if there's some caching or
>>>>> parallelism option here that can be leveraged. Would appreciate some
>>>>> guidance. If there isn't a straightforward fix and others feel this is
>>>>> an issue, I can raise an issue and look into it further.
>>>>>
>>>>> Cheers,
>>>>> -Gautam.
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
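For context on the #113 idea mentioned above, here is a hypothetical illustration of bound truncation (not Iceberg's actual implementation): a lower bound can be replaced by a short prefix of the minimum value, and an upper bound by a prefix of the maximum whose last character is bumped so it still sorts just above every value it covers.

```
// Hypothetical illustration of truncated string bounds (not Iceberg's code).
// A prefix of the minimum is still a valid lower bound; for the upper bound,
// incrementing the last kept character keeps the truncated value >= the real max.
public class TruncatedBounds {
  static String truncateLower(String min, int width) {
    return min.length() <= width ? min : min.substring(0, width);
  }

  static String truncateUpper(String max, int width) {
    if (max.length() <= width) {
      return max;
    }
    char[] prefix = max.substring(0, width).toCharArray();
    prefix[width - 1]++;  // simplification: ignores overflow past Character.MAX_VALUE
    return new String(prefix);
  }

  public static void main(String[] args) {
    String max = "6d50eeb3e7d74b4f99eea91a27fc8f15-some-very-long-suffix";
    System.out.println(truncateLower(max, 16));  // prefix, sorts at or below the value
    System.out.println(truncateUpper(max, 16));  // sorts above any value with that prefix
  }
}
```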