Notes:

Anna (Cloudera): Bloom filter update, Iceberg
Gabor, Nandor (Cloudera):
- Value skipping implementation to be reviewed.
- Move Java code from parquet-format to parquet-mr. PR ready
- How can users of Parquet handle timestamps and TZs. Allow for writing timestamp in java. Refactor original type logic to more flexible new original type api.
- Column indexes and alignment of pages
- Limiting the number of records in a page to avoid skewed splits when compression is really good.
Ryan (Netflix): Iceberg stuff back to Parquet: expression library for push down. Dictionary and stats based row group filtering.
JunJie (Intel): Bloom filter. Need more reviews. Have a vote on the design and add it to parquet-format.
Julien (Wework): Encryption.

- Bloom Filter: https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-41
   - Committed utility class to parquet-cpp
   - Uploaded the benchmark result.
   - Ready to add into the spec.
   - Submit a PR for the parquet reader spec.
   - *Action*: review parquet java utility class. https://github.com/apache/parquet-mr/pull/425
- Encryption:
   - Nandor, Gabor reviewing.
   - Apis to allow pluggable key management.
   - Need to have a proper review of the spec.
   - Need more testing
- Column indices:
   - PR to be reviewed: https://github.com/apache/parquet-mr/pull/514
   - Ryan: to review features branch
- Moving java code from parquet-format to parquet-mr:
   - Action: review. https://github.com/apache/parquet-mr/pull/517
   - Gets the thrift file from the parquet-format released artifact.
- Maximum number of records per page:
   - We should add a property with a maximum number of records per page and per row group.
   - Need to benchmark to figure out a good default. 10K?
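To make the "maximum number of records per page" idea concrete, here is a minimal sketch of a page writer that closes a page when either a byte threshold or a row-count limit is reached. It is not parquet-mr code: the class `PageRowLimitSketch`, the method `paginate`, and the 1 MiB byte threshold are hypothetical names and values chosen for illustration, and the 10,000-row limit is only the candidate default floated in the notes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; not the parquet-mr API. */
class PageRowLimitSketch {

    /**
     * Splits a stream of encoded record sizes into pages and returns
     * the number of rows that ended up in each page.
     */
    static List<Integer> paginate(int[] recordSizes, long maxPageBytes, int maxPageRows) {
        List<Integer> pageRowCounts = new ArrayList<>();
        long bytesInPage = 0;
        int rowsInPage = 0;
        for (int size : recordSizes) {
            bytesInPage += size;
            rowsInPage++;
            // Close the page as soon as either limit is reached.
            if (bytesInPage >= maxPageBytes || rowsInPage >= maxPageRows) {
                pageRowCounts.add(rowsInPage);
                bytesInPage = 0;
                rowsInPage = 0;
            }
        }
        // Flush the last, partially filled page.
        if (rowsInPage > 0) {
            pageRowCounts.add(rowsInPage);
        }
        return pageRowCounts;
    }

    public static void main(String[] args) {
        // 100,000 records that encode to 1 byte each (very good compression).
        int[] sizes = new int[100_000];
        Arrays.fill(sizes, 1);

        // Byte limit only (assumed 1 MiB): everything lands in one huge page.
        System.out.println(paginate(sizes, 1L << 20, Integer.MAX_VALUE).size()); // prints 1

        // Adding a 10,000-row limit yields ten evenly sized pages.
        System.out.println(paginate(sizes, 1L << 20, 10_000).size()); // prints 10
    }
}
```

With only the byte threshold, highly compressible data produces a single page holding all 100,000 rows, which is the skew the notes describe. The row limit bounds the page regardless of how small the encoded data gets.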
- Iceberg:
   - Some of the iceberg code should be in Parquet:
      - Rewrote record reconstruction stack
         - Reuses page reader and decoder
         - Then does a triple iterator that return an entire column in a file (iterator of triples)
         - Record reconstruction class that handles everything that the current one does but with {list, map} factories
      - 20% faster to write, 5% faster to read
      - Easier to write object mappers
      - Helps with page level skipping.
   - High level abstractions in the iceberg library:
      - Take an expression and simplify it (not, ...) to run on metadata
      - Take a complex expression and split the part on the partition/min/max and the remaining part.

On Mon, Aug 27, 2018 at 4:56 AM, Nandor Kollar <nkol...@cloudera.com.invalid> wrote:

> Yes, CEST.
>
> On Mon, Aug 27, 2018 at 1:01 PM, Uwe L. Korn <uw...@xhochy.com> wrote:
>
> > Hello Nador,
> >
> > probably I can make this time. Just a timezone question: Is it 6pm CET
> > or 6pm CEST? I guess the latter.
> >
> > See http://timesched.pocoo.org/?date=2018-08-28&tz=central-europe-standard-time!,pacific-standard-time&range=1080,1140
> >
> > Uwe
> >
> > On Mon, Aug 27, 2018, at 12:20 PM, Nandor Kollar wrote:
> >> Hi All,
> >>
> >> As discussed on last Parquet sync, I propose to have an other meeting
> >> on August 28th, at 6pm CET / 9 am PST to discuss those topic which we
> >> didn't have time on the sync at August 15th, and of course any new
> >> topic too.
> >>
> >> Sorry for the late notice, feel free to propose other time slot if is
> >> is not suitable for you! Calendar entry to follow.
> >>
> >> Regards,
> >> Nandor