Notes:
Parquet Sync Sept 13 2017:
Lars (Impala Cloudera - CA): want feedback on Puja’s pull request for page
index
Anna (Cloudera - Hungary)
Jim (Cloudera - CA): Bloom Filters
Ryan (Netflix - CA): parquet-cli zstd/lz4 to try out. Parquet format
release, logical type PR.
Junjie (Intel - Shanghai): Bloom filter status
Bikramjeet (Cloudera Impala - CA): clarify specification for column stats
and type for min/max storage
Wes (Twosigma - NY): C++
Julien (CA): patch release of parquet-mr
TZs: GMT-8, GMT-5, GMT+1, GMT+8
Time: 9am (SF), 12pm (NY), 6pm (Budapest), 1am (Shanghai) !
- Bloom Filter:
- Junjie submitted pull request for parquet-format and parquet-mr. bloom
filter utility + tests.
- https://github.com/apache/parquet-format/pull/62/files
- not to be merged right away but feedback
- https://github.com/apache/parquet-mr/pull/425/files
- to move to package protected or tests to start incremental merge
without making it public
- Need review: Ryan, Julien, Jim
- compatibility, integration tests?
- old compatibility test repo:
https://github.com/Parquet/parquet-compatibility
- Arrow integration tests:
https://github.com/apache/arrow/tree/master/integration
- Action: Anna, Lars to follow up with Cloudera
- Build: travis-ci broken with latest linux thrift-7 incompatibility
- parquet-mr should move to thrift-9: PARQUET-1103
- pin thrift to fixed version in build like in parquet-format.
- Page Index: https://github.com/apache/parquet-format/pull/63
- Action review by end of next week: Julien, Ryan, Marcel
- TODO (Lars?): move design doc to markdown in parquet-format
- should add (brief) comments in thrift definition (clarify in review)
- zstd/lz4:
- Ryan has a version of parquet-cli working with zstd, lz4 and brotli
for experimentation
- building with zstd backported was difficult. (provides hadoop jar)
- anyone interested in running their own tests?
- Lars to check at Cloudera.
- Ryan to send out on the list
- Wes built benchmarking fixtures in C++. TODO: write tests.
- use some shareable dataset for validation (NY Taxi dataset?).
- Logical type PR: https://github.com/apache/parquet-format/pull/51
- TODO: feedback
- reviewers: Julien
- clarification of min max storage:
- https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L215
- format of min and max values is the same as defined by the type.
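To illustrate the min/max point above, here is a minimal Python sketch. It assumes that "the same as defined by the type" means the PLAIN encoding of the physical type (fixed-width values as little-endian bytes, byte arrays as raw bytes with no length prefix), which is how the comments in parquet.thrift describe statistics values. The helper names and the type table are hypothetical, made up for this example, and are not part of any Parquet library.

```python
import struct

# Hypothetical table: Parquet physical type name -> struct format.
# "<" means little-endian, which is what PLAIN encoding uses.
_FORMATS = {"INT32": "<i", "INT64": "<q", "FLOAT": "<f", "DOUBLE": "<d"}


def encode_stat(physical_type, value):
    """Encode a min or max value as bytes for the statistics fields."""
    if physical_type == "BYTE_ARRAY":
        # Assumption: raw bytes, without the 4-byte length prefix.
        return bytes(value)
    return struct.pack(_FORMATS[physical_type], value)


def decode_stat(physical_type, raw):
    """Decode a min or max value written by encode_stat."""
    if physical_type == "BYTE_ARRAY":
        return bytes(raw)
    return struct.unpack(_FORMATS[physical_type], raw)[0]


if __name__ == "__main__":
    raw_min = encode_stat("INT32", 1)
    print(raw_min.hex(), decode_stat("INT32", raw_min))
```

A reader that knows the column's physical type can decode the stored bytes with the matching format; how the decoded values are compared (sort order) is a separate question and is not covered by this sketch.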
- making releases:
- want a parquet-format release for:
- logical types (not merged yet)
- page indexes (not merged yet)
- sort order (merged)
- we won’t block on bloom filter. We can make another release as soon as
it is ready.
- Ryan to run the parquet-format release.
- need volunteer for parquet-mr release.
On Wed, Sep 13, 2017 at 8:58 AM, Julien Le Dem <[email protected]>
wrote:
> The Parquet sync is starting now at:
> https://meet.google.com/ent-mvhf-twr
>
> On Tue, Sep 12, 2017 at 8:55 PM, Julien Le Dem <[email protected]>
> wrote:
>
>> +1
>>
>> On Mon, Sep 11, 2017 at 8:36 PM, Lars Volker <[email protected]> wrote:
>>
>>> There were no objections so I sent out a meeting invite to everyone who
>>> was
>>> on the last invite. If you'd like to participate, too, please reply to
>>> this
>>> email.
>>>
>>> Cheers, Lars
>>>
>>> On Mon, Sep 11, 2017 at 11:06 AM, Ryan Blue <[email protected]>
>>> wrote:
>>>
>>> > That works for me.
>>> >
>>> > On Mon, Sep 11, 2017 at 7:55 AM, Lars Volker <[email protected]> wrote:
>>> >
>>> > > Hi All,
>>> > >
>>> > > I'd like to propose to have the next Parquet Sync on Wednesday, Sep
>>> 13th,
>>> > > at 9am PST. Possible topics would be the pull request to add a page
>>> index
>>> > > to the format, ongoing work on bloom filters.
>>> > >
>>> > > If Wednesday does not work for you, please propose another date and
>>> time.
>>> > > Otherwise I'll send out a MR later today.
>>> > >
>>> > > Cheers, Lars
>>> > >
>>> >
>>> >
>>> >
>>> > --
>>> > Ryan Blue
>>> > Software Engineer
>>> > Netflix
>>> >
>>>
>>
>>
>