Notes:
Parquet Sync Sept 13 2017:

Lars (Impala Cloudera - CA): want feedback on Puja’s pull request for page
index
Anna (Cloudera - Hungary)
Jim (Cloudera - CA): Bloom Filters
Ryan (Netflix - CA): parquet-cli zstd/lz4 to try out. Parquet format
release, logical type PR.
Junjie (Intel - Shanghai): Bloom filter status
Bikramjeet (Cloudera Impala - CA): clarify specification for column stats
and type for min/max storage
Wes (Twosigma - NY): C++
Julien (CA): patch release of parquet-mr

TZs: GMT-8, GMT-5, GMT+1, GMT+8
Time: 9am (SF), 12pm (NY), 6pm (Budapest), 1am (Shanghai) !

 - Bloom Filter:
- Junjie submitted pull request for parquet-format and parquet-mr. bloom
filter utility + tests.
    - https://github.com/apache/parquet-format/pull/62/files
        - not to be merged right away, but feedback is welcome
    - https://github.com/apache/parquet-mr/pull/425/files
        - move it to package-protected or to tests, so it can be merged
          incrementally without making it public
    - Need review: Ryan, Julien, Jim
- compatibility, integration tests?
    - old compatibility test repo:
https://github.com/Parquet/parquet-compatibility
    - Arrow integration tests:
https://github.com/apache/arrow/tree/master/integration
    - Action: Anna, Lars to follow up with Cloudera

Build: travis-ci broken by a thrift-7 incompatibility with the latest linux
 - parquet-mr should move to thrift-9: PARQUET-1103
 - pin thrift to fixed version in build like in parquet-format.

 - Page Index: https://github.com/apache/parquet-format/pull/63
   - Action: review by end of next week: Julien, Ryan, Marcel
   - TODO (Lars?): move design doc to markdown in parquet-format
   - should add (brief) comments in thrift definition (clarify in review)

 - zstd/lz4:
   - Ryan has a version of parquet-cli working with zstd, lz4 and brotli
for experimentation
   - building with zstd backported was difficult. (provides hadoop jar)
   - anyone interested in running their own tests?
   - Lars to check at Cloudera.
   - Ryan to send out on the list
   - Wes built benchmarking fixtures in C++. TODO: write tests.
   - use some shareable dataset for validation (NY Taxi dataset?).
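For anyone who wants to run their own comparison, here is a rough sketch of
what a small harness could look like. It is not Ryan's parquet-cli build or
Wes's C++ fixtures. The function name `benchmark` and the `CODECS` table are
made up for illustration, and zlib, bz2 and lzma from the Python standard
library stand in for zstd, lz4 and brotli, which need third-party bindings.

```python
import bz2
import lzma
import time
import zlib

# Stand-in codecs (assumption): stdlib compressors replace zstd/lz4/brotli
# so the sketch runs without extra packages. Swap in real bindings as needed.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(data, codecs=CODECS):
    """Compress `data` with each codec; report ratio and compression time."""
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Validate the round trip before trusting any numbers.
        if decompress(packed) != data:
            raise ValueError("round trip failed for " + name)
        results[name] = {
            "ratio": len(data) / len(packed),
            "compress_seconds": elapsed,
        }
    return results


if __name__ == "__main__":
    sample = b"pickup,dropoff,fare\n" * 50000  # placeholder, not a real dataset
    for codec, stats in benchmark(sample).items():
        print(codec, round(stats["ratio"], 1), stats["compress_seconds"])
```

The round-trip check is there because a shared validation dataset only helps
if every codec is confirmed to return the original bytes.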

 - Logical type PR: https://github.com/apache/parquet-format/pull/51
- TODO: feedback
- reviewers: Julien

 - clarification of min max storage:
   - https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L215
   - format of min and max values is the same as defined by the type.
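To make the point above concrete, here is a minimal sketch of writing min and
max with the same byte layout as the column's physical type. It assumes
fixed-width numeric types are stored little-endian, as in PLAIN encoding. The
helper names (`encode_stat`, `decode_stat`, `column_stats`) are hypothetical
and are not taken from parquet-mr or parquet-cpp.

```python
import struct

# Assumption: statistics reuse the little-endian fixed-width layout of the
# physical type. Only numeric types are covered in this sketch.
_FORMATS = {
    "INT32": "<i",
    "INT64": "<q",
    "FLOAT": "<f",
    "DOUBLE": "<d",
}


def encode_stat(physical_type, value):
    """Encode one min or max value as raw bytes for the physical type."""
    return struct.pack(_FORMATS[physical_type], value)


def decode_stat(physical_type, raw):
    """Decode raw statistics bytes back into a Python number."""
    return struct.unpack(_FORMATS[physical_type], raw)[0]


def column_stats(physical_type, values):
    """Return (min_bytes, max_bytes) using Python's numeric ordering."""
    return (
        encode_stat(physical_type, min(values)),
        encode_stat(physical_type, max(values)),
    )


if __name__ == "__main__":
    lo, hi = column_stats("INT32", [7, -3, 42])
    print(lo.hex(), hi.hex())
    print(decode_stat("INT32", lo), decode_stat("INT32", hi))
```

A reader only needs the column's physical type to interpret the bytes, which
is why the notes tie the min/max format to the type.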

- making releases:
  - want a parquet-format release for:
    - logical types (not merged yet)
    - page indexes (not merged yet)
    - sort order (merged)
  - we won’t block on bloom filter. We can make another release as soon as
it is ready.
  - Ryan to run the parquet-format release.
  - need volunteer for parquet-mr release.



On Wed, Sep 13, 2017 at 8:58 AM, Julien Le Dem <[email protected]>
wrote:

> The Parquet sync is starting now at:
> https://meet.google.com/ent-mvhf-twr
>
> On Tue, Sep 12, 2017 at 8:55 PM, Julien Le Dem <[email protected]>
> wrote:
>
>> +1
>>
>> On Mon, Sep 11, 2017 at 8:36 PM, Lars Volker <[email protected]> wrote:
>>
>>> There were no objections so I sent out a meeting invite to everyone who
>>> was
>>> on the last invite. If you'd like to participate, too, please reply to
>>> this
>>> email.
>>>
>>> Cheers, Lars
>>>
>>> On Mon, Sep 11, 2017 at 11:06 AM, Ryan Blue <[email protected]>
>>> wrote:
>>>
>>> > That works for me.
>>> >
>>> > On Mon, Sep 11, 2017 at 7:55 AM, Lars Volker <[email protected]> wrote:
>>> >
>>> > > Hi All,
>>> > >
>>> > > I'd like to propose to have the next Parquet Sync on Wednesday, Sep
>>> 13th,
>>> > > at 9am PST. Possible topics would be the pull request to add a page
>>> index
>>> > > to the format, ongoing work on bloom filters.
>>> > >
>>> > > If Wednesday does not work for you, please propose another date and
>>> time.
>>> > > Otherwise I'll send out a MR later today.
>>> > >
>>> > > Cheers, Lars
>>> > >
>>> >
>>> >
>>> >
>>> > --
>>> > Ryan Blue
>>> > Software Engineer
>>> > Netflix
>>> >
>>>
>>
>>
>
