Re: Parquet sync starting now

Wes McKinney Mon, 14 Aug 2017 20:59:40 -0700

I have not taken a look at the performance of different compression
algorithms yet. Are there any example datasets that anyone would like
to see statistics for? Otherwise I will generate some high and low
entropy datasets with dictionary encoding disabled (so that the
compression is handled more by the byte compressors than by
dictionaries).
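
A minimal sketch of that kind of comparison, under stated assumptions: the
standard-library codecs (zlib, bz2, lzma) stand in for Brotli, LZ4, and
Zstandard, which need third-party packages, and the buffers are raw bytes
rather than Parquet column chunks. The helper names are hypothetical.

```python
import bz2
import lzma
import os
import zlib

# Assumption: stdlib codecs used as stand-ins for Brotli / LZ4 / Zstandard.
CODECS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def make_high_entropy(size: int) -> bytes:
    """Random bytes: close to incompressible for any byte compressor."""
    return os.urandom(size)


def make_low_entropy(size: int) -> bytes:
    """A short repeated pattern: highly compressible."""
    pattern = b"parquet,"
    return (pattern * (size // len(pattern) + 1))[:size]


def compression_ratios(data: bytes) -> dict:
    """Return uncompressed_size / compressed_size for each codec."""
    return {name: len(data) / len(fn(data)) for name, fn in CODECS.items()}


if __name__ == "__main__":
    size = 1 << 20  # 1 MiB per dataset
    datasets = {
        "high entropy": make_high_entropy(size),
        "low entropy": make_low_entropy(size),
    }
    for label, data in datasets.items():
        for name, ratio in compression_ratios(data).items():
            print(f"{label:>12}  {name:<5} ratio = {ratio:.2f}")
```

Ratios alone do not settle the question; compression and decompression
throughput would need to be timed on the same buffers.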


On Fri, Aug 11, 2017 at 8:27 PM, Julien Le Dem <[email protected]> wrote:
> Sorry for the delay. See notes below.
> I'm on vacation next week, and Lars will send an invitation for the next sync
> on August 16th.
> Pooja will talk about her work on page indices.
> Here are the notes from last sync:
>
> Parquet Sync Aug 2 2017
>
>
> Anna (Cloudera):
>
> Deepak (Vertica): timestamp format
>
> Jim (Cloudera): Bloom filters
>
> Lars (Cloudera Impala): feedback on Brotli, Pooja’s file indexes
>
> Marcel: index page proposal
>
> Ryan (Netflix): Merge
>
> Zoltan (Cloudera Budapest)
>
> JunJie (Intel): Bloom Filter.
>
> Julien: Bloom Filters
>
>
> Bloom Filters:
>
>  - to be efficient, needs 1 byte per distinct value.
>
>    - useful if there are many MDVS that are bigger than 1 byte (for example, UUIDs)
>
>  - Benchmarking:
>
>    - difficulty enabling dictionary filtering in Hive and Spark SQL:
> https://issues.apache.org/jira/browse/PARQUET-1061
>
>       - Ryan to follow up on how to configure it
>
>  - hashing discussion:
>
>    - We will use a block-based hashing algorithm.
>
>    - false positive > 00.1%
>
>    - Definition of hash function:
>
>       - currently has only one (Murmur3).
>
>       - TODO: define metadata using union to allow for other hash functions
> in the future
>
>       - TODO: clarify what variation of Murmur3 we are using.
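
A minimal sketch of the block-based idea discussed above, not the layout
chosen for Parquet. Assumptions, all hypothetical: BLAKE2b from hashlib
stands in for Murmur3 (which is not in the Python standard library), each
block is 256 bits, and each value sets 8 probe bits inside a single block.

```python
import hashlib


class BlockedBloomFilter:
    """Bloom filter where every value touches exactly one fixed-size block."""

    BLOCK_BITS = 256   # assumption: block size in bits
    PROBE_BITS = 8     # assumption: bits set per value within its block

    def __init__(self, num_blocks: int) -> None:
        if num_blocks < 1:
            raise ValueError("num_blocks must be positive")
        self.num_blocks = num_blocks
        # Each block is stored as one Python int used as a 256-bit bitmap.
        self.blocks = [0] * num_blocks

    def _block_and_mask(self, value: bytes):
        # Stand-in hash (BLAKE2b, 16 bytes); the notes name Murmur3 instead.
        digest = hashlib.blake2b(value, digest_size=16).digest()
        # First 8 bytes choose the block.
        block_index = int.from_bytes(digest[:8], "little") % self.num_blocks
        # Each of the next 8 bytes (0..255) picks one bit inside the block.
        mask = 0
        for byte in digest[8:8 + self.PROBE_BITS]:
            mask |= 1 << (byte % self.BLOCK_BITS)
        return block_index, mask

    def add(self, value: bytes) -> None:
        block_index, mask = self._block_and_mask(value)
        self.blocks[block_index] |= mask

    def might_contain(self, value: bytes) -> bool:
        """False means definitely absent; True means possibly present."""
        block_index, mask = self._block_and_mask(value)
        return (self.blocks[block_index] & mask) == mask


if __name__ == "__main__":
    import uuid

    bloom = BlockedBloomFilter(num_blocks=64)
    keys = [uuid.uuid4().bytes for _ in range(100)]
    for key in keys:
        bloom.add(key)
    print(all(bloom.might_contain(key) for key in keys))
```

Confining each lookup to one block means a membership test reads a single
small region of memory, at the cost of a somewhat higher false positive
rate than an unblocked filter of the same size.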
>
>
> Index pages:
>
>  - good IO savings by skipping pages.
>
>  - if columns
>
>  - added metadata for position of dictionary location.
>
>  - Next time: presentation of the results.
>
>
> Timestamp Format:
>
>  - Ryan to update the PR with conclusion
>
>
> Feedback on Brotli:
>
>  - why not LZ4 or ZStandard?
>
>  - Wes to try it out and compare in C++
>
>  - Ryan to compare in Java with his datasets.
>
>  - For reference:
>
>    - comparison graphs, including brotli vs. zstd:
> https://gregoryszorc.com/blog/2017/03/07/better-compression-with-zstandard/
>
>    -
> http://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/api/org/apache/hadoop/io/compress/Lz4Codec.html
>
>
> PGP key size:
>
>  - Use a larger PGP key ID to avoid collisions.
>
>
> Github integration:
>
>  - Use new Apache - Github integration to allow admin rights on Github.
>
>  - Start a thread
>
> On Wed, Aug 2, 2017 at 4:28 PM, 俊杰陈 <[email protected]> wrote:
>
>> Hi Julien
>> Do we have meeting minutes for the sync-up? I couldn't hear clearly on the hangout
>> due to a VPN issue from home.
>>
>> 2017-08-03 0:01 GMT+08:00 Julien Le Dem <[email protected]>:
>>
>> > on hangout:
>> > https://hangouts.google.com/hangouts/_/calendar/
>> > anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.k5oikh8rp3ho37qdca3o9jvh04
>> >
>>
>>
>>
>> --
>> Thanks & Best Regards
>>
