Can you add me ([email protected]) to the sync Google calendar so I
get notified?
Cheers
Uwe
On 16.05.16 18:20, Julien Le Dem wrote:
Wes: I maintain a google calendar invite. People can send me their email
address to be notified of the sync. Otherwise I send reminders on the dev
list, but it looks like last time I missed sending an earlier reminder.
Cheng: On the Parquet side for vectorization, you can always bypass the
assembly and access the column readers directly. Nezih/Ryan/Dan have some
work done around this with Presto. Other projects like Drill or Spark have
a custom reader based on the column readers. We're discussing making a
shared implementation in Parquet itself.
On Mon, May 16, 2016 at 12:32 AM, Xu, Cheng A <[email protected]> wrote:
Hi,
It looks like the vectorization work is still in progress, and I'd like to
support Hive vectorization for Parquet. Is there an early version of Parquet
with the vectorization feature ready that I could use to continue the work
on the Hive side? Thank you in advance.
-----Original Message-----
From: Julien Le Dem [mailto:[email protected]]
Sent: Friday, May 13, 2016 8:34 AM
To: [email protected]
Subject: Re: Parquet sync up
The next sync up will be around Strata London in early June, where I'll
happen to be. We will do it in the morning Pacific time, evening Europe time.
Notes from this sync:
attendees:
- Julien (Dremio)
- Alex, Piyush (Twitter)
- Ryan (Netflix)
Parquet 2.0 encodings discussion:
- Jira open to finalize encodings: PARQUET-588: 2.0 encodings
finalization.
- Ryan is doing experiments to measure efficiency on their data
- Alex and Piyush are looking at encoding selection strategies: How to
pick the best encoding for the data automatically
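The selection strategy being explored could be sketched as trial-encoding a column chunk with each candidate and keeping the smallest result. A minimal illustration in Python; note the encoders here are simplified stand-ins (a 4-byte plain encoding and a naive run-length encoding), not Parquet's actual PLAIN or RLE/bit-packing formats:

```python
import struct

def plain_encode(values):
    # Stand-in for PLAIN: each 32-bit int as 4 little-endian bytes
    return b"".join(struct.pack("<i", v) for v in values)

def rle_encode(values):
    # Naive run-length encoding: (run_length, value) pairs;
    # NOT Parquet's RLE/bit-packing hybrid, just an illustration
    out = bytearray()
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        out += struct.pack("<Ii", j - i, values[i])
        i = j
    return bytes(out)

def pick_encoding(values):
    # Trial-encode with every candidate, keep the smallest output
    candidates = {"PLAIN": plain_encode(values), "RLE": rle_encode(values)}
    name = min(candidates, key=lambda k: len(candidates[k]))
    return name, candidates[name]

name, _ = pick_encoding([7] * 1000)            # long runs favor RLE
print(name)  # RLE
name, _ = pick_encoding(list(range(1000)))     # distinct values favor PLAIN
print(name)  # PLAIN
```

A production strategy would also weigh decode speed and could sample the data rather than encoding the full chunk, but the size-comparison core is the same.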
1.9 release:
- last blocker: PARQUET-400 (readFully() behavior) needs update from
Jason. Possibly Piyush could pick it up if Jason is busy
Brotli integration:
- Ryan has been working on Brotli compression algorithm integration
- for a similar compression cost to Snappy, a much better compression ratio
- embeds a native library, similar to the Snappy integration
- looking into possibly statically linking the native library
- PR available on parquet-format and parquet-mr
Vectorized read:
- towards end of June we will organize a Parquet vectorized read
hackathon for all parties interested (make yourself known if interested,
we'll send more details later, possible remote participation through
hangout)
Lazy projections at runtime:
- Alex has been looking into a lazy Thrift object for parquet-thrift to
minimize assembly cost in existing Scalding jobs that don't declare the
columns they need.
Next sync will be in the morning PT.
On Thu, May 12, 2016 at 5:42 AM, Deepak Majeti <[email protected]>
wrote:
I am sorry for missing this meeting as well.
My interest is also in improving parquet-cpp reader/writer performance.
I will work with Uwe and Wes on this.
My other interest is in supporting predicate pushdown. I will work on
this in parallel with the performance work.
Thanks!
On Thu, May 12, 2016 at 4:05 AM, Uwe Korn <[email protected]> wrote:
I'm sorry I wasn't able to join today again (traveling). We could
choose an early time Pacific time to make the meeting accessible to
both Asia and Europe -- I would suggest 8 or 9 AM Pacific
8 or 9 am PT would work for me (CEST), 4pm PT is just not manageable.
Also: Do we have a calendar where I can see in advance when sync ups
are?
Currently I'm working on the Parquet integration with Arrow and on
building
a Python interface for libarrow-parquet. Once we have a basic
working version, I will look into implementing missing features in
the writer and improving general read/write performance in parquet-cpp.
Uwe
http://timesched.pocoo.org/?date=2016-05-11&tz=pacific-standard-time!,de:berlin,cn:shanghai,us:new-york-city:ny
I did not have much time for Parquet C++ development in the last
6 weeks, but plan to help Uwe complete the writer implementation
and work toward a more complete Apache Arrow integration (this is
in progress here:
https://github.com/apache/arrow/tree/master/cpp/src/arrow/parquet)
Other items of immediate interest:
- C++ API to the file metadata (read + write)
- Conda packaging for built artifacts (to make parquet-cpp easier
for Python programmers to install portably when the time comes). I
got Thrift C++ into conda-forge this week so this should not be
hard now https://github.com/conda-forge/thrift-cpp-feedstock
- Expanding column scan benchmarks (thanks Uwe for kickstarting the
benchmarking effort!)
- Perf improvements for the RLE decoder
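For context on that last item: Parquet's RLE/bit-packing hybrid encoding prefixes each run with a ULEB128 header whose low bit distinguishes an RLE run (bit 0) from bit-packed groups (bit 1); for an RLE run, the repeated value follows in ceil(bit_width / 8) little-endian bytes. A minimal sketch of decoding the RLE-run case in Python, purely illustrative and not the parquet-cpp implementation:

```python
def read_uleb128(buf, pos):
    # Read an unsigned LEB128 varint starting at pos; return (value, new_pos)
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode_rle_runs(buf, bit_width):
    # Decode only the RLE-run case of the RLE/bit-packing hybrid:
    # header = (run_length << 1) with low bit 0, then the repeated
    # value stored in ceil(bit_width / 8) little-endian bytes.
    width_bytes = (bit_width + 7) // 8
    values, pos = [], 0
    while pos < len(buf):
        header, pos = read_uleb128(buf, pos)
        if header & 1:
            raise NotImplementedError("bit-packed runs not handled in this sketch")
        run_len = header >> 1
        value = int.from_bytes(buf[pos:pos + width_bytes], "little")
        pos += width_bytes
        values.extend([value] * run_len)
    return values

# A run of 300 threes at bit width 3: header = 300 << 1 = 600 -> varint b"\xd8\x04"
vals = decode_rle_runs(b"\xd8\x04\x03", 3)
print(len(vals), vals[0])  # 300 3
```

A real decoder avoids materializing the run as a Python list (it fills a fixed-size output buffer), which is exactly where the perf work tends to land.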
Thanks
Wes
On Wed, May 11, 2016 at 4:04 PM, Julien Le Dem <[email protected]>
wrote:
The actual hangout url is
https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
On Wed, May 11, 2016 at 3:57 PM, Julien Le Dem <[email protected]>
wrote:
starting in 5 mins:
https://plus.google.com/hangouts/_/event/parquet_sync_up
On Wed, May 11, 2016 at 1:53 PM, Julien Le Dem
<[email protected]>
wrote:
It is happening at 4pm PT on google hangout
https://plus.google.com/hangouts/_/event/parquet_sync_up
(we can do a different time next time, based on timezone
preferences.
Afternoon is better for Asia. Morning is better for Europe)
--
Julien
--
regards,
Deepak Majeti