Wes: I maintain a Google Calendar invite. People can send me their email address to be notified of the sync. Otherwise I send reminders on the dev list, but it looks like I missed sending an earlier reminder last time.
Cheng: On the Parquet side for vectorization, you can always bypass the assembly and access the column readers directly. Nezih/Ryan/Dan have done some work around this with Presto. Other projects like Drill or Spark have a custom reader based on the column readers. We're discussing making a shared implementation in Parquet itself.

On Mon, May 16, 2016 at 12:32 AM, Xu, Cheng A <[email protected]> wrote:
> Hi,
> It looks like the vectorization work is still in progress, and I'd like to
> support Hive vectorization for Parquet. Is there an early version of Parquet
> with the vectorization feature ready that I could use to continue the work
> on the Hive side? Thank you in advance.
>
> -----Original Message-----
> From: Julien Le Dem [mailto:[email protected]]
> Sent: Friday, May 13, 2016 8:34 AM
> To: [email protected]
> Subject: Re: Parquet sync up
>
> The next sync up will be around Strata London in early June, where I'll
> happen to be. We will do it in the morning Pacific time, evening Europe time.
>
> Notes from this sync:
>
> Attendees:
> - Julien (Dremio)
> - Alex, Piyush (Twitter)
> - Ryan (Netflix)
>
> Parquet 2.0 encodings discussion:
> - JIRA open to finalize encodings: PARQUET-588 (2.0 encodings finalization)
> - Ryan is running experiments to measure efficiency on their data
> - Alex and Piyush are looking at encoding selection strategies: how to
>   pick the best encoding for the data automatically
>
> 1.9 release:
> - Last blocker: PARQUET-400 (readFully() behavior) needs an update from
>   Jason. Piyush could possibly pick it up if Jason is busy.
>
> Brotli integration:
> - Ryan has been working on integrating the Brotli compression algorithm
> - For a compression cost similar to Snappy's, it gives a much better
>   compression ratio
> - Embeds a native library, similar to the Snappy integration
> - Looking into possibly statically linking the native library
> - PR available on parquet-format and parquet-mr
>
> Vectorized read:
> - Toward the end of June we will organize a Parquet vectorized-read
>   hackathon for all interested parties (make yourself known if you are
>   interested; we'll send more details later; remote participation through
>   Hangout is possible)
>
> Lazy projections at runtime:
> - Alex has been looking into a lazy Thrift object for parquet-thrift to
>   minimize assembly cost in existing Scalding jobs that don't declare the
>   columns they need
>
> Next sync will be in the morning PT.
>
> On Thu, May 12, 2016 at 5:42 AM, Deepak Majeti <[email protected]> wrote:
> > I am sorry for missing this meeting as well.
> > My interest is also in improving parquet-cpp reader/writer performance.
> > I will work with Uwe and Wes on this.
> > My other interest is in supporting predicate pushdown. I will work on
> > this in parallel with performance.
> >
> > Thanks!
> >
> > On Thu, May 12, 2016 at 4:05 AM, Uwe Korn <[email protected]> wrote:
> > >> I'm sorry I wasn't able to join today again (traveling). We could
> > >> choose an early Pacific time to make the meeting accessible to
> > >> both Asia and Europe -- I would suggest 8 or 9 AM Pacific.
> > >
> > > 8 or 9 AM PT would work for me (CEST); 4 PM PT is just not manageable.
> > > Also: do we have a calendar where I can see in advance when sync-ups are?
> > >
> > > Currently I'm working on the Parquet integration with Arrow and on
> > > building a Python interface for libarrow-parquet. Once we have a basic
> > > working version, I will look into implementing missing features in
> > > the writer and improving general read/write performance in parquet-cpp.
> > >
> > > Uwe
> > >
> > >> http://timesched.pocoo.org/?date=2016-05-11&tz=pacific-standard-time!,de:berlin,cn:shanghai,us:new-york-city:ny
> > >>
> > >> I did not have much time for Parquet C++ development in the last
> > >> 6 weeks, but I plan to help Uwe complete the writer implementation
> > >> and work toward a more complete Apache Arrow integration (this is
> > >> in progress here:
> > >> https://github.com/apache/arrow/tree/master/cpp/src/arrow/parquet)
> > >>
> > >> Other items of immediate interest:
> > >> - C++ API for the file metadata (read + write)
> > >> - Conda packaging for built artifacts (to make parquet-cpp easier
> > >>   for Python programmers to install portably when the time comes). I
> > >>   got Thrift C++ into conda-forge this week, so this should not be
> > >>   hard now: https://github.com/conda-forge/thrift-cpp-feedstock
> > >> - Expanding the column scan benchmarks (thanks, Uwe, for kickstarting
> > >>   the benchmarking effort!)
> > >> - Perf improvements for the RLE decoder
> > >>
> > >> Thanks,
> > >> Wes
> > >>
> > >> On Wed, May 11, 2016 at 4:04 PM, Julien Le Dem <[email protected]> wrote:
> > >>> The actual hangout URL is
> > >>> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
> > >>>
> > >>> On Wed, May 11, 2016 at 3:57 PM, Julien Le Dem <[email protected]> wrote:
> > >>>> Starting in 5 mins:
> > >>>> https://plus.google.com/hangouts/_/event/parquet_sync_up
> > >>>>
> > >>>> On Wed, May 11, 2016 at 1:53 PM, Julien Le Dem <[email protected]> wrote:
> > >>>>> It is happening at 4 PM PT on Google Hangout:
> > >>>>> https://plus.google.com/hangouts/_/event/parquet_sync_up
> > >>>>>
> > >>>>> (We can do a different time next time, based on timezone preferences.
> > >>>>> Afternoon is better for Asia.
> > >>>>> Morning is better for Europe.)
> > >>>>>
> > >>>>> --
> > >>>>> Julien
> > >>>>
> > >>>> --
> > >>>> Julien
> > >>>
> > >>> --
> > >>> Julien
> >
> > --
> > regards,
> > Deepak Majeti
>
> --
> Julien

--
Julien
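[Editor's note] The encoding-selection work Alex and Piyush mention in Julien's notes amounts to picking a Parquet encoding from statistics of the column data. As a purely hypothetical illustration of the idea (this is not Parquet's actual selection logic; only the encoding names come from the format spec, and the thresholds are invented), a cardinality/delta heuristic could look like:

```python
def choose_encoding(values):
    # Hypothetical heuristic -- thresholds are illustrative, not Parquet's.
    distinct = len(set(values))
    # Low cardinality: a dictionary page plus RLE-encoded indices pays off.
    if distinct <= len(values) // 10:
        return "RLE_DICTIONARY"
    # Integer data with near-constant deltas packs tightly with delta encoding.
    if values and all(isinstance(v, int) for v in values):
        deltas = [b - a for a, b in zip(values, values[1:])]
        if deltas and max(deltas) - min(deltas) <= 16:
            return "DELTA_BINARY_PACKED"
    # Otherwise fall back to plain encoding.
    return "PLAIN"
```

A real selector would work from page-level statistics gathered during the first write pass rather than scanning the values twice.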
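[Editor's note] Wes's last work item refers to the decoder for Parquet's RLE/bit-packing hybrid encoding, used for definition/repetition levels and dictionary indices. A minimal Python sketch of how that format decodes, following the description in the parquet-format spec (an illustration only, not the parquet-cpp implementation):

```python
def read_uleb128(buf, pos):
    # Read an unsigned LEB128 varint (the run header format).
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7

def decode_rle_hybrid(buf, bit_width, count):
    # Decode `count` values from Parquet's RLE / bit-packing hybrid.
    # header LSB = 0 -> RLE run; header LSB = 1 -> bit-packed groups of 8.
    width_bytes = (bit_width + 7) // 8
    out, pos = [], 0
    while len(out) < count:
        header, pos = read_uleb128(buf, pos)
        if header & 1 == 0:
            # RLE run: repeated value stored once in ceil(bit_width/8) bytes.
            run_len = header >> 1
            value = int.from_bytes(buf[pos:pos + width_bytes], "little")
            pos += width_bytes
            out.extend([value] * run_len)
        else:
            # Bit-packed run: `groups` groups of 8 values, packed LSB-first.
            groups = header >> 1
            packed = int.from_bytes(buf[pos:pos + groups * bit_width], "little")
            pos += groups * bit_width
            mask = (1 << bit_width) - 1
            for i in range(groups * 8):
                out.append((packed >> (i * bit_width)) & mask)
    return out[:count]
```

For example, with bit width 3, the bytes `08 07` decode as an RLE run of four 7s, and `03 88 C6 FA` as one bit-packed group holding 0 through 7 (the example from the format spec). The production decoders get their speed from unrolling the bit-unpacking per bit width instead of the generic shift loop above.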
