Yes, that sounds like a good idea.
On Thu, Feb 23, 2017 at 2:16 PM, Wes McKinney <[email protected]> wrote:
> I made some comments about sharing C++ code more generally amongst
> Impala, Kudu, Parquet, and Arrow.
>
> There's a significant amount of byte and bit processing code that
> should have little coupling to the Impala or Kudu runtime:
>
> - SIMD algorithms for hashing
> - RLE encoding
> - Dictionary encoding
> - Bit packing and unpacking (we actually had a contribution to
>   parquet-cpp from Daniel Lemire on this)
>
> Since Impala's Parquet scanner is tightly coupled to its in-memory
> data structures, using the Parquet reading and writing classes in
> parquet-cpp would require more careful analysis. The sharing of
> generic algorithms and SIMD utilities seems less controversial to me.
>
> Since Arrow is more of a library to be linked into other projects
> (e.g. parquet-cpp links against libarrow and uses its headers), and
> Arrow needs to do all these things as well as Parquet, we're planning
> to migrate this code to the Arrow codebase. So it might make sense for
> Arrow to be the place to assemble generic vectorized processing code,
> then link libarrow.a into parquet-cpp, Impala, and Kudu. I can help
> with as much of the legwork as possible with this, and I think all of
> our projects would benefit from the unification of efforts and unit
> testing / benchmarking.
>
> Thanks
> Wes
>
> On Thu, Feb 23, 2017 at 4:46 PM, Marcel Kornacker <[email protected]> wrote:
>> Regarding timestamp with timezone: I'm not sure whether the SQL
>> standard requires the timezone to be stored along with the timestamp
>> for 'timestamp with timezone' (at least Oracle and Postgres diverge on
>> that topic).
>>
>> Cc'ing Greg Rahn to shed some more light on that.
>>
>> Regarding 'make Impala depend on parquet-cpp': could someone expand on
>> why we want to do this? There probably is overlap between
>> Impala/Kudu/parquet-cpp, but the runtime systems of the first two have
>> specific requirements (and are also different from each other), so
>> trying to unify this into parquet-cpp seems difficult.
>>
>> On Thu, Feb 23, 2017 at 11:22 AM, Julien Le Dem <[email protected]> wrote:
>>> Attendees/agenda:
>>> - Nandor, Zoltan (Cloudera/file formats)
>>> - Lars (Cloudera/Impala): Statistics progress
>>> - Uwe (Blue Yonder): Parquet cpp RC. Int96 timestamps
>>> - Wes (twosigma): parquet cpp rc. 1.0 Release
>>> - Julien (Dremio): parquet metadata. Statistics.
>>> - Deepak (HP/Vertica): Parquet-cpp
>>> - Kazuaki:
>>> - Ryan was excused :)
>>>
>>> Notes:
>>> - Statistics: https://github.com/apache/parquet-format/pull/46
>>>   - Impala is waiting for parquet-format to settle on the format to
>>>     finalize their implementation.
>>>   - Action: Julien to follow up with Ryan on the PR
>>>
>>> - Int96 timestamps: https://github.com/apache/parquet-format/pull/49
>>>   (needs Ryan's feedback)
>>>   - format is nanosecond level timestamp from midnight (64 bits) followed
>>>     by number of days (32 bits)
>>>   - it sounds like int96 ordering is different from natural byte array
>>>     ordering because days is last in the bytes
>>>   - discussion about swapping bytes:
>>>     - format dependent on the boost library used
>>>     - there could be performance concerns in Impala against changing it
>>>     - there may be a separate project in Impala to swap the bytes for
>>>       Kudu compatibility.
>>>   - discussion about deprecating int96:
>>>     - need to be able to read them always
>>>     - no need to define ordering if we have a clear replacement
>>>     - need to clarify the requirements for an alternative.
>>>     - int64 could be enough; it sounds like nanosecond granularity might
>>>       not be needed.
>>>   - Julien to create JIRAs:
>>>     - int96 ordering
>>>     - int96 deprecation, replacement.
>>>
>>> - extra timestamp logical type:
>>>   - floating timestamp: (no TZ stored; up to the reader to interpret the
>>>     TS based on their TZ)
>>>     - this would be better for following the SQL standard
>>>     - Julien to create JIRA
>>>   - timestamp with timezone (per SQL):
>>>     - each value has a timezone
>>>     - TZ can be different for each value
>>>     - Julien to create JIRA
>>>
>>> - parquet-cpp 1.0 release
>>>   - Uwe to update release script in master.
>>>   - Uwe to launch a new vote with new RC
>>>
>>> - make Impala depend on parquet-cpp
>>>   - duplication between parquet/impala/kudu
>>>   - need to measure level of overlap
>>>   - Wes to open JIRA for this
>>>   - also need an "apache commons for c++" for SQL type operations:
>>>     -> could be in arrow
>>>
>>> - metadata improvements.
>>>   - add page level metadata in footer
>>>   - page skipping.
>>>   - Julien to open JIRA.
>>>
>>> - add version of the writer in the footer (more precise than current).
>>>   - Zoltan to open JIRA
>>>   - possibly add bitfield for bug fixes.
>>>
>>> On Thu, Feb 23, 2017 at 10:01 AM, Julien Le Dem <[email protected]> wrote:
>>>
>>>> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>>>>
>>>> --
>>>> Julien
>>>>
>>>
>>> --
>>> Julien
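
For reference, the Int96 ordering point in the notes above (nanoseconds
since midnight in 64 bits, followed by a day count in 32 bits) can be
illustrated with a small sketch. This is only an illustration under
stated assumptions: the notes do not specify byte order, so little-endian
is assumed here, and the helper names pack_int96, unpack_int96 and
chronological_key are hypothetical, not part of parquet-format,
parquet-cpp or Impala.

```python
import struct
from typing import Tuple


def pack_int96(days: int, nanos_of_day: int) -> bytes:
    """Pack a timestamp as 8 bytes of nanoseconds followed by 4 bytes of days.

    Assumption: little-endian byte order (not stated in the meeting notes).
    """
    return struct.pack("<qi", nanos_of_day, days)


def unpack_int96(raw: bytes) -> Tuple[int, int]:
    """Return (days, nanos_of_day) from a 12-byte value."""
    nanos_of_day, days = struct.unpack("<qi", raw)
    return days, nanos_of_day


def chronological_key(raw: bytes) -> Tuple[int, int]:
    """Sort key that compares days first, then nanoseconds within the day."""
    return unpack_int96(raw)


# Two values where the later timestamp has the smaller nanosecond field.
earlier = pack_int96(days=10, nanos_of_day=2)
later = pack_int96(days=11, nanos_of_day=1)

# Plain byte-array ordering looks at the nanosecond bytes first.
byte_order = sorted([earlier, later])
# Chronological ordering looks at the day count first.
time_order = sorted([earlier, later], key=chronological_key)

print("byte order matches time order:", byte_order == time_order)
```

With days stored last, the byte-wise sort puts the later timestamp first
in this example, which is the mismatch the notes describe.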
