Re: parquet sync starting now

Marcel Kornacker Thu, 23 Feb 2017 15:06:44 -0800

Yes, that sounds like a good idea.


On Thu, Feb 23, 2017 at 2:16 PM, Wes McKinney <[email protected]> wrote:
> I made some comments about sharing C++ code more generally amongst
> Impala, Kudu, Parquet, and Arrow.
>
> There's a significant amount of byte and bit processing code that
> should have little coupling to the Impala or Kudu runtime:
>
> - SIMD algorithms for hashing
> - RLE encoding
> - Dictionary encoding
> - Bit packing and unpacking (we actually had a contribution to
> parquet-cpp from Daniel Lemire on this)
>
> Since Impala's Parquet scanner is tightly coupled to its in-memory
> data structures, using the Parquet reading and writing classes in
> parquet-cpp would require more careful analysis. The sharing of
> generic algorithms and SIMD utilities seems less controversial to me.
>
> Since Arrow is more of a library to be linked into other projects
> (e.g. parquet-cpp links against libarrow and uses its headers), and
> Arrow needs to do all of these things as well as Parquet, we're planning
> to migrate this code to the Arrow codebase. So it might make sense for
> Arrow to be the place to assemble generic vectorized processing code,
> then link libarrow.a into parquet-cpp, Impala, and Kudu. I can help
> with as much of the legwork as possible with this, and I think all of
> our projects would benefit from the unification of efforts and unit
> testing / benchmarking.
>
> Thanks
> Wes
>
> On Thu, Feb 23, 2017 at 4:46 PM, Marcel Kornacker <[email protected]> wrote:
>> Regarding timestamp with timezone: I'm not sure whether the SQL
>> standard requires the timezone to be stored along with the timestamp
>> for 'timestamp with timezone' (at least Oracle and Postgres diverge on
>> that topic).
>>
>> Cc'ing Greg Rahn to shed some more light on that.
>>
>> Regarding 'make Impala depend on parquet-cpp': could someone expand on
>> why we want to do this? There probably is overlap between
>> Impala/Kudu/parquet-cpp, but the runtime systems of the first two have
>> specific requirements (and are also different from each other), so
>> trying to unify this into parquet-cpp seems difficult.
>>
>> On Thu, Feb 23, 2017 at 11:22 AM, Julien Le Dem <[email protected]> wrote:
>>>  Attendees/agenda:
>>> - Nandor, Zoltan (Cloudera/file formats)
>>> - Lars (Cloudera/Impala): Statistics progress
>>> - Uwe (Blue Yonder): Parquet cpp RC. Int96 timestamps
>>> - Wes (twosigma): parquet cpp rc. 1.0 Release
>>> - Julien (Dremio): parquet metadata. Statistics.
>>> - Deepak (HP/Vertica): Parquet-cpp
>>> - Kazuaki:
>>> - Ryan was excused :)
>>>
>>> Note:
>>>  - Statistics: https://github.com/apache/parquet-format/pull/46
>>>    - Impala is waiting for parquet-format to settle on the format to
>>> finalize their implementation.
>>>    - Action: Julien to follow up with Ryan on the PR
>>>
>>>  - Int96 timestamps: https://github.com/apache/parquet-format/pull/49
>>> (needs Ryan's feedback)
>>>    - format is nanosecond level timestamp from midnight (64 bits) followed
>>> by number of days (32 bits)
>>>    - it sounds like int96 ordering is different from natural byte array
>>> ordering because days is last in the bytes
>>>    - discussion about swapping bytes:
>>>       - format dependent on the boost library used
>>>       - there could be performance concerns in Impala against changing it
>>>       - there may be a separate project in impala to swap the bytes for
>>> kudu compatibility.
>>>    - discussion about deprecating int96:
>>>      - need to be able to read them always
>>>      - no need to define ordering if we have a clear replacement
>>>      - Need to clarify the requirement for an alternative.
>>>      - int64 could be enough; it sounds like nanosecond granularity might
>>> not be needed.
>>>    - Julien to create JIRAs:
>>>      - int96 ordering
>>>      - int96 deprecation, replacement.
>>>
>>> - extra timestamp logical type:
>>>  - floating timestamp: (no TZ stored; up to the reader to interpret TS
>>> based on their TZ)
>>>     - this would be better for following sql standard
>>>     - Julien to create JIRA
>>>  - timestamp with timezone (per SQL):
>>>     - each value has timezone
>>>     - TZ can be different for each value
>>>     - Julien to create JIRA
>>>
>>>  - parquet-cpp 1.0 release
>>>    - Uwe to update release script in master.
>>>    - Uwe to launch a new vote with new RC
>>>
>>>  - make impala depend on parquet-cpp
>>>   - duplication between parquet/impala/kudu
>>>   - need to measure level of overlap
>>>   - Wes to open JIRA for this
>>>   - also need an "apache commons for c++" for SQL type operations:
>>>      -> could be in arrow
>>>
>>>   - metadata improvements.
>>>    - add page level metadata in footer
>>>    - page skipping.
>>>    - Julien to open JIRA.
>>>
>>>  - add version of the writer in the footer (more precise than current).
>>>    - Zoltan to open Jira
>>>    - possibly add bitfield for bug fixes.
>>>
>>> On Thu, Feb 23, 2017 at 10:01 AM, Julien Le Dem <[email protected]> wrote:
>>>
>>>> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>>>>
>>>> --
>>>> Julien
>>>>
>>>
>>>
>>>
>>> --
>>> Julien
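
The int96 ordering point in the notes above (nanoseconds since midnight in 64 bits, followed by the day number in 32 bits) can be illustrated with a minimal sketch. The little-endian byte order, the signed field types, and the helper names `pack_int96`, `unpack_int96` and `chronological_key` are assumptions made for illustration; they are not taken from the thread or from any Parquet library.

```python
import struct
from typing import Tuple

# Assumed layout (hypothetical, for illustration): 8 bytes of nanoseconds
# since midnight followed by 4 bytes of day number, both little-endian.
INT96_LAYOUT = "<qi"


def pack_int96(days: int, nanos_of_day: int) -> bytes:
    """Pack a timestamp into 12 bytes: nanoseconds first, days last."""
    return struct.pack(INT96_LAYOUT, nanos_of_day, days)


def unpack_int96(raw: bytes) -> Tuple[int, int]:
    """Return (days, nanos_of_day) from a 12-byte value."""
    nanos_of_day, days = struct.unpack(INT96_LAYOUT, raw)
    return days, nanos_of_day


def chronological_key(raw: bytes) -> Tuple[int, int]:
    """Sort key that compares the day number first, then the nanoseconds."""
    return unpack_int96(raw)


# 'earlier' is on day 100, 'later' is on day 101, so 'earlier' precedes
# 'later' in time. Its nanosecond field is larger, though, and that field
# occupies the leading bytes.
earlier = pack_int96(days=100, nanos_of_day=255)
later = pack_int96(days=101, nanos_of_day=1)

values = [earlier, later]
byte_sorted = sorted(values)                         # plain byte-array order
time_sorted = sorted(values, key=chronological_key)  # order by (days, nanos)
```

Here plain byte comparison puts `later` before `earlier`, because the comparison is decided by the leading nanosecond bytes and never reaches the day field. Min/max statistics computed with byte-array ordering would therefore be misleading for values laid out like this, which is why a defined ordering (or a replacement type) comes up in the notes.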
