Hi Chen,

I think sort order support is currently mostly at the Iceberg spec level.
A user can define a sort order on a table, and ideally a writer should use
that table-level information to decide how to sort the data it writes and
then persist the corresponding sort order id to the data files it produces.
At the moment, though, there is no integration between engines and the
Iceberg library that lets writers record anything other than sort order 0
(unsorted, the default) on data files; and even where that becomes
possible, I think engines are still lacking support for sort order in
general. There are active efforts in Spark to support sort order when
writing, but I am not sure about the other engines. And yes, it should be
the writer's responsibility to ensure the data is indeed sorted before
writing the sort order information to the files.

As for your second question, we don't have that support for now, mostly
because the feature is still under development for the same reasons
mentioned above.
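
To make the first part concrete, here is a minimal sketch of declaring a
sort order through the core Java API (the schema and column names below
are made up for illustration, and it assumes the SortOrder builder in
iceberg-api). Declaring the order only records intent in table metadata;
it does not by itself make a writer produce sorted files or stamp a
non-zero sort order id on them:

    import org.apache.iceberg.Schema;
    import org.apache.iceberg.SortOrder;
    import org.apache.iceberg.types.Types;

    public class SortOrderSketch {
      public static void main(String[] args) {
        // Hypothetical schema, for illustration only.
        Schema schema = new Schema(
            Types.NestedField.required(1, "id", Types.LongType.get()),
            Types.NestedField.optional(2, "event_ts", Types.TimestampType.withZone()));

        // Sort by id ascending, then event_ts descending. This only describes the
        // intended order; a writer still has to sort the rows itself and record
        // the matching sort order id on the data files it commits.
        SortOrder order = SortOrder.builderFor(schema)
            .asc("id")
            .desc("event_ts")
            .build();

        System.out.println(order);
      }
    }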

Thank you,
Yan


On Tue, Mar 16, 2021 at 2:33 PM Chen Song <chen.song...@gmail.com> wrote:

> Thanks Yan. I have a question about sort order support. I saw
> https://iceberg.apache.org/spec/#sorting talking about support on
> sorting. And I found related tickets like #1373
> <https://github.com/apache/iceberg/pull/1373> and #1975
> <https://github.com/apache/iceberg/pull/1975>. However, it is not clear
> to me how this is enforced end to end.
>
>    - Currently, it seems that the sort order info can be persisted in
>    manifests. On data files, how is this enforced? Is it the writer's
>    responsibility to ensure the data is sorted before commit, based on the
>    sort order info defined at the table level?
>    - Assuming data is sorted within each data file, is the Iceberg core
>    reader able to return all data (possibly across partitions) in total
>    sorted order when reading, based on the sort order information stored
>    in manifests?
>
> Essentially, if we want to support sorting on the underlying data when
> reading using the core data API, what are the right and required things to do?
>
> Thanks,
> Chen
>
>
> On Tue, Mar 16, 2021 at 4:05 PM Yan Yan <yyany...@gmail.com> wrote:
>
>> Hi Chen,
>>
>> Here is the doc on remaining tasks for format V2 that I updated with the
>> latest status today, including individual PRs pending review and tasks
>> needed that are V2-blocking:
>> https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit
>> Please feel free to comment/edit as needed.
>>
>> As mentioned in Anton's email, it would be great if more people can
>> review the pending PRs.
>>
>> Thank you!
>> Yan
>>
>>
>> On Tue, Mar 16, 2021 at 8:06 AM Chen Song <chen.song...@gmail.com> wrote:
>>
>>> Thanks for the summary. On the V2 format, is there a google doc to review,
>>> or any sort of backlog of tickets to track?
>>>
>>> Chen
>>>
>>> On Mon, Mar 15, 2021 at 10:34 PM Anton Okolnychyi
>>> <aokolnyc...@apple.com.invalid> wrote:
>>>
>>>> Hey everyone,
>>>>
>>>> Thanks to folks who attended. I added my notes from the last sync.
>>>> Please feel free to add/correct if I missed anything.
>>>>
>>>> Main points
>>>>
>>>>    - Highlights
>>>>       - StreamingOffset for Structured Streaming in Spark
>>>>       - New Actions API
>>>>       - Spark procedure for partial import of existing tables
>>>>       - Subsurface talks are online
>>>>       - Call for papers is open at ApacheCon and Subsurface
>>>>    - Releases
>>>>       - 0.11.1
>>>>          - Waiting for the fix on handling situations when the
>>>>          metastore fails during commit (#2317).
>>>>       - 0.12.0
>>>>          - Should include Spark 3.1 support
>>>>          - V2 format items should be included whenever possible but
>>>>          should not block the release
>>>>          - No new blockers
>>>>          - Ideally, end of March
>>>>    - Table corruption issue (#2317
>>>>    <https://github.com/apache/iceberg/issues/2317>)
>>>>       - We may corrupt tables if the metastore fails during commit and
>>>>       the commit state is unknown. Iceberg may delete files that were
>>>>       actually committed.
>>>>       - A lot of folks have seen this issue.
>>>>       - Parth has shared some thoughts from a discussion they had
>>>>       internally here
>>>>       <https://docs.google.com/document/d/1dN7gZwXmlI6Nl4RToAWgsMIsiJUCRSpfFfIL9Kr8s0k>.
>>>>       - We can handle this issue in two phases:
>>>>          - Don’t corrupt the table (Russell has a PR; a rough sketch of
>>>>          the idea follows these notes)
>>>>          - Avoid duplicated results if operations are blindly retried
>>>>          (can be done in a follow-up PR)
>>>>       - Seems worth including the first part in 0.11.1
>>>>    - V2 format
>>>>       - Open points:
>>>>          - Primary key or row id for upserts
>>>>          - Propagating the sort order id for files on write
>>>>       - Need more reviewers
>>>>    - Encryption
>>>>       - Multiple people expressed interest in data encryption.
>>>>       - Existing work by John here
>>>>       <https://github.com/apache/iceberg/pull/1918>.
>>>>       - Ideally, should leverage as much as possible of modular
>>>>       encryption in Parquet 1.12 discussed here
>>>>       <https://github.com/apache/iceberg/issues/1413>.
>>>>       - Agreed to start a thread on the dev list.
>>>>    - CachingCatalog issues (#2319
>>>>    <https://github.com/apache/iceberg/issues/2319>)
>>>>       - The current behavior leads to stale data if multiple sessions
>>>>       are used.
>>>>       - No ideal solution due to Spark limitations. Agreed to discuss
>>>>       in the issue.
>>>>    - Multi-table transactions
>>>>       - Jacques has proposed an API here
>>>>       <https://github.com/apache/iceberg/pull/1849> and is about to
>>>>       start working on an implementation.
>>>>       - Agreed to collaborate on the dev list. More eyes would be
>>>>       great.
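>>>>
>>>> Regarding the table corruption item (#2317) above, here is a rough sketch
>>>> of the first phase (not the actual PR): only clean up newly written files
>>>> when the commit definitively failed, and leave them alone when the
>>>> metastore's outcome is unknown. `deleteUncommittedFiles` is a made-up
>>>> helper, and the exception names simply follow Iceberg's usual naming; the
>>>> snippet assumes org.apache.iceberg.TableOperations / TableMetadata and the
>>>> exceptions package, so treat it as an illustration of the idea only.
>>>>
>>>>   // Sketch only: do not treat an unknown commit outcome as a failure.
>>>>   void commitWithSafeCleanup(TableOperations ops, TableMetadata base, TableMetadata updated) {
>>>>     try {
>>>>       ops.commit(base, updated);
>>>>     } catch (CommitFailedException e) {
>>>>       // The swap definitely did not happen, so deleting the new files is safe.
>>>>       deleteUncommittedFiles(updated);  // hypothetical cleanup helper
>>>>       throw e;
>>>>     } catch (RuntimeException e) {
>>>>       // Unknown outcome (e.g. the metastore timed out after receiving the
>>>>       // request): the new metadata may already be live, so keep the files
>>>>       // and surface the uncertainty instead of cleaning up.
>>>>       throw new CommitStateUnknownException(e);
>>>>     }
>>>>   }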
>>>>
>>>>
>>>> The link to the doc:
>>>> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg
>>>>
>>>> Thanks,
>>>> Anton
>>>>
>>>
>>>
>>> --
>>> Chen Song
>>>
>>>
>
> --
> Chen Song
>
>
