Re: [DISCUSS] UUID type

2021-09-13 Thread Piotr Findeisen
>>> the user headache, people will probably choose to store values as strings.
>>> Using a string would mean that more than half the value is needlessly
>>> discarded by default in Iceberg lower/upper bounds instead of keeping the
>>> entire value. And since engines don't know what's in the string, the full
>>> value must be used in comparison, which is extra work and extra space.
>>>
>>> Inflated values may not be a problem in some cases. IPv4 addresses are
>>> one case where you could argue that it doesn't matter very much that they
>>> are typically stored as strings. But I expect the use of UUIDs to be common
>>> for ID columns because you can generate them without coordination (unlike
>>> an incrementing ID) and that's a concern because the use as an ID makes
>>> them likely to be join keys.
>>>
>>> If we want the values to be stored as 16-byte fixed, then we need to
>>> make it easy to get the expected string representation in and out, just
>>> like we do with date/time types. I don't think that's specific to any
>>> engine.
>>>
>>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau 
>>> wrote:
>>>
>>>> I think points 1&2 don't really apply since a fixed width binary
>>>> already covers those properties.
>>>>
>>>> It seems like this isn't really a concern of Iceberg but rather a
>>>> cosmetic layer that exists primarily (only?) in Trino. In that case I would
>>>> be inclined to say that Trino should just use custom metadata and a fixed
>>>> binary type. That way you still have the desired UX without exposing those
>>>> extra concepts to Iceberg. It actually feels like better encapsulation,
>>>> imo.
>>>>
>>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I agree with Ryan that it takes some precautions before one can
>>>>> assume uniqueness of UUID values, and that there shouldn't be anything
>>>>> special about UUIDs here.
>>>>> After all, this is just a primitive type, which is commonly used for
>>>>> certain things, but "commonly" doesn't mean "always".
>>>>>
>>>>> The advantages of having a dedicated type are on 3 layers.
>>>>> The compact representation in the file and the compact representation in
>>>>> memory in the query engine are the ones mentioned above.
>>>>>
>>>>> The third layer is usability. Seeing a UUID column, I know what
>>>>> values I can expect, so it's more descriptive than `id char(36)`.
>>>>> This also means I can CREATE TABLE ... AS SELECT uuid(), ... without
>>>>> the need for casting to varchar.
>>>>> It also removes the temptation of casting uuid to varbinary to achieve
>>>>> a compact representation.
>>>>>
>>>>> Thus I think it would be good to have them.
>>>>>
>>>>> Best
>>>>> PF
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue  wrote:
>>>>>
>>>>>> The original reason why I added UUID to the spec was that I thought
>>>>>> there would be opportunities to take advantage of UUIDs as unique values
>>>>>> and to optimize the use of UUIDs. I was thinking about auto-increment ID
>>>>>> fields and how we might do something similar in Iceberg.
>>>>>>
>>>>>> The reason we have thought about removing UUID is that there aren't
>>>>>> as many opportunities to take advantage of UUIDs as I thought. My
>>>>>> original assumption was that we could do things like bucket on UUID
>>>>>> fields or assume that a UUID field has a high NDV. But that's not
>>>>>> necessarily the case when a UUID field is a foreign key, only when it
>>>>>> is used as an identifier or primary key. Before Jack added tracking
>>>>>> for row identifier fields, we couldn't know that a UUID was unique in
>>>>>> a table. As a result, we didn't invest in support for UUID.
>>>>>>
>>>>>> Quick aside: Now that row identifier fields are tracked, we can do
>>>>>> some of these things with the row identifier fields. Engines can assume

Re: [DISCUSS] UUID type

2021-09-17 Thread Piotr Findeisen
Hi Ryan,

Please advise whatever feels more appropriate from your perspective.
From my perspective, we could just go ahead and merge Trino Iceberg support
for UUID, since this is just fulfilling the spec as it is defined today.

Best
PF


On Wed, Sep 15, 2021 at 10:17 PM Ryan Blue  wrote:

> I don't think we necessarily reached consensus, but I think the general
> trend toward the end was to keep support for UUID. Should we start a vote
> to validate consensus?
>
> On Wed, Sep 15, 2021 at 1:15 PM Joshua Howard 
> wrote:
>
>> Just following up on Piotr's message here.
>>
>> Have we converged? I think most people would assume that silence is a
>> vote for the status-quo.
>>
>> On Mon, Sep 13, 2021 at 7:30 AM Piotr Findeisen 
>> wrote:
>>
>>> Hi,
>>>
>>> It seems we converged here that UUID should remain included.
>>> I read this as consensus reached, but that may be subjective. Did we
>>> objectively reach consensus on this?
>>>
>>> From Iceberg project perspective there isn't anything to do, as UUID
>>> already *is* part of the spec (
>>> https://iceberg.apache.org/spec/#schemas-and-data-types).
>>> Trino Iceberg PR adding support for UUID
>>> https://github.com/trinodb/trino/pull/8747 was pending merge while this
>>> conversation was ongoing.
>>>
>>> Best,
>>> PF
>>>
>>>
>>>
>>> On Mon, Aug 2, 2021 at 6:22 AM Kyle B  wrote:
>>>
>>>> Hi Ryan and all,
>>>>
>>>> That sounds like a reasonable rationale for leaving IP address types out. In
>>>> my experience, dedicated IP address types are mostly found in logging tools
>>>> and other things for sysadmins / DevOps etc.
>>>>
>>>> When querying data with IP addresses, I’ve seen it done quite a lot (e.g.
>>>> for security reasons), but the addresses are usually stored as strings or
>>>> manipulated in a UDF.
>>>> They’re not commonly supported types.
>>>>
>>>> I would also draw the line at UUID types.
>>>>
>>>> - Kyle Bendickson
>>>>
>>>> On Jul 30, 2021, at 3:15 PM, Ryan Blue  wrote:
>>>>
>>>> 
>>>> Jacques, you make some good points here. I think my argument about
>>>> usability leading to performance issues is a stronger argument for engines
>>>> than for Iceberg. Still, there are inefficiencies in Iceberg if someone
>>>> chooses to use a string in an engine that doesn't have a UUID type.
>>>>
>>>> Another thing to consider is cross-engine support. If Iceberg removes
>>>> UUID, then Trino would probably translate to fixed[16]. That results in a
>>>> table that's difficult to query in other engines, where people would
>>>> probably choose to store the data as a string. On the other hand, if
>>>> Iceberg keeps the UUID type then integrations would simply translate to the
>>>> UUID string representation before passing data to the other engines.
>>>> While the engines would be using 36-byte values in join keys, the user
>>>> experience issue is fixed and the data is more compact on disk and in
>>>> Iceberg's bounds metadata.
>>>>
>>>> While having a UUID type in Iceberg can't really help engines that
>>>> don't support UUID take advantage of the type at runtime, it does seem
>>>> slightly better to have the UUID type in general since at least one engine
>>>> supports it and it provides the expected user experience with a compact
>>>> representation.
>>>>
>>>> IPv4 addresses are a good thing to think about as well, since most of
>>>> the same arguments apply. If we keep the UUID type, should we also add IPv4
>>>> or IPv6 types? I would probably draw the line at UUID because it helps in
>>>> joins, which are an important operation. IPv4 representations aren't that
>>>> big of an inconvenience unless you need to do IP manipulation, which is
>>>> typically in a UDF and not the query engine. And you can always keep both
>>>> representations in a table fairly inexpensively. Does this sound like a
>>>> valid rationale for having UUID but not IP types?
>>>>
>>>> Ryan
>>>>
>>>> On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau 
>>>> wrote:
>>>>
>>>>> It seems like Spark, Hive, Dremio and Impala all lack UUID as a native
>>>>> type. Which engines are you thinking of that have a native UUID type

Re: [DISCUSS] UUID type

2021-07-29 Thread Piotr Findeisen
Hi,

I agree with Ryan that it takes some precautions before one can assume
uniqueness of UUID values, and that there shouldn't be anything special
about UUIDs here.
After all, this is just a primitive type, which is commonly used for
certain things, but "commonly" doesn't mean "always".

The advantages of having a dedicated type are on 3 layers.
The compact representation in the file and the compact representation in
memory in the query engine are the ones mentioned above.
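
To make the size difference concrete, here is a minimal sketch in plain Java
(not Iceberg's internal conversion code) of how a UUID round-trips through a
16-byte value, while its canonical string form takes 36 bytes:

import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidBytes {
  // pack the two 64-bit halves of a UUID into a 16-byte array (fixed[16])
  static byte[] toBytes(UUID uuid) {
    ByteBuffer buf = ByteBuffer.allocate(16);
    buf.putLong(uuid.getMostSignificantBits());
    buf.putLong(uuid.getLeastSignificantBits());
    return buf.array();
  }

  // rebuild the UUID, and thus its canonical string, from the 16 bytes
  static UUID fromBytes(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    return new UUID(buf.getLong(), buf.getLong());
  }

  public static void main(String[] args) {
    UUID id = UUID.randomUUID();
    System.out.println(toBytes(id).length);     // 16
    System.out.println(id.toString().length()); // 36
  }
}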

The third layer is usability. Seeing a UUID column, I know what values I
can expect, so it's more descriptive than `id char(36)`.
This also means I can CREATE TABLE ... AS SELECT uuid(), ... without the
need for casting to varchar.
It also removes the temptation of casting uuid to varbinary to achieve a
compact representation.

Thus I think it would be good to have them.

Best
PF



On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue  wrote:

> The original reason why I added UUID to the spec was that I thought there
> would be opportunities to take advantage of UUIDs as unique values and to
> optimize the use of UUIDs. I was thinking about auto-increment ID fields
> and how we might do something similar in Iceberg.
>
> The reason we have thought about removing UUID is that there aren't as
> many opportunities to take advantage of UUIDs as I thought. My original
> assumption was that we could do things like bucket on UUID fields or assume
> that a UUID field has a high NDV. But that's not necessarily the case when
> a UUID field is a foreign key, only when it is used as an identifier
> or primary key. Before Jack added tracking for row identifier fields, we
> couldn't know that a UUID was unique in a table. As a result, we didn't
> invest in support for UUID.
>
> Quick aside: Now that row identifier fields are tracked, we can do some of
> these things with the row identifier fields. Engines can assume that the
> tuple of row identifier fields is unique in a table for join estimation.
> And engines can use row identifier fields in sort keys to ensure lots of
> partition split locations (this is really important for Spark).
>
> Coming back to UUIDs, the second reason to have a UUID type is still
> valid: it is better to represent UUIDs as fixed[16] than as 36-byte UTF-8
> strings that are more than twice as large, or, even worse, UTF-16 strings
> that are 4x as large. Since UUIDs are likely to be used in joins, this
> could really help engines as long as they can keep the values as
> fixed-width binary.
>
> I could go either way on this. I think it is valuable to have a compact
> representation for UUIDs rather than using the string representation. But
> that will require investing in the type and building support in engines
> that won't take advantage of it. If Trino can use this, I think it may be
> worth keeping and investing in.
>
> Ryan
>
> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye  wrote:
>
>> Yes I agree with Jacques that fixed binary is what it is in the end. I
>> think it is more about user experience, whether the conversion is done at
>> the user side or Iceberg and engine side. Many people just store UUID as a
>> 36 byte string instead of a 16 byte binary, so with an explicit UUID type,
>> Iceberg can optimize this common use case internally for users. There might
>> be some other benefits I overlooked, but maybe the complication introduced
>> by this type does not really justify the slightly better user experience. I
>> am also on the fence about it.
>>
>> -Jack Ye
>>
>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau 
>> wrote:
>>
>>> What specific arguments are there for it being a first-class type
>>> besides the fact that it exists elsewhere? Is there some kind of optimization Iceberg or an
>>> engine could do if it was typed versus just a bucket of bits? Fixed width
>>> binary seems to cover the cases I see in terms of actual functionality in
>>> the iceberg libraries or engines…
>>>
>>>
>>>
>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan  wrote:
>>>
 One conversation I came across regarding UUID deprecation was
 from https://github.com/apache/iceberg/pull/1611

 Thanks,
 Yan

 On Tue, Jul 27, 2021 at 1:07 PM Peter Vary 
 wrote:

> Hi Joshua,
>
> I do not have a strong preference about the UUID type, but I would
> like to highlight that the type is handled inconsistently in Iceberg with
> different file formats. (See:
> https://github.com/apache/iceberg/issues/1881)
>
> If we keep the type, it would be good to standardize the handling in
> every file format.
>
> Thanks, Peter
>
> On Tue, 27 Jul 2021, 17:08 Joshua Howard, 
> wrote:
>
>> Hi.
>>
>> UUID is a current data type according to the Iceberg spec (
>> https://iceberg.apache.org/spec/#primitive-types), but there seems
>> to have been some discussion about removing it? I could not find the
>> original discussion, but a reference to the discussion can be found here 
>> (

Re: [DISCUSS] Moving to apache-iceberg Slack workspace

2021-07-29 Thread Piotr Findeisen
Ryan,

you should be able to create a semi-permanent invite link.
In Trino, we are almost successful with a static link (see
https://trino.io/slack.html);
admittedly, it still breaks on some occasions (not very often). I would
imagine a custom sign-up form could have some issues from time to time as
well, though.

Having said that, I haven't seen an invite link for the apache-iceberg Slack
workspace yet, unless "https://join.slack.com/t/apache-iceberg/" was
supposed to be it.
For Trino Slack the link is much longer and looks like
"https://join.slack.com/t/trinodb/shared_invite/zt-sbw8lqer-DL3zksjpm6fyIsSQz74MBA".

Best,
PF



On Wed, Jul 28, 2021 at 11:29 PM Ryan Blue  wrote:

> That's a great idea, Jacques. I agree that the Slack community should be
> controlled by the PMC.
>
> On Wed, Jul 28, 2021 at 10:41 AM Jacques Nadeau 
> wrote:
>
>> My one recommendation would be that if you go off Apache infra, you
>> make sure all the PMC members are admins of the new account.
>>
>> On Wed, Jul 28, 2021 at 8:35 AM Ryan Blue  wrote:
>>
>>> No problem, the site hasn't been deployed yet.
>>>
>>> I've also looked a bit into the invite issue for the apache-iceberg
>>> community and I think we should be able to set something up so that people
>>> can invite themselves. Thanks for raising this as an issue, Piotr. I didn't
>>> realize that the shared invite URLs expire quickly. We'll either keep that
>>> refreshed or set up a self-invite form.
>>>
>>> On Wed, Jul 28, 2021 at 8:23 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> +1 Also I merged the change, sorry if that was premature.
>>>>
>>>> On Jul 28, 2021, at 10:19 AM, Eduard Tudenhoefner 
>>>> wrote:
>>>>
>>>> +1 moving to the apache-iceberg <https://apache-iceberg.slack.com/>
>>>> slack workspace
>>>>
>>>> On Wed, Jul 28, 2021 at 5:16 PM Ryan Blue  wrote:
>>>>
>>>>> Yes, we can update the Slack link. But we want to give everyone a
>>>>> chance to speak up either in support or not before we make changes to how
>>>>> we recommend communicating in the community.
>>>>>
>>>>> If you're in favor of moving to the apache-iceberg Slack space, please
>>>>> speak up!
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Wed, Jul 28, 2021 at 12:51 AM Eduard Tudenhoefner <
>>>>> edu...@dremio.com> wrote:
>>>>>
>>>>>> Could we just update the slack link to
>>>>>> https://join.slack.com/t/apache-iceberg/ on the website (see PR#2882
>>>>>> <https://github.com/apache/iceberg/pull/2882>)?
>>>>>>
>>>>>> On Wed, Jul 28, 2021 at 7:13 AM Jack Ye  wrote:
>>>>>>
>>>>>>> Any updates on this? Given the currently broken
>>>>>>> invitation link, I think we should move ASAP.
>>>>>>>
>>>>>>> -Jack Ye
>>>>>>>
>>>>>>> On Tue, Jul 27, 2021 at 2:15 AM Piotr Findeisen <
>>>>>>> pi...@starburstdata.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I don't have an opinion on which Slack workspace this is, as long as
>>>>>>>> it's easy to join.
>>>>>>>> A manual joining process is not healthy, for sure.
>>>>>>>> Btw, the apache-iceberg workspace is currently limited to @apache.org
>>>>>>>> emails, which some people do not have (e.g. I do not).
>>>>>>>> Will you be sharing an invite link or something?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> PF
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Jul 25, 2021 at 9:48 PM Ryan Blue  wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> A few weeks ago, we talked about whether to use a Slack workspace
>>>>>>>>> for Iceberg or whether to continue using the ASF's workspace. At the 
>>>>>>>>> time,
>>>>>>>>> we thought it was possible for anyone to invite themselves to the 
>>>>>>>>> ASF's
>>>>>>>>> Slack, so we thought it would be best to use the common one. But 
>>>>>>>>> after the
>>>>>>>>> self-invite link broke recently, we found out that ASF infra doesn't 
>>>>>>>>> want
>>>>>>>>> to fix the link anymore because of spammers, which leaves us without 
>>>>>>>>> a way
>>>>>>>>> for people to join us on Slack without requesting an invite.
>>>>>>>>>
>>>>>>>>> I think that requesting an invite is too much friction for someone
>>>>>>>>> that wants to join this community, so I propose we reconsider our 
>>>>>>>>> previous
>>>>>>>>> decision and move to apache-iceberg.slack.com.
>>>>>>>>>
>>>>>>>>> Any objections?
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>>
>>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>


Re: Proposal: Support for views in Iceberg

2021-07-29 Thread Piotr Findeisen
Hi Anjali,

That's a nice summary.

Re the dialect field: it shouldn't be much trouble to have it (or any other
way to identify the application that created the view), and it might be useful.
Why not make it required from the start?

re "expanded/resolved SQL" -- i don't understand yet what we would put
there, so cannot comment.

I agree it's nice to get something out of the door, and I see how the
current proposal fits some needs already.
However, I am concerned about a proliferation of non-cross-engine-compatible
views if we do that.

Also, if we later agree on any compatible approach (portable subset of SQL,
engine-agnostic IR, etc.), then from the perspective of each engine, it
would be a breaking change.
Unless we make the compatible approach as expressive as the full power of SQL,
some views that are possible to create in v1 will not be possible to create
in v2.
Thus, if v1 is "some SQL" and v2 is "something awesomely compatible", we
may not be able to roll it out.

> the convention of common SQL has been working for a majority of users.
SQL features commonly used are column projections, simple filter
application, joins, grouping and common aggregate and scalar function. A
few users occasionally would like to use Trino or Spark specific functions
but are sometimes able to find a way to use a function that is common to
both the engines.


It's an awesome summary of what constructs are necessary to be able to
define useful views while also keeping them portable.

To be able to express column projections, simple filter application, joins,
grouping, and common aggregate and scalar functions in a structured IR, how
much effort do you think would be required?
We didn't really talk about the downsides of a structured approach, other
than that it looks complex.
If we indeed estimate it as a multi-year effort, I wouldn't argue for it.
Maybe I was overly optimistic, though.
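
For illustration only, a tiny fragment of what such an IR could look like for
SELECT a FROM db.t WHERE b > 0 (entirely hypothetical; no such format exists
in any spec):

{
  "project": {
    "columns": ["a"],
    "input": {
      "filter": {
        "predicate": {"gt": ["b", 0]},
        "input": {"scan": {"table": "db.t"}}
      }
    }
  }
}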


As Jack mentioned, for an engine-specific approach that's not supposed to be
consumed by multiple engines, we may be better served with an approach that's
outside of the Iceberg spec, like https://github.com/trinodb/trino/pull/8540.


Best,
PF





On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood
 wrote:

> Hi,
>
> Thank you for all the comments. I will try to address them all here
> together.
>
>
>- @all Cross engine compatibility of view definition: Multiple options
>such as engine agnostic SQL or IR of some form have been mentioned. We can
>all agree that all of these options are non-trivial to design/implement
>(perhaps a multi-year effort based on the option chosen) and merit further
>discussion. I would like to suggest that we continue this discussion but
>target this work for the future (v2?). In v1, we can add an optional
>dialect field and an optional expanded/resolved SQL field that can be
>interpreted by engines as they see fit. V1 can unlock many use cases where
>the views are either accessed by a single engine or multi-engine use cases
>where a (common) subset of SQL is supported. This proposal allows for
>desirable features such as versioning of views and a common format of
>storing view metadata while allowing extensibility in the future. *Does
>anyone feel strongly otherwise?*
>- @Piotr  As for common views at Netflix, the restrictions on SQL are
>not enforced, but are advised as best practices. The convention of common
>SQL has been working for a majority of users. SQL features commonly used
>are column projections, simple filter application, joins, grouping and
>common aggregate and scalar function. A few users occasionally would like
>to use Trino or Spark specific functions but are sometimes able to find a
>way to use a function that is common to both the engines.
>- @Jacques and @Jack Iceberg data types are engine agnostic and hence
>were picked for storing view schema. Thinking further, the schema field
>should be made 'optional', since not all engines require it. (e.g. Spark
>does not need it and Trino uses it only for validation).
>- @Jacques Table references in the views can be arbitrary objects such
>as tables from other catalogs or elasticsearch tables etc. I will clarify
>it in the spec.
>
> I will work on incorporating all the comments in the spec and make the
> next revision available for review soon.
>
> Regards,
> Anjali.
>
>
>
>
> On Tue, Jul 27, 2021 at 2:51 AM Piotr Findeisen 
> wrote:
>
>> Hi,
>>
>> Thanks Jack and Jacques for sharing your thoughts.
>>
>> I agree that tracking dialect/origin is better than nothing.
>> I think having a Map {dialect: sql} is not going to buy us much.
>> I.e. it would be useful if there was some external app (or a human being)
>> that would write those alternative SQ

Re: [DISCUSS] Moving to apache-iceberg Slack workspace

2021-07-29 Thread Piotr Findeisen
Hi,

@Ryan Blue, where can one find at least a temporary
signup link for the apache-iceberg Slack?

Eduard, you can create a link that does not expire. You still need a new
link every 2,000 users that join.
Below is an example of what I see when creating a new invite link.


[image: image.png]



Best
PF





On Thu, Jul 29, 2021 at 12:51 PM Eduard Tudenhoefner 
wrote:

> The default invite link expires after 30 days, which is why I was looking
> for alternatives. Maybe a Slack admin can check if the invite link can be
> configured to not expire.
>
> On Thu, Jul 29, 2021, 11:51 Piotr Findeisen 
> wrote:
>
>> Ryan,
>>
>> you should be able to create a semi-permanent invite link.
>> In Trino, we are almost successful with a static link (see
>> https://trino.io/slack.html);
>> admittedly, it still breaks on some occasions (not very often). I would
>> imagine a custom sign-up form could have some issues from time to time as
>> well, though.
>>
>> Having said that, I haven't seen an invite link for the apache-iceberg Slack
>> workspace yet, unless "https://join.slack.com/t/apache-iceberg/" was
>> supposed to be it.
>> For Trino Slack the link is much longer and looks like
>> "https://join.slack.com/t/trinodb/shared_invite/zt-sbw8lqer-DL3zksjpm6fyIsSQz74MBA".
>> PF
>>
>>
>>
>> On Wed, Jul 28, 2021 at 11:29 PM Ryan Blue  wrote:
>>
>>> That's a great idea, Jacques. I agree that the Slack community should be
>>> controlled by the PMC.
>>>
>>> On Wed, Jul 28, 2021 at 10:41 AM Jacques Nadeau 
>>> wrote:
>>>
>>>> My one recommendation would be that if you go off Apache infra, you
>>>> make sure all the PMC members are admins of the new account.
>>>>
>>>> On Wed, Jul 28, 2021 at 8:35 AM Ryan Blue  wrote:
>>>>
>>>>> No problem, the site hasn't been deployed yet.
>>>>>
>>>>> I've also looked a bit into the invite issue for the apache-iceberg
>>>>> community and I think we should be able to set something up so that people
>>>>> can invite themselves. Thanks for raising this as an issue, Piotr. I 
>>>>> didn't
>>>>> realize that the shared invite URLs expire quickly. We'll either keep that
>>>>> refreshed or set up a self-invite form.
>>>>>
>>>>> On Wed, Jul 28, 2021 at 8:23 AM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> +1 Also I merged the change, sorry if that was premature.
>>>>>>
>>>>>> On Jul 28, 2021, at 10:19 AM, Eduard Tudenhoefner 
>>>>>> wrote:
>>>>>>
>>>>>> +1 moving to the apache-iceberg <https://apache-iceberg.slack.com/>
>>>>>> slack workspace
>>>>>>
>>>>>> On Wed, Jul 28, 2021 at 5:16 PM Ryan Blue  wrote:
>>>>>>
>>>>>>> Yes, we can update the Slack link. But we want to give everyone a
>>>>>>> chance to speak up either in support or not before we make changes to 
>>>>>>> how
>>>>>>> we recommend communicating in the community.
>>>>>>>
>>>>>>> If you're in favor of moving to the apache-iceberg Slack space,
>>>>>>> please speak up!
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Wed, Jul 28, 2021 at 12:51 AM Eduard Tudenhoefner <
>>>>>>> edu...@dremio.com> wrote:
>>>>>>>
>>>>>>>> Could we just update the slack link to
>>>>>>>> https://join.slack.com/t/apache-iceberg/ on the website (see
>>>>>>>> PR#2882 <https://github.com/apache/iceberg/pull/2882>)?
>>>>>>>>
>>>>>>>> On Wed, Jul 28, 2021 at 7:13 AM Jack Ye 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Any updates on this? Given the currently broken
>>>>>>>>> invitation link, I think we should move ASAP.
>>>>>>>>>
>>>>>>>>> -Jack Ye
>>>>>>>>>
>>>>>>>>> On Tue, Jul 27, 2021 at 2:15 AM Piotr Findeisen <
>>>>>>>>> pi...@starburstdata.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,

Re: [DISCUSS] Moving to apache-iceberg Slack workspace

2021-07-29 Thread Piotr Findeisen
Hi,

I was told the screenshot in my previous email doesn't show up, so I am
sharing it as a link instead:
https://gist.github.com/findepi/68e6a141d6ea06049c33e85c5ccd5835#gistcomment-3835460


Best
PF


On Thu, Jul 29, 2021 at 4:13 PM Piotr Findeisen 
wrote:

> Hi,
>
> @Ryan Blue, where can one find at least a temporary
> signup link for the apache-iceberg Slack?
>
> Eduard, you can create a link that does not expire. You still need a new
> link every 2,000 users that join.
> Below is an example of what I see when creating a new invite link.
>
>
> [image: image.png]
>
>
>
> Best
> PF
>
>
>
>
>
> On Thu, Jul 29, 2021 at 12:51 PM Eduard Tudenhoefner 
> wrote:
>
>> The default invite link expires after 30 days, which is why I was looking
>> for alternatives. Maybe a Slack admin can check if the invite link can be
>> configured to not expire.
>>
>> On Thu, Jul 29, 2021, 11:51 Piotr Findeisen 
>> wrote:
>>
>>> Ryan,
>>>
>>> you should be able to create a semi-permanent invite link.
>>> In Trino, we are almost successful with a static link (see
>>> https://trino.io/slack.html);
>>> admittedly, it still breaks on some occasions (not very often). I would
>>> imagine a custom sign-up form could have some issues from time to time as
>>> well, though.
>>>
>>> Having said that, I haven't seen an invite link for the apache-iceberg Slack
>>> workspace yet, unless "https://join.slack.com/t/apache-iceberg/" was
>>> supposed to be it.
>>> For Trino Slack the link is much longer and looks like
>>> "https://join.slack.com/t/trinodb/shared_invite/zt-sbw8lqer-DL3zksjpm6fyIsSQz74MBA".
>>>
>>> Best,
>>> PF
>>>
>>>
>>>
>>> On Wed, Jul 28, 2021 at 11:29 PM Ryan Blue  wrote:
>>>
>>>> That's a great idea, Jacques. I agree that the Slack community should
>>>> be controlled by the PMC.
>>>>
>>>> On Wed, Jul 28, 2021 at 10:41 AM Jacques Nadeau <
>>>> jacquesnad...@gmail.com> wrote:
>>>>
>>>>> My one recommendation would be that if you go off Apache infra,
>>>>> you make sure all the PMC members are admins of the new account.
>>>>>
>>>>> On Wed, Jul 28, 2021 at 8:35 AM Ryan Blue  wrote:
>>>>>
>>>>>> No problem, the site hasn't been deployed yet.
>>>>>>
>>>>>> I've also looked a bit into the invite issue for the apache-iceberg
>>>>>> community and I think we should be able to set something up so that 
>>>>>> people
>>>>>> can invite themselves. Thanks for raising this as an issue, Piotr. I 
>>>>>> didn't
>>>>>> realize that the shared invite URLs expire quickly. We'll either keep 
>>>>>> that
>>>>>> refreshed or set up a self-invite form.
>>>>>>
>>>>>> On Wed, Jul 28, 2021 at 8:23 AM Russell Spitzer <
>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> +1 Also I merged the change, sorry if that was premature.
>>>>>>>
>>>>>>> On Jul 28, 2021, at 10:19 AM, Eduard Tudenhoefner 
>>>>>>> wrote:
>>>>>>>
>>>>>>> +1 moving to the apache-iceberg <https://apache-iceberg.slack.com/>
>>>>>>> slack workspace
>>>>>>>
>>>>>>> On Wed, Jul 28, 2021 at 5:16 PM Ryan Blue  wrote:
>>>>>>>
>>>>>>>> Yes, we can update the Slack link. But we want to give everyone a
>>>>>>>> chance to speak up either in support or not before we make changes to 
>>>>>>>> how
>>>>>>>> we recommend communicating in the community.
>>>>>>>>
>>>>>>>> If you're in favor of moving to the apache-iceberg Slack space,
>>>>>>>> please speak up!
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Wed, Jul 28, 2021 at 12:51 AM Eduard Tudenhoefner <
>>>>>>>> edu...@dremio.com> wrote:
>>>>>>>>
>>>>>>>>> Could we just update the slack link to
>>>>>>>>> https://join.slack.com/t/apache-iceberg/ on the website (see
>>>>>>>>> PR#2882 <https://gi

Re: High memory usage with highly concurrent committers

2021-12-06 Thread Piotr Findeisen
Hi

Igor, does fs.gs.outputstream.upload.chunk.size affect the file size I can
upload?
Can I upload e.g. a 1 GB Parquet file while also setting
fs.gs.outputstream.upload.chunk.size=8388608 (8 MiB)?

Best
PF
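
For context, both keys discussed in this thread are plain Hadoop configuration
properties of the GCS connector; a minimal sketch of setting them (the values
are the ones quoted below):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// shrink the per-stream upload buffer from the ~64 MiB default to 8 MiB
conf.set("fs.gs.outputstream.upload.chunk.size", "8388608");
// or stream straight to GCS with no buffering (but no failed-upload retries)
conf.set("fs.gs.outputstream.direct.upload.enable", "true");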


On Fri, Dec 3, 2021 at 5:33 PM Igor Dvorzhak  wrote:

> No, right now this is a global property for the Hadoop FS instance. You
> either need to use different clients/Hadoop FS instances to write different
> files or switch to the direct upload mode (
> fs.gs.outputstream.direct.upload.enable=true), which could be better for
> your use case (in this write mode nothing is cached in memory and data is
> streamed to GCS directly, but it does not allow failed upload retries), depending on
> the parquet file sizes that you write.
>
> Also, you may want to test how critical 64MiB buffer size is for your
> application, it may be the case that 16MiB, for example, will get you
> desired performance for parquet file writes and good enough memory
> consumption.
>
> But on a broader note, this seems to be one of the reasons why it could be
> good to have a specialized Iceberg GcsFileIO: if the Iceberg API allows, it can
> have separate write configurations optimized for metadata and data files.
>
> On Fri, Dec 3, 2021 at 6:24 AM Mayur Srivastava <
> mayur.srivast...@twosigma.com> wrote:
>
>> Thanks Igor. This may help mitigate the problem.
>>
>>
>>
>> But it looks like it applies to all files. We still want data (parquet)
>> files to allocate 64 MiB (seems reasonable). For metadata, a smaller size
>> is better. Is there a way to set the property based on file suffix or file
>> type?
>>
>>
>>
>> Thanks,
>>
>> Mayur
>>
>>
>>
>> *From:* Igor Dvorzhak 
>> *Sent:* Thursday, December 2, 2021 8:09 PM
>> *To:* dev@iceberg.apache.org
>> *Subject:* Re: High memory usage with highly concurrent committers
>>
>>
>>
>> For each written object GCS connector allocates ~64MiB of memory by
>> default to improve performance of large object writes. If you want to
>> reduce memory utilization in cases when you write many files at once you
>> just need to reduce upload chunk size to 8MiB, for example:
>> fs.gs.outputstream.upload.chunk.size=8388608
>>
>>
>>
>> On Wed, Dec 1, 2021 at 3:20 PM Mayur Srivastava <
>> mayur.srivast...@twosigma.com> wrote:
>>
>> That is correct Daniel.
>>
>>
>>
>> I’ve tried to explain our use of S3FileIO with GCS in the “Supporting
>> gs:// prefix …” thread.
>>
>>
>>
>> Thanks,
>>
>> Mayur
>>
>>
>>
>> *From:* Daniel Weeks 
>> *Sent:* Wednesday, December 1, 2021 11:46 AM
>> *To:* Iceberg Dev List 
>> *Subject:* Re: High memory usage with highly concurrent committers
>>
>>
>>
>> I feel like what Mayur was saying is that S3FileIO actually works with
>> GCS (it appears there is some S3 compatible API for GCS).
>>
>>
>>
>> If that is the case, then S3FileIO can be used natively against GCS,
>> which wouldn't require the ResolvingFileIO (just supporting the GCS URI
>> schemes).
>>
>>
>>
>> This is new to me and I haven't tested this, but Mayur, if this does
>> work, please share how you configured S3FileIO.
>>
>>
>>
>> -Dan
>>
>>
>>
>> On Wed, Dec 1, 2021 at 12:40 AM Jack Ye  wrote:
>>
>> We are in the process of supporting multiple file system schemes using
>> ResolvingFileIO, Ryan just added the initial implementation:
>> https://github.com/apache/iceberg/pull/3593
>>
>>
>>
>> -Jack
>>
>>
>>
>> On Tue, Nov 30, 2021 at 6:41 PM Mayur Srivastava <
>> mayur.srivast...@twosigma.com> wrote:
>>
>> Thanks Ryan.
>>
>>
>>
>> I’m looking at the heapdump. At a preliminary look in jvisualvm, I see
>> the following top two objects:
>>
>> 1.  ‘byte[]’ : 87% of memory usage (>100k instances with a total of
>> 3.2G in one of my tests). I checked some of the references and found that
>> they are from
>> com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.media.MediaHttpUploader
>> <- GoogleHadoopOutputStream <- AvroFileAppender. I also see references
>> coming from WriterBasedJsonGenerator, finalizers
>> (HadoopPositionOutputStream), etc. as well. I’m not familiar with this code,
>> but is it possible that the Hadoop output streams are not closed and close is
>> called in the finalizers?
>>
>> 2.  ‘int[]’ : 12% usage (7k instances), but I can’t expand the
>> references.
>>
>>
>>
>> One interesting finding is that if I switch to S3FileIO, the high
>> memory usage goes away and the memory usage is similar to that of the
>> serialized commits using a lock, which is ~750 MB for 128 parallel
>> committers. And the ~750 MB usage may fall in line with the snapshots and
>> manifest* objects.
>>
>>
>>
>> So, the high memory problem manifests only when using the default
>> HadoopFileSystem.
>>
>>
>>
>> Thanks, Mayur
>>
>>
>>
>> PS: I had to change S3FileIO locally to accept gs:// prefix so that it
>> works with GCS. Is there a plan to support gs:// prefix in the S3URI?
>>
>>
>>
>> *From:* Ryan Blue 
>> *Sent:* Tuesday, November 30, 2021 3:53 PM
>> *To:* Iceberg Dev List 
>> *Subject:* Re: High memory usage with highly concurrent 

Re: Handling pandas.Timestamps in nanos

2021-12-03 Thread Piotr Findeisen
Hi,

I don't know about Pandas, but the question about timestamp precision is
interesting to me nonetheless.
At Starburst, we've had customers asking for nanosecond timestamp precision,
and this drove adding that capability to Trino.
(Actually, picosecond timestamp precision was implemented, but I am not
aware of any use cases explicitly benefiting from the increased precision.
It's still good to be more future-proof, just in case.)

Iceberg currently supports timestamps with microsecond precision.

A quick search over the mailing list brings up a suggested workaround: store
such values as two separate numeric fields (e.g. seconds and nanos-of-second).
This is doable when the number of affected columns is small and when one has
full control over the schema, but it can be hard to sell to analysts
and BI users.
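
As a minimal sketch of that workaround with the Iceberg Java API (field names
and ids are illustrative):

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

// a nanosecond instant split into two fields, since the timestamp type is
// micros-only
Schema schema = new Schema(
    Types.NestedField.required(1, "event_seconds", Types.LongType.get()),
    Types.NestedField.required(2, "event_nanos_of_second", Types.IntegerType.get()));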

Are there plans to add timestamps with nanosecond precision?
Would that be separate timestamp type, or would we rather make timestamp
type parametric?
Or, maybe we just extend maximum precision of the timestamp type?

Regarding implementation considerations: for engines like Trino, it's
beneficial to know the maximum precision of the data, since
microsecond-precision timestamps can be stored as a 64-bit number and thus
handled more efficiently than nanosecond-precision timestamps.

BR
PF



On Fri, Dec 3, 2021 at 9:52 PM Mayur Srivastava <
mayur.srivast...@twosigma.com> wrote:

> Hi,
>
>
>
> Is there a best practice for handling the pandas.Timestamps (or
> numpy.datetime64) in nanos in Iceberg? How are the Python users working
> with timestamps in nanos precision, especially if it is part of the
> PartitionSpec?
>
>
>
> Thanks,
>
> Mayur
>
>
>


Re: Drop table behavior

2021-11-23 Thread Piotr Findeisen
Hi,

When you come from a storage perspective, the current design of 'not
owning' the location makes sense.

However, if you come from a SQL perspective, all this is an impractical
limitation. Analysts and other SQL users want to be able to delete their
data and must have confidence that all of the data is removed.
Failing to do so may expose them to GDPR-related liabilities.

Therefore we should work towards (2). For starters, we should be able to
assume that tables with an implicit location do own their location.
Then we should have an option to validate location ownership for tables
with an explicit location.

I don't know yet how tables with multiple locations fit into this picture,
or tables with manifests in one place and data files in some other places.
SQL users wouldn't create such tables, though.
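
For reference, the operation in question is the purge flavor of dropTable in
the Catalog API; a minimal sketch (catalog construction assumed, table name
illustrative):

import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

class DropExample {
  // purge=true asks the catalog to delete data and metadata files; per the
  // quoted discussion, only files reachable from the latest metadata.json
  // are currently removed
  static void dropWithPurge(Catalog catalog) {
    catalog.dropTable(TableIdentifier.of("db", "table1"), true /* purge */);
  }
}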

BR
PF





On Tue, Nov 23, 2021 at 4:32 AM Jack Ye  wrote:

> +1 for item 1, the fact that we do not remove all data referenced by all
> metadata files seems like a bug to me that should be fixed. The table's
> pointer is already removed in the catalog with no way to rollback, so there
> is no reason for keeping those files around. I don't know if there is any
> historical context for us to only remove data in the latest metadata, maybe
> someone with context could provide more details.
>
> For item 2, I think this aligns with the discussion conclusions in the
> linked issues. At least we can have some flag saying the table location,
> data location and metadata location do not have other table data (a.k.a.
> the table owns those 3 prefixes), and then we can safely do a recursive
> deletion. This also seems to align with the intention for having a LIST API
> in FileIO discussed in https://github.com/apache/iceberg/issues/3212 for
> remove_orphan_files.
>
> -Jack
>
>
> On Mon, Nov 22, 2021 at 6:42 PM Yan Yan  wrote:
>
>> Hi everyone,
>>
>> Does anyone know across catalog implementations, when we drop tables with
>> *purge=true*, why do we only drop last metadata and files referred by
>> it, but not any of the previous metadata? e.g.
>>
>> *create iceberg table1*; <--- metadata.json-1
>> *insert into table1* ...; <--- metadata.json-2
>>
>> when I do *drop table1* after these two commands, `metadata.json-1` will
>> not be deleted. This will also mean if we rollback/compact table and then
>> drop, data files referred by some of the previous metadata files will also
>> not be deleted.
>>
>> I know the community used to talk about table location ownership for file
>> cleanup after dropping table (e.g.
>> https://github.com/apache/iceberg/issues/1764
>> https://github.com/trinodb/trino/issues/5616 ) but I'm not sure if they
>> could completely solve the problem since we can customize metadata/data
>> location, and I think we should still delete the past metadata.json even if
>> the table doesn't own any location.
>>
>> I was thinking about the following items:
>> 1. to make a change to delete past metadata.json files as well when the
>> table is dropped with *purge=true* (small change, doesn't tackle
>> rollback/compaction data files)
>> 2. add configuration regarding table's location ownership, and delete
>> underlying files in drop table if table owns location (more complicated)
>>
>> I think 1 should be relatively safe to do despite that it's a behavior
>> change, but want to run it through the community first.
>>
>> Thanks!
>> Yan
>>
>


Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-01 Thread Piotr Findeisen
Hi

Just curious. S3URI seems AWS S3-specific. What would be the goal of using
S3URI with Google Cloud Storage URLs?
What problem are we solving?

PF


On Wed, Dec 1, 2021 at 4:56 PM Russell Spitzer 
wrote:

> Sounds reasonable to me if they are compatible
>
> On Wed, Dec 1, 2021 at 8:27 AM Mayur Srivastava <
> mayur.srivast...@twosigma.com> wrote:
>
>> Hi,
>>
>>
>>
>> We have URIs starting with gs:// representing objects on GCS. Currently,
>> S3URI doesn’t support gs:// prefix (see
>> https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java#L41).
>> Is there an existing JIRA for supporting this? Any objections to add “gs”
>> to the list of S3 prefixes?
>>
>>
>>
>> Thanks,
>>
>> Mayur
>>
>>
>>
>


Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-01 Thread Piotr Findeisen
If S3FileIO is supposed to be used with other file systems, we should
consider proper class renames.
Just my 2c.

On Wed, Dec 1, 2021 at 5:07 PM Mayur Srivastava <
mayur.srivast...@twosigma.com> wrote:

> Hi,
>
>
>
> We are using S3FileIO to talk to the GCS backend. GCS URIs are compatible
> with the AWS S3 SDKs and if they are added to the list of supported
> prefixes, they work with S3FileIO.
>
>
>
> Thanks,
>
> Mayur
>
>
>
> *From:* Piotr Findeisen 
> *Sent:* Wednesday, December 1, 2021 10:58 AM
> *To:* Iceberg Dev List 
> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3
> Storage
>
>
>
> Hi
>
>
>
> Just curious. S3URI seems AWS S3-specific. What would be the goal of using
> S3URI with Google Cloud Storage URLs?
>
> What problem are we solving?
>
>
>
> PF
>
>
>
>
>
> On Wed, Dec 1, 2021 at 4:56 PM Russell Spitzer 
> wrote:
>
> Sounds reasonable to me if they are compatible
>
>
>
> On Wed, Dec 1, 2021 at 8:27 AM Mayur Srivastava <
> mayur.srivast...@twosigma.com> wrote:
>
> Hi,
>
>
>
> We have URIs starting with gs:// representing objects on GCS. Currently,
> S3URI doesn’t support gs:// prefix (see
> https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java#L41).
> Is there an existing JIRA for supporting this? Any objections to add “gs”
> to the list of S3 prefixes?
>
>
>
> Thanks,
>
> Mayur
>
>
>
>


Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-02 Thread Piotr Findeisen
erg to talk to these backends. The way we create S3FileIO is via
>>> the constructor and provide the S3Client as the constructor param; we do
>>> not use the initialize(Map) method in FileIO. Our custom
>>> catalog accepts the FileIO object at creation time. To talk to GCS, we
>>> create the S3Client with a few overrides (described below) and pass it to
>>> S3FileIO. After that, the rest of the S3FileIO code works as is. The only
>>> exception is that “gs” (used by GCS URIs) needs to be accepted as a valid
>>> S3 prefix. This is the reason I sent the email.
>>>
>>>
>>>
>>> The reason why we want to use S3FileIO to talk to GCS is that S3FileIO
>>> almost works out of the box and contains all the functionality needed to
>>> talk to GCS. The only special requirement is the creation of the S3Client
>>> and allow “gs” prefix in the URIs. Based on our early experiments and
>>> benchmarks, S3FileIO provides all the functionality we need and performs
>>> well, so we didn’t see a need to create a native GCS FileIO. Iceberg
>>> operations that we need are create, drop, read and write objects from S3
>>> and S3FileIO provides this functionality.
>>>
>>>
>>>
>>> We are managing ACLs (IAM in case of GCS) at the bucket level and that
>>> happens in our custom catalog. GCS has ACLs but IAMs are preferred. I’ve
>>> not experimented with ACLs or encryption with S3FileIO and that is a good
>>> question whether it works with GCS. But, if these features are not enabled
>>> via default settings, S3FileIO works just fine with GCS.
>>>
>>>
>>>
>>> I think there is a case for supporting S3-compatible backends in
>>> S3FileIO because a lot of the code is common. The question is whether we
>>> can cleanly expose the common S3FileIO code to work with these backends and
>>> separate out any specialization (if required) OR we want to have a
>>> different FileIO implementation for each of the other S3 compatible
>>> backends such as GCS? I’m eager to hear more from the community about this.
>>> I’m happy to discuss and follow long-term design direction of the Iceberg
>>> community.
>>>
>>>
>>>
>>> The S3Client for GCS is created as follows (currently the code is not
>>> open source so I’m sharing the steps only):
>>>
>>> 1. Create S3ClientBuilder.
>>>
>>> 2. Set GCS endpoint URI and region.
>>>
>>> 3. Set a credentials provider that returns null. You can set credentials
>>> here if you have static credentials.
>>>
>>> 4. Set ClientOverrideConfiguration with interceptors in the
>>> overrideConfiguration(). The interceptors are used to setup authorization
>>> header in requests (setting projectId, auth tokens, etc.) and do header
>>> translation for requests and responses.
>>>
>>> 5. Build the S3Client.
>>>
>>> 6. Pass the S3Client to S3FileIO.
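
A hedged sketch of those six steps against the AWS SDK v2 builder API (the
endpoint value and GcsHeaderInterceptor are illustrative placeholders; the
actual code described above is not public):

import java.net.URI;
import org.apache.iceberg.aws.s3.S3FileIO;
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

S3Client s3 = S3Client.builder()                                     // step 1
    .endpointOverride(URI.create("https://storage.googleapis.com"))  // step 2
    .region(Region.US_EAST_1)               // step 2: required by the builder
    .credentialsProvider(() -> null)        // step 3: no static credentials
    .overrideConfiguration(ClientOverrideConfiguration.builder()
        .addExecutionInterceptor(new GcsHeaderInterceptor())  // step 4 (hypothetical class)
        .build())
    .build();                                                        // step 5
S3FileIO io = new S3FileIO(() -> s3);                                // step 6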
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Mayur
>>>
>>>
>>>
>>> *From:* Jack Ye 
>>> *Sent:* Wednesday, December 1, 2021 1:16 PM
>>> *To:* Iceberg Dev List 
>>> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3
>>> Storage
>>>
>>>
>>>
>>> Hi Mayur,
>>>
>>>
>>>
>>> I know many object storage services have allowed communication using the
>>> Amazon S3 client by implementing the same protocol, like recently the Dell
>>> EMC ECS and Aliyun OSS. But ultimately there are functionality differences
>>> that could be optimized with a native FileIO, and the 2 examples I listed
>>> before both contributed their own FileIO implementations to Iceberg
>>> recently. I would imagine some native S3 features like ACL or SSE to not
>>> work for GCS, and some GCS features to be not supported in S3FileIO, so I
>>> think a specific GCS FileIO would likely be better for GCS support in the
>>> long term.
>>>
>>>
>>>
>>> Could you describe how you configure S3FileIO to talk to GCS? Do you
>>> need to override the S3 endpoint or have any other configurations?
>>>
>>>
>>>
>>> And I am not an expert of GCS, do you see using S3FileIO for GCS as a
>>> feasible long-term solution? Are there any GCS specific features that you
>>> might need and could not be done through S3FileIO, and how widely used are
>>> those features?
>>>
>>>

Re: Proposal: Support for views in Iceberg

2021-07-20 Thread Piotr Findeisen
Hi,

FWIW, in Trino we just added Trino views support.
https://github.com/trinodb/trino/pull/8540
Of course, this is by no means usable by other query engines.

Anjali, your document does not talk much about compatibility between query
engines.
How do you plan to address that?

For example, I am familiar with Coral, and I appreciate its powers for
dealing with legacy stuff like views defined by Hive.
I treat it as a great technology for supporting a transition from one query
engine to a better one.
However, I would not base the design of a new system for storing
cross-engine-compatible views on it.

Is there something else we can use?
Maybe the view definition should use some intermediate structured language
that's not SQL?
For example, it could represent the logical structure of operations in a
semantic manner.
This would eliminate the need for cross-engine-compatible parsing and analysis.

Best
PF



On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray  wrote:

> Thanks Anjali!
>
> I have left some comments on the document. I unfortunately have to miss
> the community meetup tomorrow but would love to chat more/help w/
> implementation.
>
> Best,
> Ryan
>
> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>  wrote:
>
>> Hello,
>>
>> John Zhuge and I would like to propose the following spec for storing
>> view metadata in Iceberg. The proposal has been implemented [1] and is in
>> production at Netflix for over 15 months.
>>
>>
>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>
>> [1]
>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>
>> Please let us know your thoughts by adding comments to the doc.
>>
>> Thanks,
>> Anjali.
>>
>


Re: string bucketing compatibility issue

2021-07-19 Thread Piotr Findeisen
Hi,

I've filed https://github.com/apache/iceberg/issues/2837 for this as well.

Best
PF



On Sat, Jul 17, 2021 at 12:48 AM Piotr Findeisen 
wrote:

> Hi,
>
> It was discovered by @Mateusz Gajewski that Iceberg's bucketing
> transformation for string isn't a regular Murmur3 32-bit hash.
>
> Upon closer investigation we found out that the code:
>
>
> https://github.com/apache/iceberg/blob/0c50b2074cd5dad59bbcb4b4599ec3ae11a34b49/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L239
>
> is affected by Guava issue https://github.com/google/guava/issues/5648
> that causes wrong results for input containing surrogate pairs (Unicode
> codepoints outside of the Basic Multilingual Plane).
>
> Assuming it's indeed a bug and it gets fixed (I posted a PR to Guava with
> the proposed fix), this can cause incorrect query results, since the bucketing
> function definition will effectively change.
>
> This is mostly FYI, unless we can do something more about it.
>
> Best
> PF
>
>


Re: Proposal: Support for views in Iceberg

2021-07-22 Thread Piotr Findeisen
nd translate back
>> and forth between engine-supported SQL and AST
>> 4. Intermediate structured language of our own. (What additional
>> functionality does it provide over Calcite?)
>>
Given that the view metadata is JSON, it is easily extendable to
>> incorporate any new fields needed to make the SQL truly compatible across
>> engines.
>>
>> What do you think?
>>
>> regards,
>> Anjali
>>
>>
>>
>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen 
>> wrote:
>>
>>> Hi,
>>>
>>> FWIW, in Trino we just added Trino views support.
>>> https://github.com/trinodb/trino/pull/8540
>>> Of course, this is by no means usable by other query engines.
>>>
>>> Anjali, your document does not talk much about compatibility between
>>> query engines.
>>> How do you plan to address that?
>>>
>>> For example, I am familiar with Coral, and I appreciate its powers for
>>> dealing with legacy stuff like views defined by Hive.
>>> I treat it as a great technology for supporting a transition from one query
>>> engine to a better one.
>>> However, I would not base the design of a new system for storing
>>> cross-engine-compatible views on it.
>>>
>>> Is there something else we can use?
>>> Maybe the view definition should use some intermediate structured
>>> language that's not SQL?
>>> For example, it could represent the logical structure of operations in a
>>> semantic manner.
>>> This would eliminate the need for cross-engine-compatible parsing and
>>> analysis.
>>>
>>> Best
>>> PF
>>>
>>>
>>>
>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray  wrote:
>>>
>>>> Thanks Anjali!
>>>>
>>>> I have left some comments on the document. I unfortunately have to miss
>>>> the community meetup tomorrow but would love to chat more/help w/
>>>> implementation.
>>>>
>>>> Best,
>>>> Ryan
>>>>
>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>>>>  wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> John Zhuge and I would like to propose the following spec for storing
>>>>> view metadata in Iceberg. The proposal has been implemented [1] and is in
>>>>> production at Netflix for over 15 months.
>>>>>
>>>>>
>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>
>>>>> [1]
>>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>
>>>>> Please let us know your thoughts by adding comments to the doc.
>>>>>
>>>>> Thanks,
>>>>> Anjali.
>>>>>
>>>>


Re: Proposal: Z-Ordering in Iceberg

2021-07-22 Thread Piotr Findeisen
Hi Bhavyam,

Has this been discussed on the sync?
Ryan, will it be making it into the table metadata spec?

Best,
PF

On Wed, Jul 21, 2021 at 1:50 PM Bhavyam Kamal 
wrote:

> Hi Everyone,
>
> I would like to discuss and get feedback on the following proposal for
> Z-Ordering in the Iceberg Sync today:
>
>
> https://docs.google.com/document/d/1UfGxaB7qlrGzzMk9pBm03oKPOkm-jk-NQVQQvHP-0Bc/edit?usp=sharing
>
> Please let me know if you have any thoughts or suggestions by adding
> comments in the doc.
>
> Thanks and regards,
> Bhavyam
>
>


string bucketing compatibility issue

2021-07-16 Thread Piotr Findeisen
Hi,

It was discovered by @Mateusz Gajewski that Iceberg's bucketing
transformation for string isn't a regular Murmur3 32-bit hash.

Upon closer investigation we found out that the code:

https://github.com/apache/iceberg/blob/0c50b2074cd5dad59bbcb4b4599ec3ae11a34b49/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L239

is affected by Guava issue https://github.com/google/guava/issues/5648 that
causes wrong results for input containing surrogate pairs (Unicode
codepoints outside of the Basic Multilingual Plane).

Assuming it's indeed a bug and it gets fixed (I posted a PR to Guava with
the proposed fix), this can cause incorrect query results, since the bucketing
function definition will effectively change.
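
To see the discrepancy concretely, a minimal sketch against the Guava API of
that time (the ints are plain hash values, not Iceberg bucket numbers):

import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

String s = "\uD83D\uDE00"; // U+1F600, a codepoint outside the BMP
// hashing the chars goes through the buggy surrogate-pair handling
int viaChars = Hashing.murmur3_32().hashString(s, StandardCharsets.UTF_8).asInt();
// hashing the UTF-8 bytes directly gives the intended Murmur3 result
int viaBytes = Hashing.murmur3_32().hashBytes(s.getBytes(StandardCharsets.UTF_8)).asInt();
// before the Guava fix, viaChars != viaBytes for inputs with surrogate pairs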

This is mostly FYI, unless we can do something more about it.

Best
PF


Re: [DISCUSS] Distinct count map

2021-07-23 Thread Piotr Findeisen
Hi,

A file-level distinct count (a number) has limited applicability in Trino.
It's useful for pointed queries, where we can prune all the other files
away, but in other cases the Trino optimizer wouldn't be able to make
educated use of it.

Internally, Łukasz and I were talking about sketches like HLL as well,
and I am happy to see them being mentioned here too.
Do you have any design plans for that already?
Did you consider making them part of the file metadata?

Of course, for this to be useful, we would need a well-defined hash
function (we already have one for bucketing purposes), as well as a portable
representation that can be imported by a query engine.
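
As one concrete possibility, a sketch using Apache DataSketches' HLL (purely
illustrative; neither the library nor this serialized form is mandated by
anything in Iceberg today):

import org.apache.datasketches.hll.HllSketch;

HllSketch sketch = new HllSketch(12);           // lgK = 12, ~1.6% relative error
for (String value : new String[] {"a", "b", "a"}) {
  sketch.update(value);                         // feed each column value in
}
double estimate = sketch.getEstimate();         // ~2.0 distinct values
byte[] portable = sketch.toCompactByteArray();  // serialized form an engine could import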

Best,
PF





On Sat, Jul 3, 2021 at 2:27 AM Ryan Blue  wrote:

> I feel it’s better to ensure as much correctness in the statistics as
> possible and then to let the engines make educated decisions about how they
> want to work on that information.
>
> I agree with this, but I’m wondering where the line is for “as much
> correctness … as possible”.
>
> It hadn’t occurred to me that someone might compact two files and also
> merge the distinct counts. In that case, I completely agree that we should
> require that the distinct count should be based on the data and not on a
> calculation from other distinct counts.
>
> But, I think it is probably important that these can be set from sketches
> that don’t require keeping a set of full values. Otherwise, it could be
> prohibitively expensive to produce them.
>
> It probably makes sense to be clear in the docs, like Jack suggests: this
> is an estimate of the number of distinct values produced from the actual
> data, not from merging estimates (but possibly from merging the underlying
> sketches).
>
> On Fri, Jul 2, 2021 at 2:49 PM Jack Ye  wrote:
>
>> Yes I think Dan has a good point here that I was trying to get to, the
>> correctness aspect of it is the major reason that led me to consider the
>> upper and lower bound approach, otherwise as Ryan described, the current
>> count metrics could already be sufficient for planning purposes. With a
>> bound, at least that bound can always be calculated correctly during any
>> operation, whereas the distinct value count might drift if we use some
>> heuristics. So if we decide to go with adding the count, it should be a
>> nullable count such that if we cannot decide the true value it can be
>> omitted. Or it could directly be defined as a distinct value estimate, but
>> in that case I would prefer to have it as some sort of sketch implemented
>> as a secondary index.
>>
>> -Jack Ye
>>
>> On Fri, Jul 2, 2021 at 9:01 AM Daniel Weeks  wrote:
>>
>>> Jack, that's the same thought I had initially but I think we can
>>> actually break this down into two separate issues.
>>>
>>> One is on the scan side, which is how we merge the information that we
>>> have, and I think what you're describing is something that we can do
>>> even without storing the lower and upper bounds. The advantage is that on
>>> the scan side you can actually use other information that you have in order to
>>> make better decisions about how to do that merging. If all you have is
>>> lower and upper bounds you may actually lose a little bit of that fidelity
>>> based on previous merges, compactions, etc.
>>>
>>> On the file side, I'm a little concerned about using statistics that can
>>> drift over time. If we're merging files, stats can quickly become
>>> non-representative of the actual data in the files themselves. Beyond
>>> merges, even compactions can impact the actual stats within the file, so in
>>> many cases you would need to recalculate them anyway.
>>>
>>> I feel it's better to ensure as much correctness in the statistics as
>>> possible and then to let the engines make educated decisions about how they
>>> want to work on that information.
>>>
>>> I'd vote for having distinct counts in the stats but requiring that they
>>> be accurate. I feel like it's better to require that they're dropped in the
>>> event that they cannot be accurate. This may cause some problems with
>>> row-level deletes though, so we have to be a little careful with the
>>> implementation.
>>>
>>> -Dan
>>>
>>>
>>>
>>>
>>> On Fri, Jul 2, 2021, 7:55 AM Jack Ye  wrote:
>>>
 What about, instead of a distinct count, we introduce min and max possible
 distinct counts? In the best case scenario, min and max are equal, we know
 exactly how many distinct values there are, and we can directly update the
 new distinct count. In the worst case, when merging 2 unsorted files,
 without the need for complicated estimation, we know the max for file1
 and file2 is (max_file1 + max_file2), and the min is max(min_file1,
 min_file2). If files are merged without a sort order, then this gap will
 continue to grow and provide less useful information to
 planning. But when we perform a sort for the rows in files, we can update the
 min and max to the same and reduce this 
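
The bound arithmetic described above, spelled out as a minimal sketch
(variable and class names are illustrative only, not a spec proposal):

    class NdvBounds {
      // Merging the distinct-count bounds of two files, per the proposal above.
      static long[] merge(long minFile1, long maxFile1,
                          long minFile2, long maxFile2) {
        long min = Math.max(minFile1, minFile2); // best case: full value overlap
        long max = maxFile1 + maxFile2;          // worst case: no shared values
        return new long[] {min, max};
      }
    }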

Re: Proposal: Support for views in Iceberg

2021-07-27 Thread Piotr Findeisen
 for predicate pushdown. So I think it would not be hard to extend on that
>> to support this use case. Different engines can build this logical
>> structure when traversing their own AST during a create view query.
>>
>> 3. With these considerations, I think the "sql" field can potentially be
>> a map (maybe called "engine-sqls"?), where the key is the engine type and
>> version, like "Spark 3.1", and the value is the view SQL string. In this way,
>> the engine that creates the view can still read the SQL directly, which
>> might lead to better engine-native integration and avoid redundant parsing.
>> But in this approach there is always a default intermediate representation
>> to fall back to when the engine's key is not found in the map. If we
>> want to make incremental progress and delay the design of the intermediate
>> representation, I think we should at least use this map instead of just a
>> single string.
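
As one possible sketch of that map (hypothetical field names and dialect
keys, purely illustrative; no part of an agreed spec):

    {
      "engine-sqls": {
        "spark-3.1": "SELECT id, name FROM db.orders WHERE region = 'EU'",
        "trino-360": "SELECT id, name FROM orders WHERE region = 'EU'"
      },
      "default-sql": "SELECT id, name FROM orders WHERE region = 'EU'"
    }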
>>
>> Thanks,
>> Jack Ye
>>
>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen 
>> wrote:
>>
>>> Hi,
>>>
>>> First of all thank you for this discussion and all the view-related work!
>>>
>>> I agree that solving the cross-engine compatibility problem may not be a
>>> primary feature today, but I am concerned that not thinking about this from
>>> the start may "tunnel" us into a wrong direction.
>>> Cross-engine compatible views would be such a cool feature that it is
>>> hard to just let it pass.
>>>
>>> My thinking about a smaller IR may be a side-product of me not being
>>> familiar enough with Calcite.
>>> However, a new IR being focused on a compatible representation, and not
>>> being tied to anything, would actually be a good thing.
>>> For example, we would need to focus on the JSON representation, but we
>>> wouldn't need to deal with tree traversal or anything, so the code for this
>>> could be pretty simple.
>>>
>>> >  Allow only ANSI-compliant SQL and anything that is truly common
>>> across engines in the view definition (this is how currently Netflix uses
>>> these 'common' views across Spark and Trino)
>>>
>>> It's interesting. Anjali, do you have means to enforce that, or is this
>>> just a convention?
>>>
>>> What are the common building blocks (relational operations, constructs
>>> and functions) that you found sufficient for expressing your views?
>>> Being able to enumerate them could help validate various approaches
>>> considered here, including feasibility of dedicated representation.
>>>
>>>
>>> Best,
>>> PF
>>>
>>>
>>>
>>>
>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray  wrote:
>>>
>>>> Hey Anjali,
>>>>
>>>> I am definitely happy to help with implementing 1-3 in your first list
>>>> once the spec has been approved by the community. My hope is that the final
>>>> version of the view spec will make it easy to re-use existing rollback/time
>>>> travel/metadata etc functionalities.
>>>>
>>>> Regarding SQL dialects, my personal opinion is: enforcing ANSI-compliant
>>>> SQL across all engines is hard and probably not desirable, while storing
>>>> Calcite makes it hard for e.g. Python to use views. A project to make a
>>>> cross-language and cross-engine IR for SQL views and the relevant
>>>> transpilers is imho outside the scope of this spec and probably deserving
>>>> of an Apache project of its own. A smaller IR like Piotr suggested is
>>>> possible, but I feel it will likely quickly snowball into a larger project
>>>> and slow down adoption of the view spec in Iceberg. So I think the most
>>>> reasonable way forward is to add a dialect field and a warning to engines
>>>> that views are not (yet) cross-compatible. This is at odds with the
>>>> original spirit of Iceberg tables and I wonder how the broader community
>>>> feels about it? I would hope that we can make the view spec engine-free
>>>> over time and eventually deprecate the dialect field.
>>>>
>>>> Best,
>>>> Ryan
>>>>
>>>> PS if anyone is interested in collaborating on engine agnostic views
>>>> please reach out. I am keen on exploring this topic.
>>>>
>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood
>>>>  wrote:
>>>>
>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments

Re: [DISCUSS] Moving to apache-iceberg Slack workspace

2021-07-27 Thread Piotr Findeisen
Hi,

I don't have an opinion on which Slack workspace this is, as long as it's easy
to join.
A manual joining process is not healthy, for sure.
Btw, the apache-iceberg workspace is currently limited to @apache.org emails,
which some people do not have (e.g. I do not).
Will you be sharing an invite link or something?

Best,
PF



On Sun, Jul 25, 2021 at 9:48 PM Ryan Blue  wrote:

> Hi everyone,
>
> A few weeks ago, we talked about whether to use a Slack workspace for
> Iceberg or whether to continue using the ASF's workspace. At the time, we
> thought it was possible for anyone to invite themselves to the ASF's Slack,
> so we thought it would be best to use the common one. But after the
> self-invite link broke recently, we found out that ASF infra doesn't want
> to fix the link anymore because of spammers, which leaves us without a way
> for people to join us on Slack without requesting an invite.
>
> I think that requesting an invite is too much friction for someone that
> wants to join this community, so I propose we reconsider our previous
> decision and move to apache-iceberg.slack.com.
>
> Any objections?
>
> Ryan
>
>
> --
> Ryan Blue
>


Re: [External] Re: Continuing the Secondary Index Discussion

2022-03-07 Thread Piotr Findeisen
Hi Zaicheng,

thanks for following up on this. I'm certainly interested.
The proposed time doesn't work for me though; I'm in the CET time zone.

Best,
PF


On Sat, Mar 5, 2022 at 9:33 AM Zaicheng Wang 
wrote:

> Hi dev folks,
>
> As discussed in the sync
> 
> meeting, we will have a dedicated meeting on this topic.
> I tentatively scheduled a meeting on 4PM, March 8th PST time. The meeting
> link is https://meet.google.com/ttd-jzid-abp
> Please let me know if the time does not work for you.
>
> Thanks,
> Zaicheng
>
> zaicheng wang  于2022年3月2日周三 21:17写道:
>
>> Hi folks,
>>
>> This is Zaicheng from ByteDance. We spent some time working on solving
>> the index invalidation problem we discussed in the dev email channel.
>> And while working on the POC, we also realized there are some
>> metadata changes that might be introduced.
>> We put these details into a document:
>>
>> https://docs.google.com/document/d/1hLCKNtnA94gFKjssQpqS_2qqAxrlwqq3f6agij_4Rm4/edit?usp=sharing
>> The document includes two proposals for solving the index invalidation
>> problem: one based on @Jack Ye’s idea of introducing a new sequence number,
>> and another that leverages the current manifest entry structure. The
>> document also describes the corresponding table spec change.
>> Please let me know if you have any thoughts. We could also discuss this
>> during the sync meeting.
>>
>> Thanks,
>> Zaicheng
>>
>> On Tue, Feb 1, 2022 at 8:51 AM Jack Ye  wrote:
>>
>>> Hi Zaicheng, I cannot see your pictures, maybe we could discuss in Slack.
>>>
>>> The goal here is to have a monotonically increasing number that could be
>>> used to detect what files have been newly added and should be indexed. This
>>> is especially important to know how up-to-date an index is for each
>>> partition.
>>>
>>> In a table without compaction, the sequence numbers of files would continue
>>> to increase. If we have indexed all files up to sequence number 3, we know
>>> that the next indexing process needs to index all the files with sequence
>>> number greater than 3. But during compaction, files will be rewritten with
>>> the starting sequence number, while by commit time the sequence number might
>>> already have gone much higher. For example, I start compaction at seq=3, and
>>> while it is running for a few hours, there are 10 inserts done to the
>>> table, so the current sequence number is 13. When I commit the compacted
>>> data files, those files are essentially written at a sequence number older
>>> than the latest. This breaks a lot of assumptions, like (1) I cannot just
>>> find new data to index by checking if the sequence number is higher than a
>>> certain value, and (2) a reader cannot determine if an index could be used
>>> based on the sequence number.
>>>
>>> The solution I was describing is to have another watermark that is
>>> monotonically increasing regardless of compaction. So compaction
>>> would commit those files at seq=3, but the new watermark of those files is
>>> 14. Then we can use this new watermark for all the index operations.
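
A minimal sketch of how such a watermark could be consumed (the names are
hypothetical; no such field exists in the current spec):

    import java.util.ArrayList;
    import java.util.List;

    class IndexFreshness {
      static class FileInfo {           // stand-in for a manifest entry
        long sequenceNumber;            // can move backwards on compaction
        long indexWatermark;            // hypothetical, monotonically increasing
      }

      // Files that still need indexing: compare against the watermark, not the
      // sequence number, so rewritten (compacted) files are not missed.
      static List<FileInfo> filesToIndex(List<FileInfo> files, long indexedUpTo) {
        List<FileInfo> result = new ArrayList<>();
        for (FileInfo file : files) {
          if (file.indexWatermark > indexedUpTo) {
            result.add(file);
          }
        }
        return result;
      }
    }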
>>>
>>> Best,
>>> Jack Ye
>>>
>>>
>>> On Sat, Jan 29, 2022 at 5:07 AM Zaicheng Wang 
>>> wrote:
>>>
 Hi Jack,


 Thanks for the summary and it helps me a lot.

 Trying to understand point 2 and having my 2 cents.

 *a mechanism for tracking file change is needed. Unfortunately sequence
 numbers cannot be used due to the introduction of compaction that rewrites
 files into a lower sequence number. Another monotonically increasing
 watermark for files has to be introduced for index change detection and
 invalidation.*

 Please let me know if I have some wrong/silly assumptions.

 So the *reason* we couldn't use sequence numbers as the validity
 indicator of the index is compaction. Before compaction (taking a very
 simple example), the data file and index file should have a mapping, and
 tableScan.planTask() is able to decide whether to use the index purely by
 comparing sequence numbers (as well as the index spec id, if we have one).

 After compaction, tableScan.planTask() couldn't do so, because the data
 file with seq = 5 is compacted into a new data file with seq = 10. Thus wrong
 plan tasks might be returned.

 I wonder how an additional watermark only for the index could solve the
 problem?


 And based on my gut feeling, I feel we could somehow solve the problem
 with the current sequence number:

 *Option 1*: When compacting, we could compact those data files whose
 index is up to date into one group, and those files whose index is stale or
 missing into another group. (Just like what we are doing with the data files
 that are unpartitioned or whose partition spec id does not match.)

 The *pro* is that we could still leverage indexes for part of the data
 files, and we could 

Iceberg NDV stats

2022-03-16 Thread Piotr Findeisen
Hi,

We at Starburst are looking into adding number-of-distinct-values (NDV)
statistics to Iceberg tables, to let e.g. the Trino cost-based query
optimizer produce better plans when working with Iceberg tables.

The initial approach is for table-level statistics, and may be improved in
the future.
I would appreciate feedback on the design doc
https://docs.google.com/document/d/1we0BuQbbdqiJS2eUFC_-6TPSuO57GXivzKmcTzApivY


This stats topic is related to Secondary Indexes, but we need slightly
different terminology and mechanics for both. For example, indexes need to
be exact, and properly invalidated. Statistics may be outdated and still
useful, so these two things need to be coherent but separate.

Best
PF


Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-22 Thread Piotr Findeisen
Hi Ajantha,

Thank you for spending the time to look into this.

re a: I think I remember Ryan saying Parquet isn't good for bigger pieces
of data, and some stats sketches or indices can be bigger than others.
Also, Parquet's logical row / columnar storage format doesn't give as
much benefit for what's closer to key-value storage.

re b:
this is still TBD --
e.g. https://github.com/apache/iceberg/pull/4945
https://github.com/apache/iceberg/pull/5021

re c, e:
for partition-level, it's not decided yet how it will be handled

re d:
yes, ANALYZE can be separate operation, see
https://github.com/trinodb/trino/pull/12317 for POC

Best regards,
PF



On Tue, Jun 21, 2022 at 8:52 AM Ajantha Bhat  wrote:

> Thank you Piotr for all of the work you’ve put into this.
>
> I just checked the spec. I have a few newbie questions.
>
> a. Instead of using an existing columnar format like Parquet (one file for
> one type of stats) to store indexes, is there any reason why we developed our
> own format, and were any benchmarks taken comparing Puffin with other formats?
>
> b. How these Puffin files are linked to Iceberg's metadata files is still
> a missing link for me. As the Puffin spec says, these stats are table level
> (updated per snapshot). So, do we need an Iceberg spec change to store the
> file names of these Puffin files so that remove_orphan_files will not
> clean them up accidentally? (This is also needed for expire_snapshots.)
>
> c. NDVs are column-level stats. So, I expect the latest Puffin file of
> that snapshot will have one row of stats for each
> column. But if we are to implement a secondary index or table-level partition
> stats, there can be many rows (millions) in Puffin, depending on the dataset.
> So, for every commit, do we need to read the previous snapshot's Puffin
> file and write back a new file with updated stats? (The file might be very
> huge as data grows.) I think it will affect the commit time. Any
> thoughts on this?
>
> d. Slightly related to the above point, do we plan to asynchronously
> support collecting the stats like "ANALYZE table" and modify the table
> metadata with the stats file names? (might need an Iceberg commit to write
> new table metadata)
>
> e. Even though table-level partition stats are available from the partitions
> metadata table (along with filter pushdown support), computing the metadata
> table per query will be expensive.
> Hence, we are looking forward to storing them in the Puffin format. But
> I'm not sure about storing it as a single file with millions of rows.
> I would like to collaborate and discuss more on this.
>
> Thanks,
> Ajantha
>
> On Mon, Jun 13, 2022 at 2:45 AM Miao Wang 
> wrote:
>
>> +1 on the format! It looks great!
>>
>>
>>
>> Thanks for materializing the initial design idea.
>>
>>
>>
>> Miao
>>
>> *From: *Kyle Bendickson 
>> *Date: *Sunday, June 12, 2022 at 1:55 PM
>> *To: *dev@iceberg.apache.org 
>> *Subject: *Re: [VOTE] Adopt Puffin format as a file format for
>> statistics and indexes
>>
>> *EXTERNAL: Use caution when clicking on links or opening attachments.*
>>
>>
>>
>> +1 [non-binding]
>>
>>
>>
>> Thank you Piotr for all of the work you’ve put into this.
>>
>>
>>
>> This should greatly benefit not only Iceberg on Trino, but hopefully can
>> be used in many novel ways due to its well thought out generic design and
>> incorporation of the ability to extend with new sketches.
>>
>>
>>
>> Looking forward to the improvements this will bring.
>>
>>
>>
>> - Kyle
>>
>>
>>
>> On Fri, Jun 10, 2022 at 1:47 PM Alexander Jo 
>> wrote:
>>
>> +1, let's do it!
>>
>>
>>
>> On Fri, Jun 10, 2022 at 2:47 PM John Zhuge  wrote:
>>
>> +1  Looking forward to the features it enables.
>>
>>
>>
>> On Fri, Jun 10, 2022 at 10:11 AM Yufei Gu  wrote:
>>
>> +1. Looking forward to the partition stats.
>>
>> Best,
>>
>>
>>
>> Yufei
>>
>>
>>
>>
>>
>> On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks  wrote:
>>
>> +1 as well.  Excited about the progress here.
>>
>>
>>
>> -Dan
>>
>> On Thu, Jun 9, 2022, 6:25 PM Junjie Chen 
>> wrote:
>>
>> +1, really nice! Indexes are coming!
>>
>>
>>
>> On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho 
>> wrote:
>>
>> +1, it's an exciting step for Iceberg, look forward to all the new
>> statistics and secondary indices it will allow.
>>
>>
>>
>> Had a few questions of what

[VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-09 Thread Piotr Findeisen
Hi Everyone,

I propose that we adopt Puffin file format as a file format for statistics
and indexes in Iceberg tables.

Puffin file format specification:
https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
(previous discussions:  https://github.com/apache/iceberg/pull/4944,
https://github.com/apache/iceberg-docs/pull/69)

Intend use:
* statistics in Iceberg tables (see
https://github.com/apache/iceberg/pull/4945 and associated proposed
implementation https://github.com/apache/iceberg/pull/4741)
* in the future: storage for secondary indexes

Puffin file reader and writer implementation:
https://github.com/apache/iceberg/pull/4537

Thanks,
PF


Re: Positional delete with vs without the delete row values

2022-05-09 Thread Piotr Findeisen
Hi Peter,

FWIW, the Trino Iceberg connector writes deletion files with just positions,
without row data. cc @Alexander Jo 

> For the 1st point we just need to collect the statistics during the
delete, but we do not have to actually persist the data.

I would be wary of creating ORC/Parquet files with statistics that do not
match the actual file contents.


> Do I miss something? Is there a use-case when using positional deletes
with row values is significantly more effective?

I recall some mention of a CDC use-case -- producing CDC events from changes
to a table. But I think I recall someone mentioning this usually ends up
needing to join with the actual data files anyway.
@Ryan Blue  will know better, but in the meantime you can
probably also dig the topic up in the mailing list.

Best
PF






On Thu, May 5, 2022 at 3:59 PM Peter Vary 
wrote:

> Hi Team,
>
> We are working on integrating Iceberg V2 tables with Hive, and enabling
> delete and update operations.
> The delete is implemented by Marton and the first version is already
> merged: https://issues.apache.org/jira/browse/HIVE-26102
> The update statement is still in progress:
> https://issues.apache.org/jira/browse/HIVE-26136
> The edges are a bit rough for the time being, so don’t use this in
> production :D
>
> During the implementation we found that implementing deletes was quite
> straightforward with the Iceberg positional deletes, and without much
> effort we were able to provide the row values too. OTOH for updates we need
> to sort the delete files and the data files differently. ATM we have only a
> single result table, so we ended up implementing our own writer, which is
> very similar to
> https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/io/SortedPosDeleteWriter.java
> to do the sorting of the delete records for us. The problem with the
> SortedPosDeleteWriter is that as the record size grows, the number of
> records we can keep in memory decreases. So we ended up with our own writer
> which stores only the minimal things in memory and writes only positional
> deletes without the actual row values. See:
> https://github.com/apache/hive/blob/master/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergBufferedDeleteWriter.java
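
The core of that buffering idea, as a minimal sketch (not the actual Hive
writer; PosWriter is a stand-in interface, not Iceberg's API):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    class BufferedPositionDeletes {
      interface PosWriter {                       // stand-in for a delete writer
        void delete(CharSequence path, long position);
      }

      // Buffer only (file path -> positions); row values are never kept.
      private final Map<String, List<Long>> buffered = new TreeMap<>();

      void delete(String path, long position) {
        buffered.computeIfAbsent(path, p -> new ArrayList<>()).add(position);
      }

      void flush(PosWriter writer) {
        // Positional delete files must be ordered by (path, position);
        // TreeMap keeps paths sorted, positions are sorted per path.
        buffered.forEach((path, positions) -> {
          positions.sort(Long::compare);
          positions.forEach(position -> writer.delete(path, position));
        });
        buffered.clear();
      }
    }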
>
> The question is:
> - What is the experience of the community? When it is beneficial to have
> the row values in the positional delete files in production?
>
> My feeling is:
>
>1. The row data is best used when there is a filter in the query and
>we can filter out whole delete files when running the query.
>2. There could be a slight improvement when we can skip
>RowGroups/Stripes based on the filter
>
>
> For the 1st point we just need to collect the statistics during the
> delete, but we do not have to actually persist the data.
> Would it be viable to create these delete files where the statistics could
> not be calculated directly from the files themselves?
> Would the community accept these files?
>
> OTOH we have significant downsides for positional deletes with row values:
>
>1. The delete file size increases significantly
>2. We should keep a smaller RowGroup/Stripe size for delete rows to
>accommodate the bigger amount of raw data - so we have to read more footers,
>which adds IO overhead
>
>
> So my feeling is that, generally speaking, positional deletes without the
> actual row data would be more performant than positional deletes with row
> data.
>
> Do I miss something? Is there a use-case when using positional deletes
> with row values is significantly more effective?
>
> Thanks,
> Peter
>
>


Re: Change default format-version of our forked Iceberg to v2

2023-01-11 Thread Piotr Findeisen
Hi,

FWIW Trino already creates v2 tables by default.
Thought it's worth sharing for context.

Best
PF




On Tue, Jan 10, 2023 at 10:09 AM Manu Zhang  wrote:

> Hi all,
>
> We've maintained a forked Iceberg internally and all our use cases involve
> v2 tables with row-level updates and deletes. Our users need to remember to
> create tables with the `'format-version'='2'` option or alter the
> table afterwards.
>
> I'm thinking about changing the default format-version of our
> forked Iceberg to v2. Is there any concern about this change? Any hidden
> issues I've missed?
>
> Thanks,
> Manu
>


Re: [Proposal] Partition stats in Iceberg

2022-11-23 Thread Piotr Findeisen
Hi Ajantha,

this is very interesting document, thank you for your work on this!
I've added a few comments there.

I have one high-level design comment so I thought it would be nicer to
everyone if I re-post it here

is "partition" the right level of keeping the stats?
> We do this in Hive, but was it an accidental choice? or just the only
> thing that was possible to be implemented many years ago?


> Iceberg allows to have higher number of partitions compared to Hive,
> because it scales better. But that means partition-level may or may not be
> the right granularity.


> A self-optimizing system would gather stats on "per query unit" basis --
> for example if i partition by [ day x country ], but usually query by day,
> the days are the "query unit" and from stats perspective country can be
> ignored.
> Having more fine-grained partitions may lead to expensive planning time,
> so it's not theoretical problem.


> I am not saying we should implement all this logic right now, but I think
> we should decouple partitioning scheme from stats partitions, to allow
>  query engine to become smarter.
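
To make the "query unit" idea concrete, a toy rollup (illustrative only):
with partitions keyed by [day, country] but queries filtering on day,
partition-level row counts could be aggregated to day granularity:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class QueryUnitStats {
      // Roll row counts per [day, country] partition up to per-day counts,
      // assuming the day value is the first element of the partition tuple.
      static Map<String, Long> rollUpToDay(Map<List<String>, Long> perPartitionRows) {
        Map<String, Long> perDay = new HashMap<>();
        perPartitionRows.forEach((partition, rows) ->
            perDay.merge(partition.get(0), rows, Long::sum));
        return perDay;
      }
    }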



cc @Alexander Jo 

Best
PF




On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat  wrote:

> Hi Community,
> I did a proposal write-up for the partition stats in Iceberg.
> Please have a look and let me know what you think. I would like to work on
> it.
>
>
> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
>
> Requirement background snippet from the above document.
>
>> For some query engines that use a cost-based optimizer instead of, or along
>> with, a rule-based optimizer (like Dremio, Trino, etc.), it is good to know,
>> at planning time, partition-level stats like total rows per partition and
>> total files per partition, to make decisions in the CBO (like deciding on
>> the join reordering and join type, and identifying the parallelism).
>> Currently, the only way to do this is to read the partition info from the
>> data_file in the manifest_entry of the manifest file and compute
>> partition-level statistics (the same thing that the ‘partitions’ metadata
>> table is doing [see Appendix A]).
>> Doing this on each query is expensive. Hence, this is a proposal for
>> computing and storing partition-level stats for Iceberg tables and using
>> them during queries.
>
>
>
> Thanks,
> Ajantha
>


Re: [VOTE] Release Apache Iceberg 1.1.0 RC4

2022-11-28 Thread Piotr Findeisen
Hi,

https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-core/1.1.0/
is already published (on Nov 22, so before the voting was concluded).
Is it "the" release, or will a new tag be pushed to Maven Central?

best,
PF


On Mon, Nov 28, 2022 at 9:18 AM Gabor Kaszab  wrote:

>
> Hey All,
>
> I kept the voting open longer than the usual 72 hours due to Thanksgiving
> in the US. We now have a critical mass and the *vote passes* with the
> following results:
>
> +1: 13 (including 3 binding)
> without having any 0 or -1.
>
> Thanks to everyone who verified the release and voted!
> Also huge thanks to *Fokko* for creating all those release candidates!
>
> I'll proceed with the next steps right away.
>
> Cheers,
> Gabor
>
> On Mon, Nov 28, 2022 at 1:07 AM OpenInx  wrote:
>
>> +1 (binding)
>>
>> 1. Download the source tarball, signature (.asc), and checksum (.sha512):
>>   OK
>> 2. Import gpg keys: download KEYS and run gpg --import
>> /path/to/downloaded/KEYS.txt (optional if this hasn’t changed) :  OK
>> 3. Verify the signature by running: gpg --verify
>> apache-iceberg-1.1.0.tar.gz.asc :  OK
>> 4. Verify the checksum by running: shasum -a 512
>> apache-iceberg-1.1.0.tar.gz  :  OK
>> 5. Untar the archive and go into the source directory: tar xzvf
>> apache-iceberg-1.1.0.tar.gz && cd apache-iceberg-1.1.0:  OK
>> 6. Run RAT checks to validate license headers: dev/check-license: OK
>> 7. Build and test the project: ./gradlew build (use Java 8) :   OK
>>
>> Thanks.
>>
>> On Mon, Nov 28, 2022 at 3:05 AM Daniel Weeks  wrote:
>>
>>> +1 (binding)
>>>
>>> Verified sigs/sums/licenses/build/test
>>> Built and tested with JDK 8
>>>
>>> -Dan
>>>
>>> On Fri, Nov 25, 2022 at 5:58 PM leilei hu  wrote:
>>>
 +1(non-binding)
 verified(java 8):

 - Create table using HiveCatalog and HadoopCatalog
 - Spark Structured Streaming with Spark 3.2.1
 - Spark query with Spark’s DataSourceV2 API
 - Ran build with JDK8

 2022年11月24日 上午12:13,Cheng Pan  写道:

 +1 (non-binding)

 Passed integration test[1] w/ Apache Kyuubi

 [1] https://github.com/apache/incubator-kyuubi/pull/3810

 Thanks,
 Cheng Pan


 On Nov 23, 2022 at 16:13:34, Ajantha Bhat 
 wrote:

> +1 (non-binding)
>
> - verified tests against spark-3.3 runtime jar with Nessie catalog.
> - verified the contents of the iceberg-spark-runtime-3.3_2.12-1.1.0.jar
> - checked for spark-3.0 removal
> - validated checksum and signature
> - checked license docs & ran RAT checks
> - ran build with JDK1.8
>
> Thanks,
> Ajantha
>
> On Tue, Nov 22, 2022 at 9:49 PM Gabor Kaszab 
> wrote:
>
>> Hi Everyone,
>>
>> I propose that we release the following RC as the official Apache 
>> Iceberg 1.1.0 release.
>>
>> The commit ID is ede085d0f7529f24acd0c81dd0a43f7bb969b763
>> * This corresponds to the tag: apache-iceberg-1.1.0-rc4
>> * https://github.com/apache/iceberg/commits/apache-iceberg-1.1.0-rc4
>> * 
>> https://github.com/apache/iceberg/tree/ede085d0f7529f24acd0c81dd0a43f7bb969b763
>>
>> The release tarball, signature, and checksums are here:
>> * https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.1.0-rc4
>>
>> You can find the KEYS file here:
>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>
>> Convenience binary artifacts are staged on Nexus. The Maven repository 
>> URL is:
>> * 
>> https://repository.apache.org/content/repositories/orgapacheiceberg-1114/
>>
>> Please download, verify, and test.
>>
>> Please vote in the next 72 hours.
>>
>> [ ] +1 Release this as Apache Iceberg 1.1.0
>> [ ] +0
>> [ ] -1 Do not release this because...
>>
>>



Re: [DISCUSS] Hive locks removal

2023-01-20 Thread Piotr Findeisen
Hi Peter

Thanks for bringing this issue up and thanks for working on it as well.
I haven't experienced these problems first-hand, so I don't have an opinion yet
on what the solution should or shouldn't be.

With this new approach, is there any plan to address situations where more
than one application or application version is writing to a metastore? In
such a scenario, some applications may be updated to the new commit mechanism
and some may not, leaving the locks ineffective (=> data corruption).
Or, is the assumption that in a given deployment all applications are
switched to the new way at the same time?

It's awesome that you shared all these links for context. For even more
context, the only thing I can add is that Trino has its own,
Iceberg-compatible implementation of the Hive catalog, and the table lock is
acquired here:
https://github.com/trinodb/trino/blob/9feb3d1fb13e0374a17c3f96535342872aa19be5/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/catalog/hms/HiveMetastoreTableOperations.java#L68-L72

Best
PF



On Mon, Jan 16, 2023 at 2:18 PM Péter Váry 
wrote:

> Hi Team,
>
> I have seen several complaints about locks not being removed correctly when
> HiveCatalog was used [1], [2], and suggested solutions [3]. I also did
> my share to fix the situation as much as possible [4]. The issue is that
> every fix is just a band-aid: the client can disappear and leave behind
> an abandoned lock, preventing further changes on the given Iceberg
> table.
>
> To fix this I propose the following solution:
> 1. We can provide a minimal Hive feature to check the previous metadata
> location transactionally before altering a table and setting the new
> snapshot
> 2. We can have a configuration to turn off using the Hive locks in the
> HiveCatalog and rely on the new Hive feature to provide the atomicity
> needed for an Iceberg commit (as sketched just below)
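
A rough Java sketch of the commit contract this enables (the method names
here are hypothetical; the real mechanics live in HIVE-26882 and the linked
PRs):

    abstract class LockFreeCommit {
      // Stand-in for the HMS call enabled by HIVE-26882: atomically sets the
      // parameter to newValue iff it currently equals expectedValue.
      abstract boolean alterTableIfParameterEquals(
          String db, String table, String param,
          String expectedValue, String newValue);

      boolean commit(String db, String table,
                     String baseMetadataLocation, String newMetadataLocation) {
        // Concurrent committers race on a compare-and-swap of metadata_location
        // instead of taking a Hive lock; the loser refreshes and retries.
        return alterTableIfParameterEquals(
            db, table, "metadata_location",
            baseMetadataLocation, newMetadataLocation);
      }
    }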
>
> I have created the above mentioned minimal Hive feature request
> (HIVE-26882 [5]) and the PR to fix it [6] also, with the approval of the
> Hive community backported and merged this feature to the active branches:
> branch-3 [7], branch-3.1 [8], branch-2 [9], branch-2.3 [10]. This means
> that any of the next Hive releases will contain this fix, and any company
> having their own forks could easily cherry-pick this minimal change. The
> first upcoming upstream Hive release with this feature would be the Hive
> 3.2.0 which is expected to come around 2-3 months [11].
>
> Also I have created the Iceberg PR to start using the new feature [12].
> This has 2 parts:
> 1. Rebased Adam Szita's work [13] which refactored out the Lock related
> features on a single class
> 2. Added a small patch which adds the possibility to use the new feature
> and turn off the locking [14] for HiveCatalog tables
>
> Using the new feature would be beneficial in multiple ways:
> - Enhances stability of the writers
> - Improves the performance of the commits, as fewer HMS calls
> are needed for a commit operation
>
> My question for the community is:
> - Do we want to merge both PRs to the Iceberg code, or do we want to merge
> only the refactor PR (1) to the Iceberg code and only merge the PR with the
> new configuration (2) after there is an Apache Hive release with the new
> Hive feature?
>
> I would vote for merging the configuration PR too because of the following
> reasons:
> - Many companies are using their own fork of Hive - they do not need to
> wait for the Apache release to add a new feature
> - The PR itself is small, and by default it is turned off - it will not cause
> too many issues with maintaining the code
> - We will not forget to add the change when the Apache Hive is released
>
> What are your thoughts?
>
> If you have time, please review the PR.
>
> Thanks,
> Peter
>
> [1] Should do the best efforts to release hive table lock #3336 -
> https://github.com/apache/iceberg/pull/3336
> [2] Lock remains in HMS if HiveTableOperations gets killed (direct process
> shutdown - no signals) after lock is acquired  #2301 -
> https://github.com/apache/iceberg/pull/2301
> [3] What is the purpose of Hive Lock ? #6370 -
> https://github.com/apache/iceberg/issues/6370
> [4] Hive: Lock hardening #6451 -
> https://github.com/apache/iceberg/pull/6451
> [5] HIVE-26882: Allow transactional check of Table parameter before
> altering the Table - https://issues.apache.org/jira/browse/HIVE-26882
> [6] master/HIVE-26882: https://github.com/apache/hive/pull/3888
>
> [7] branch-3/HIVE-26882: https://github.com/apache/hive/pull/3943
>
> [8] branch-3.1/HIVE-26882: : https://github.com/apache/hive/pull/3944
>
> [9] branch-2/HIVE-26882: : https://github.com/apache/hive/pull/3946
>
> [10] branch-2.1/HIVE-26882: : https://github.com/apache/hive/pull/3947
>
> [11] Upcoming Hive release:
> https://issues.apache.org/jira/browse/HIVE-26882?focusedCommentId=17677002=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17677002
>
> [12] Hive: Use