Re: [DISCUSS] Simplification of terminologies

Pratyaksh Sharma Tue, 12 Nov 2019 22:43:35 -0800

All three proposals seem to make life easier for Hudi users. +1 for all
of them.


On Wed, Nov 13, 2019 at 8:34 AM Vinoth Chandar <[email protected]> wrote:

> Thanks everyone for the feedback. Looks like we are in general agreement.
>
> I am inclined to just do 1 & 2 and leave COPY_ON_WRITE as is, based on the
> great points Ethan and Shiyan raised. Makes sense.
> Will still wait 1-2 days before closing this thread.
>
> @semanticbeeing That's a great idea. Is it more like a technical glossary of
> sorts? Let's maybe start a different DISCUSS thread on that specific topic,
> so everyone can chime in and give that proposal more attention?
>
>
>
>
>
> On Tue, Nov 12, 2019 at 2:44 PM Y. Ethan Guo <[email protected]>
> wrote:
>
> > +1 on [1] and [2].
> >
> > For [3], I have similar doubts as Shiyan.
> >
> > For the naming, I can understand the original intent of the analogy for
> > COW, which is to make another "copy" of the columnar/parquet file upon
> > modification/update of the records in the file.  From the system design
> > point of view, it's easy to understand.  I'm okay with renaming it to
> > "MERGE_ON_WRITE" since it's probably straightforward for users at first
> > glance.
> >
> > In terms of the concept, COW and MOR are listed as storage/table types.
> > From my understanding, they represent different tradeoffs between read
> > and write performance for Hudi tables, and within MOR there are further
> > tradeoffs, e.g., lazy merge on read or periodic compaction and cleaning
> > pipelined along with ingestion. It looks like these could be controlled
> > through configs, e.g., "disable_merge_on_write", "compaction_frequency",
> > etc., instead of fixing the storage type, to control the tradeoff that a
> > user would like to make.  Requirements may change, so a user could then
> > switch between COW and MOR by tuning the configs. We don't have to make
> > such changes now, but I'm wondering if this is something worth
> > considering in future releases.
> >
> > - Ethan
> >
> > On Tue, Nov 12, 2019 at 8:43 AM nishith agarwal <[email protected]>
> > wrote:
> >
> > > +1 on the first two, don't feel strongly about (3).
> > >
> > > Thanks,
> > > Nishith
> > >
> > > On Tue, Nov 12, 2019 at 5:03 AM leesf <[email protected]> wrote:
> > >
> > > > [1] +1. `views` indeed confused me a lot.
> > > > [2] +1. `snapshot` is more reasonable.
> > > > [3] I don't feel strongly about renaming it. The current name
> > > > `COPY_ON_WRITE` is reasonable considering the cost of renaming and the
> > > > behavior: a new version of the parquet file is created, which appears
> > > > to be copied from the old version of the parquet file.
> > > >
> > > > Best,
> > > > Leesf
> > > >
> > > > Balaji Varadarajan <[email protected]> wrote on Tue, Nov 12, 2019 at 3:55 PM:
> > > >
> > > > > Agree with all 3 changes. The naming now looks more consistent than
> > > > > earlier. +1 on them
> > > > >
> > > > > Depending on whether we are renaming Input formats for (1) and (2) -
> > > > > this could require some migration steps for
> > > > >
> > > > > Balaji.V
> > > > >
> > > > >
> > > > > On Mon, Nov 11, 2019 at 7:38 PM vino yang <[email protected]> wrote:
> > > > >
> > > > > > Hi Vinoth,
> > > > > >
> > > > > > Thanks for bringing these proposals.
> > > > > >
> > > > > > +1 on all three. Especially, a big +1 on the third renaming proposal.
> > > > > >
> > > > > > When I was a newbie, the "COPY_ON_WRITE" term confused me a lot. The
> > > > > > "copy" term easily misleads users, making them compare it with the
> > > > > > `CopyOnWriteArrayList` data structure provided by the JDK and think
> > > > > > of file systems.
> > > > > >
> > > > > > Best,
> > > > > > Vino
> > > > > >
> > > > > >
> > > > > > Bhavani Sudha <[email protected]> wrote on Tue, Nov 12, 2019 at 9:05 AM:
> > > > > >
> > > > > > > +1 on all three rename proposals. I think this would make the
> > > > > > > concepts super easy to follow for new users.
> > > > > > >
> > > > > > > If changing [3] seems to be a stretch, we should definitely do [1] &
> > > > > > > [2] at the least IMO. I will be glad to help out on the renames to
> > > > > > > whatever extent possible should the Hudi community be inclined to
> > > > > > > pursue this.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Sudha
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Nov 11, 2019 at 3:46 PM Vinoth Chandar <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hello all,
> > > > > > > >
> > > > > > > > I wanted to raise an important topic with the community around whether we
> > > > > > > > should rename some of our terminologies in code/docs to be more
> > > > > > > > user-friendly and understandable.
> > > > > > > >
> > > > > > > > Let me also provide some context for each, since I am probably guilty of
> > > > > > > > introducing most of them in the first place :).
> > > > > > > >
> > > > > > > > *1. Rename "views" to "query" : *Instead of saying incremental view or
> > > > > > > > read-optimized view, talk about them as "incremental query" and
> > > > > > > > "read-optimized query". The term "view" is very technical, and what I was
> > > > > > > > trying to convey was that we ingest/store the data once and expose views on
> > > > > > > > top. But new users (at least half a dozen of them, to me) tend to confuse
> > > > > > > > this with views/materialized views found in databases. Almost always, we
> > > > > > > > talk about views in terms of the expected behavior of a query on the
> > > > > > > > view. I am proposing to just call these different query types, since it's
> > > > > > > > a more universally accepted terminology and IMO clearer.
> > > > > > > >
> > > > > > > > *2. Rename "Read-Optimized/Realtime" views to Snapshot views + Have
> > > > > > > > Read-Optimized view only for MOR storage :* This one is probably the
> > > > > > > > trickiest. Hudi was always designed with MOR in mind, even as we were
> > > > > > > > working on COW storage, and consequently we named the pure parquet-backed
> > > > > > > > view Read-Optimized, hoping to name the parquet + avro based view
> > > > > > > > Write-Optimized. However, we opted to name it Realtime to emphasize the
> > > > > > > > data freshness aspect. In retrospect, the views should not have been named
> > > > > > > > after their performance characteristics but rather after the classes of
> > > > > > > > queries done on them and the guarantees for those (point #1 above).
> > > > > > > > Moreover, once we have parquet embedded into the log format, the
> > > > > > > > tradeoffs may not be the same anyway.
> > > > > > > >
> > > > > > > > So combining with the renaming proposed in #1, we would end up with the
> > > > > > > > following..
> > > > > > > >
> > > > > > > > Copy-On-Write :
> > > > > > > > [Old]  Read-Optimized View =>  [New] Snapshot Query
> > > > > > > > [Old]  Incremental View => [New] Incremental Query
> > > > > > > >
> > > > > > > > Merge-On-Read:
> > > > > > > > [Old] Realtime View => [New] Snapshot Query
> > > > > > > > [Old] Incremental View => [New] Incremental Query
> > > > > > > > [Old] Read-Optimized View => [New] Read-Optimized Query (since it is
> > > > > > > > always read optimized compared to Snapshot query, at the cost of staler
> > > > > > > > data)
> > > > > > > >
> > > > > > > > Both changes #1 & #2 could be simpler changes to just code references, docs
> > > > > > > > and configs. We can support both strings for some time and deprecate the
> > > > > > > > old ones eventually, since queries are stateless.
> > > > > > > >
> > > > > > > > *3. Rename COPY_ON_WRITE to MERGE_ON_WRITE :* The name originated since the
> > > > > > > > design was very similar to
> > > > > > > > https://en.wikipedia.org/wiki/Copy-on-write filesystems & snapshotting, and
> > > > > > > > we once hoped to push some of this logic into the storage itself, all in
> > > > > > > > vain. But the name stuck, even though once we had MERGE_ON_READ the focus
> > > > > > > > was often on merge costs etc., which the name COPY_ON_WRITE does not convey
> > > > > > > > directly. I don't feel very strongly about this, and there is also a cost
> > > > > > > > to changing it since it's persisted inside hoodie.properties, and we will
> > > > > > > > support both strings internally in code for backwards compatibility anyway.
> > > > > > > >
> > > > > > > > Naming something is very hard (yes, try :)). I believe these changes will
> > > > > > > > make the project simpler to understand for everyone out there. We also have
> > > > > > > > tons of new people here, so I am also happy to let go, if it's already
> > > > > > > > clear :)
> > > > > > > >
> > > > > > > > Please use the bullet number when you share your feedback so we know what
> > > > > > > > the discussion is about.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Vinoth
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
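
For reference, here is a minimal sketch of how the renames in #1 and #2 could accept both the old view names and the new query names for a while, with the old ones deprecated, as the proposal suggests. Everything named here is hypothetical and only for illustration: the string values, `LEGACY_VIEW_TO_QUERY`, and `resolve_query_type` are assumptions, not Hudi's actual config keys or API, and Python is used only for brevity.

```python
import warnings

# Hypothetical table-type identifiers (illustrative only).
COPY_ON_WRITE = "COPY_ON_WRITE"
MERGE_ON_READ = "MERGE_ON_READ"

# New query names allowed per table type. Per proposal #2, the
# read-optimized query exists only for MERGE_ON_READ tables.
NEW_QUERY_TYPES = {
    COPY_ON_WRITE: {"snapshot", "incremental"},
    MERGE_ON_READ: {"snapshot", "incremental", "read_optimized"},
}

# (table type, old view name) -> new query name, following the
# [Old] => [New] mapping listed in the proposal.
LEGACY_VIEW_TO_QUERY = {
    (COPY_ON_WRITE, "read_optimized_view"): "snapshot",
    (COPY_ON_WRITE, "incremental_view"): "incremental",
    (MERGE_ON_READ, "realtime_view"): "snapshot",
    (MERGE_ON_READ, "incremental_view"): "incremental",
    (MERGE_ON_READ, "read_optimized_view"): "read_optimized",
}


def resolve_query_type(table_type, name):
    """Return the new query name for either a new or a legacy name."""
    if table_type not in NEW_QUERY_TYPES:
        raise ValueError(f"unknown table type: {table_type!r}")
    key = name.strip().lower()
    # New names pass through unchanged.
    if key in NEW_QUERY_TYPES[table_type]:
        return key
    # Legacy names are translated and flagged as deprecated.
    resolved = LEGACY_VIEW_TO_QUERY.get((table_type, key))
    if resolved is None:
        raise ValueError(
            f"unsupported query/view name {name!r} for {table_type}"
        )
    warnings.warn(
        f"{name!r} is deprecated; use {resolved!r} instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return resolved
```

With this sketch, `resolve_query_type(MERGE_ON_READ, "realtime_view")` returns `"snapshot"` and emits a `DeprecationWarning`, while `resolve_query_type(COPY_ON_WRITE, "read_optimized")` raises `ValueError`, since the read-optimized query would exist only for MOR. Because queries are stateless, a lookup of this kind needs no migration of stored data.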
