Re: [DISCUSS] Formalizing the release process

2020-09-08 Thread Pratyaksh Sharma
I missed this thread; the plan looks good to me as well.

On Wed, Sep 9, 2020 at 8:31 AM Vinoth Chandar  wrote:

> Would love to understand the general skepticism a bit more.
> Is it rooted more on hitting those in the short term? or even in the longer
> run with a better test infrastructure in place?
>
> On Tue, Sep 8, 2020 at 6:42 PM Raymond Xu 
> wrote:
>
> > +1. Also a bit skeptical on monthly minor releases. But can give it a try.
> >
> > On Tue, Sep 8, 2020 at 5:55 PM Mehrotra, Udit wrote:
> >
> > > +1 on the process.
> > >
> > > On 9/8/20, 5:11 PM, "Vinoth Chandar"  wrote:
> > >
> > > > >, bit skeptical on minor version releases every month, but nvm. guess its
> > > > just a rough estimate.
> > > >
> > > > That's an aspirational goal that we should try to hit. We have all worked
> > > > on teams/projects that shipped at that cadence regularly.
> > > > It's a matter of getting our test infrastructure and processes streamlined
> > > > IMO :)
> > > >
> > > > On Fri, Sep 4, 2020 at 8:29 AM Nishith wrote:
> > > >
> > > > > +1 on the process
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > > On Sep 3, 2020, at 8:14 AM, Sivabalan wrote:
> > > > > >
> > > > > > +1 on the general release policy. Realistically speaking, bit skeptical on
> > > > > > minor version releases every month, but nvm. guess its just a rough
> > > > > > estimate.
> > > > > >
> > > > > >> On Tue, Sep 1, 2020 at 8:41 PM Balaji Varadarajan wrote:
> > > > >>
> > > > >>
> > > > > >> +1 on the process.
> > > > > >> Balaji.V
> > > > > >>
> > > > > >> On Tuesday, September 1, 2020, 04:56:55 PM PDT, Gary Li <garyli1...@outlook.com> wrote:
> > > > > >>
> > > > > >> +1
> > > > > >> Gary Li
> > > > > >>
> > > > > >> From: Bhavani Sudha
> > > > > >> Sent: Wednesday, September 2, 2020 3:11:06 AM
> > > > > >> To: us...@hudi.apache.org
> > > > > >> Cc: dev@hudi.apache.org
> > > > > >> Subject: Re: [DISCUSS] Formalizing the release process
> > > > > >>
> > > > > >> +1 on the release process formalization.
> > > > >>
> > > > > >>> On Tue, Sep 1, 2020 at 10:21 AM Vinoth Chandar <vin...@apache.org> wrote:
> > > > > >>>
> > > > > >>> Hi all,
> > > > > >>>
> > > > > >>> Love to start a discussion around how we can formalize the release
> > > > > >>> process and timelines more, so that we can ensure timely and quality
> > > > > >>> releases.
> > > > > >>>
> > > > > >>> Below is an outline of an idea that was discussed in the last community
> > > > > >>> sync (also in the weekly sync notes).
> > > > > >>>
> > > > > >>> - We will do a "feature driven" major version release every 3 months or
> > > > > >>> so, i.e. going from version x.y to x.y+1. The idea here is that this ships
> > > > > >>> once all the committed features are code complete, tested and verified.
> > > > > >>> - We keep doing patches, bug fixes and usability improvements to the
> > > > > >>> project always. So, we will also do a "time driven" minor version release
> > > > > >>> x.y.z → x.y.z+1 every month or so.
> > > > > >>> - We will always be releasing from master, and thus major release features
> > > > > >>> need to be guarded by flags on minor versions.
> > > > > >>> - We will try to avoid patch releases, i.e. cherry-picking a few commits
> > > > > >>> onto an earlier release version (during 0.5.3 we actually found the
> > > > > >>> cherry-picking of master onto 0.5.2 pretty tricky and even error-prone).
> > > > > >>> In some cases, we may have to just make patch releases, but only in
> > > > > >>> extenuating circumstances. Over time, with better tooling and a larger
> > > > > >>> community, we might be able to do this.
> > > > > >>>
> > > > > >>> As for the major release planning process:
> > > > > >>>
> > > > > >>>   - PMC/Committers can come up with an initial list sourced based on
> > > > > >>>   user asks, support issues
> > > > > >>>   - The list is shared with the community for feedback. The community can
> > > > > >>>   suggest new items, re-prioritizations
> > > > > >>>   - Contributors are welcome to commit more features/asks (with due
> > > > > >>>   process)
> > > > > >>>
> > > > > >>> I would love to hear +1s, -1s and also any new, completely different
> > > > > >>> ideas as well. Let's use this thread to align ourselves.
> > > > > >>>
> > > > > >>> Once we align ourselves, there are some release certification tools that
> > > > > >>> need to be built

20200908 Weekly Sync Minutes

2020-09-08 Thread Vinoth Chandar
https://cwiki.apache.org/confluence/display/HUDI/20200908+Weekly+Sync+Minutes

Please find this week's sync notes


Request to Add in Contributor list

2020-09-08 Thread Mani Jindal
Hi team,

Could you please tell me how to request contributor access for JIRA, so that
I can assign JIRA tickets to myself and contribute to the Hudi community?

JIRA Username: *manijndl77*
Email: *manijn...@gmail.com*
Full Name: *Mani Jindal*

Thanks and Regards
Mani Jindal


Re: [Question] Redundant release tag?

2020-09-08 Thread Balaji Varadarajan
 
Deleted.

Thanks,
Balaji.V

On Tuesday, September 8, 2020, 08:51:36 PM PDT, Raymond Xu wrote:
 
 I think there is a mistakenly created version tag 0.60 in JIRA; the number
does not seem to follow the release format.
Anyone care to delete this?
https://issues.apache.org/jira/projects/HUDI/versions/12348551
  

Re: [Question] HoodieROTablePathFilter not accept dir path

2020-09-08 Thread Balaji Varadarajan
Hi Raymond,
IIRC, we need to give a glob path (e.g. "base/partition/*") for
HoodieROTablePathFilter to work correctly. The path cache is at the partition
level, not the table level, so we need to extract the partition path correctly
to use it as the look-up key. The challenge in extracting the partition path
is that the "Path" type has no API to quickly tell whether a path is a
directory, and we should avoid making RPC calls here.

Thanks,
Balaji.V

On Tuesday, September 8, 2020, 09:56:49 AM PDT, Raymond Xu wrote:
 
 
https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L120-L121

As shown in the 2 lines above, it does not seem to work with directory
Path.
It should work for both `new Path("base/partition")` and `new
Path("base/partition/")`, but it only works for the former case. In the
latter case, `folder` will be "base/partition" and `path` will be
"base/partition/", which will always result in returning false.
A potential bug?
  

[Question] Redundant release tag?

2020-09-08 Thread Raymond Xu
I think there is a mistakenly created version tag 0.60 in JIRA; the number
does not seem to follow the release format.
Anyone care to delete this?
https://issues.apache.org/jira/projects/HUDI/versions/12348551


Re: [DISCUSS] New Community Weekly Sync up Time

2020-09-08 Thread Balaji Varadarajan
+1

On Tuesday, September 8, 2020, 05:54:52 PM PDT, Mehrotra, Udit wrote:
 
 I am okay with this too.

On 9/8/20, 5:33 PM, "Raymond Xu"  wrote:

    I'm ok with 1 hr earlier.

    On Tue, Sep 8, 2020, 5:09 PM Vinoth Chandar  wrote:

    > Anyone else wants to chime in for a new time, that works for everyone?
    >
    > Personally, I can do this time.
    >
    >  love to hear more inputs.
    >
    > On Wed, Sep 2, 2020 at 10:16 AM Pratyaksh Sharma 
    > wrote:
    >
    > > Hi everyone,
    > >
    > > Currently we have weekly sync-ups between 9 PM and 10 PM PST on
    > > Tuesdays. Since I switched jobs the month before last (in India), this
    > > time clashes exactly with the daily standup at my current org.
    > > This is the reason I have not been able to attend the sync-ups for quite
    > > some time.
    > >
    > > Hence I just wanted to check with everyone whether we could move the
    > > sync-up time 1 hour earlier, i.e. have it from 8 PM to 9 PM every
    > > Tuesday? Please let me know if this is suitable.
    > >
    >

  

Re: [DISCUSS] Formalizing the release process

2020-09-08 Thread Vinoth Chandar
Would love to understand the general skepticism a bit more.
Is it rooted more in hitting those in the short term, or even in the longer
run with a better test infrastructure in place?

On Tue, Sep 8, 2020 at 6:42 PM Raymond Xu 
wrote:

> +1. Also a bit skeptical on monthly minor releases. But can give it a try.
>
> On Tue, Sep 8, 2020 at 5:55 PM Mehrotra, Udit 
> wrote:
>
> > +1 on the process.
> >
> > On 9/8/20, 5:11 PM, "Vinoth Chandar"  wrote:
> >
> > >, bit skeptical on minor version releases every month, but nvm. guess its
> > just a rough estimate.
> >
> > That's an aspirational goal that we should try to hit. We have all worked
> > on teams/projects that shipped at that cadence regularly.
> > It's a matter of getting our test infrastructure and processes streamlined
> > IMO :)
> >
> > On Fri, Sep 4, 2020 at 8:29 AM Nishith wrote:
> >
> > > +1 on the process
> > >
> > > Sent from my iPhone
> > >
> > > > On Sep 3, 2020, at 8:14 AM, Sivabalan wrote:
> > > >
> > > > +1 on the general release policy. Realistically speaking, bit skeptical on
> > > > minor version releases every month, but nvm. guess its just a rough
> > > > estimate.
> > > >
> > > >> On Tue, Sep 1, 2020 at 8:41 PM Balaji Varadarajan wrote:
> > > >>
> > > >>
> > > > >> +1 on the process.
> > > > >> Balaji.V
> > > > >>
> > > > >> On Tuesday, September 1, 2020, 04:56:55 PM PDT, Gary Li <garyli1...@outlook.com> wrote:
> > > > >>
> > > > >> +1
> > > > >> Gary Li
> > > > >>
> > > > >> From: Bhavani Sudha
> > > > >> Sent: Wednesday, September 2, 2020 3:11:06 AM
> > > > >> To: us...@hudi.apache.org
> > > > >> Cc: dev@hudi.apache.org
> > > > >> Subject: Re: [DISCUSS] Formalizing the release process
> > > > >>
> > > > >> +1 on the release process formalization.
> > > >>
> > > > >>> On Tue, Sep 1, 2020 at 10:21 AM Vinoth Chandar <vin...@apache.org> wrote:
> > > > >>>
> > > > >>> Hi all,
> > > > >>>
> > > > >>> Love to start a discussion around how we can formalize the release
> > > > >>> process and timelines more, so that we can ensure timely and quality
> > > > >>> releases.
> > > > >>>
> > > > >>> Below is an outline of an idea that was discussed in the last community
> > > > >>> sync (also in the weekly sync notes).
> > > > >>>
> > > > >>> - We will do a "feature driven" major version release every 3 months or
> > > > >>> so, i.e. going from version x.y to x.y+1. The idea here is that this ships
> > > > >>> once all the committed features are code complete, tested and verified.
> > > > >>> - We keep doing patches, bug fixes and usability improvements to the
> > > > >>> project always. So, we will also do a "time driven" minor version release
> > > > >>> x.y.z → x.y.z+1 every month or so.
> > > > >>> - We will always be releasing from master, and thus major release features
> > > > >>> need to be guarded by flags on minor versions.
> > > > >>> - We will try to avoid patch releases, i.e. cherry-picking a few commits
> > > > >>> onto an earlier release version (during 0.5.3 we actually found the
> > > > >>> cherry-picking of master onto 0.5.2 pretty tricky and even error-prone).
> > > > >>> In some cases, we may have to just make patch releases, but only in
> > > > >>> extenuating circumstances. Over time, with better tooling and a larger
> > > > >>> community, we might be able to do this.
> > > > >>>
> > > > >>> As for the major release planning process:
> > > > >>>
> > > > >>>   - PMC/Committers can come up with an initial list sourced based on
> > > > >>>   user asks, support issues
> > > > >>>   - The list is shared with the community for feedback. The community can
> > > > >>>   suggest new items, re-prioritizations
> > > > >>>   - Contributors are welcome to commit more features/asks (with due
> > > > >>>   process)
> > > > >>>
> > > > >>> I would love to hear +1s, -1s and also any new, completely different
> > > > >>> ideas as well. Let's use this thread to align ourselves.
> > > > >>>
> > > > >>> Once we align ourselves, there are some release certification tools that
> > > > >>> need to be built out. Hopefully, we can do this together. :)
> > > >>>
> > > >>>
> > > >>> Thanks
> > > >>> Vinoth
> > > >>>
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > -Sivabalan
> > >
> >
> >
>


Re: [DISCUSS] Formalizing the release process

2020-09-08 Thread Raymond Xu
+1. Also a bit skeptical about monthly minor releases, but we can give it a try.

On Tue, Sep 8, 2020 at 5:55 PM Mehrotra, Udit 
wrote:

> +1 on the process.
>
> On 9/8/20, 5:11 PM, "Vinoth Chandar"  wrote:
>
> >, bit skeptical on minor version releases every month, but nvm. guess its
> just a rough estimate.
>
> That's an aspirational goal that we should try to hit. We have all worked
> on teams/projects that shipped at that cadence regularly.
> It's a matter of getting our test infrastructure and processes streamlined
> IMO :)
>
> On Fri, Sep 4, 2020 at 8:29 AM Nishith wrote:
>
> > +1 on the process
> >
> > Sent from my iPhone
> >
> > > On Sep 3, 2020, at 8:14 AM, Sivabalan wrote:
> > >
> > > +1 on the general release policy. Realistically speaking, bit skeptical on
> > > minor version releases every month, but nvm. guess its just a rough
> > > estimate.
> > >
> > >> On Tue, Sep 1, 2020 at 8:41 PM Balaji Varadarajan wrote:
> > >>
> > >>
> > > >> +1 on the process.
> > > >> Balaji.V
> > > >>
> > > >> On Tuesday, September 1, 2020, 04:56:55 PM PDT, Gary Li <garyli1...@outlook.com> wrote:
> > > >>
> > > >> +1
> > > >> Gary Li
> > > >>
> > > >> From: Bhavani Sudha
> > > >> Sent: Wednesday, September 2, 2020 3:11:06 AM
> > > >> To: us...@hudi.apache.org
> > > >> Cc: dev@hudi.apache.org
> > > >> Subject: Re: [DISCUSS] Formalizing the release process
> > > >>
> > > >> +1 on the release process formalization.
> > >>
> > > >>> On Tue, Sep 1, 2020 at 10:21 AM Vinoth Chandar <vin...@apache.org> wrote:
> > > >>>
> > > >>> Hi all,
> > > >>>
> > > >>> Love to start a discussion around how we can formalize the release
> > > >>> process and timelines more, so that we can ensure timely and quality
> > > >>> releases.
> > > >>>
> > > >>> Below is an outline of an idea that was discussed in the last community
> > > >>> sync (also in the weekly sync notes).
> > > >>>
> > > >>> - We will do a "feature driven" major version release every 3 months or
> > > >>> so, i.e. going from version x.y to x.y+1. The idea here is that this ships
> > > >>> once all the committed features are code complete, tested and verified.
> > > >>> - We keep doing patches, bug fixes and usability improvements to the
> > > >>> project always. So, we will also do a "time driven" minor version release
> > > >>> x.y.z → x.y.z+1 every month or so.
> > > >>> - We will always be releasing from master, and thus major release features
> > > >>> need to be guarded by flags on minor versions.
> > > >>> - We will try to avoid patch releases, i.e. cherry-picking a few commits
> > > >>> onto an earlier release version (during 0.5.3 we actually found the
> > > >>> cherry-picking of master onto 0.5.2 pretty tricky and even error-prone).
> > > >>> In some cases, we may have to just make patch releases, but only in
> > > >>> extenuating circumstances. Over time, with better tooling and a larger
> > > >>> community, we might be able to do this.
> > > >>>
> > > >>> As for the major release planning process:
> > > >>>
> > > >>>   - PMC/Committers can come up with an initial list sourced based on
> > > >>>   user asks, support issues
> > > >>>   - The list is shared with the community for feedback. The community can
> > > >>>   suggest new items, re-prioritizations
> > > >>>   - Contributors are welcome to commit more features/asks (with due
> > > >>>   process)
> > > >>>
> > > >>> I would love to hear +1s, -1s and also any new, completely different
> > > >>> ideas as well. Let's use this thread to align ourselves.
> > > >>>
> > > >>> Once we align ourselves, there are some release certification tools that
> > > >>> need to be built out. Hopefully, we can do this together. :)
> > >>>
> > >>>
> > >>> Thanks
> > >>> Vinoth
> > >>>
> > >>
> > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> >
>
>


Hudi CLI AWS Glue & S3 Tables

2020-09-08 Thread Adam
Hey guys,
I'm trying to use the Hudi CLI to connect to tables stored on S3 using the
Glue metastore. Using a tip from Ashish M G on Slack, I added the
dependencies, rebuilt, and was able to use the connect command to connect to
the table, albeit with warnings:

hudi->connect --path s3a://bucketName/path.parquet

29597 [Spring Shell] INFO
org.apache.hudi.common.table.HoodieTableMetaClient  - Loading
HoodieTableMetaClient from s3a://bucketName/path.parquet

WARNING: An illegal reflective access operation has occurred

WARNING: Illegal reflective access by
org.apache.hadoop.security.authentication.util.KerberosUtil
(file:/home/username/hudi-cli/target/lib/hadoop-auth-2.7.3.jar) to method
sun.security.krb5.Config.getInstance()

WARNING: Please consider reporting this to the maintainers of
org.apache.hadoop.security.authentication.util.KerberosUtil

WARNING: Use --illegal-access=warn to enable warnings of further illegal
reflective access operations

WARNING: All illegal access operations will be denied in a future release

29785 [Spring Shell] WARN  org.apache.hadoop.util.NativeCodeLoader  -
Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable

31060 [Spring Shell] INFO  org.apache.hudi.common.fs.FSUtils  - Hadoop
Configuration: fs.defaultFS: [file:///], Config:[Configuration:
core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml],
FileSystem: [org.apache.hadoop.fs.s3a.S3AFileSystem@6b725a01]

31380 [Spring Shell] INFO  org.apache.hudi.common.table.HoodieTableConfig  -
Loading table properties from
s3a://bucketName/path.parquet/.hoodie/hoodie.properties

31455 [Spring Shell] INFO
org.apache.hudi.common.table.HoodieTableMetaClient  - Finished Loading
Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from
s3a://bucketName/path.parquet

Metadata for table tablename loaded

However, many of the other commands do not seem to work properly:

hudi:tablename->savepoints show

╔═══════════════╗
║ SavepointTime ║
╠═══════════════╣
║ (empty)       ║
╚═══════════════╝

hudi:tablename->savepoint create

Commit null not found in Commits
org.apache.hudi.common.table.timeline.HoodieDefaultTimeline:
[20200724220817__commit__COMPLETED]


hudi:tablename->stats filesizes

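╔════════════╤═══════╤═══════╤═══════╤═══════╤═══════╤═══════╤══════════╤════════╗
║ CommitTime │ Min   │ 10th  │ 50th  │ avg   │ 95th  │ Max   │ NumFiles │ StdDev ║
╠════════════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪══════════╪════════╣
║ ALL        │ 0.0 B │ 0.0 B │ 0.0 B │ 0.0 B │ 0.0 B │ 0.0 B │ 0        │ 0.0 B  ║
╚════════════╧═══════╧═══════╧═══════╧═══════╧═══════╧═══════╧══════════╧════════╝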


hudi:tablename->show fsview all

171314 [Spring Shell] INFO
org.apache.hudi.common.table.HoodieTableMetaClient  - Loading
HoodieTableMetaClient from s3a://bucketName/path.parquet

171362 [Spring Shell] INFO  org.apache.hudi.common.fs.FSUtils  - Hadoop
Configuration: fs.defaultFS: [file:///], Config:[Configuration:
core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml],
FileSystem: [org.apache.hadoop.fs.s3a.S3AFileSystem@6b725a01]

171666 [Spring Shell] INFO  org.apache.hudi.common.table.HoodieTableConfig  -
Loading table properties from
s3a://bucketName/path.parquet/.hoodie/hoodie.properties

171725 [Spring Shell] INFO
org.apache.hudi.common.table.HoodieTableMetaClient  - Finished Loading
Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from
s3a://bucketName/path.parquet

171725 [Spring Shell] INFO
org.apache.hudi.common.table.HoodieTableMetaClient  - Loading Active commit
timeline for s3a://bucketName/path.parquet

171817 [Spring Shell] INFO
org.apache.hudi.common.table.timeline.HoodieActiveTimeline  - Loaded
instants [[20200724220817__clean__COMPLETED],
[20200724220817__commit__COMPLETED]]

172262 [Spring Shell] INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView  -
addFilesToView: NumFiles=0, NumFileGroups=0, FileGroupsCreationTime=5,
StoreTimeTaken=2

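╔═══════════╤════════╤══════════════╤═══════════╤════════════════╤═════════════════╤═══════════════════════╤═════════════╗
║ Partition │ FileId │ Base-Instant │ Data-File │ Data-File Size │ Num Delta Files │ Total Delta File Size │ Delta Files ║
╠═══════════╧════════╧══════════════╧═══════════╧════════════════╧═════════════════╧═══════════════════════╧═════════════╣
║ (empty)                                                                                                                ║
╚════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝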

I looked through the CLI code, and it seems that for true support we would
need to add support for the different storage options hdfs/s3/azure/etc. in
HoodieTableMetaClient. As from my 

Re: [DISCUSS] Formalizing the release process

2020-09-08 Thread Mehrotra, Udit
+1 on the process.

On 9/8/20, 5:11 PM, "Vinoth Chandar"  wrote:

>, bit skeptical on minor version releases every month, but nvm. guess its
just a rough estimate.

That's an aspirational goal that we should try to hit. We have all worked
on teams/projects that shipped at that cadence regularly.
It's a matter of getting our test infrastructure and processes streamlined
IMO :)

On Fri, Sep 4, 2020 at 8:29 AM Nishith  wrote:

> +1 on the process
>
> Sent from my iPhone
>
> > On Sep 3, 2020, at 8:14 AM, Sivabalan  wrote:
> >
> > +1 on the general release policy. Realistically speaking, bit skeptical on
> > minor version releases every month, but nvm. guess its just a rough
> > estimate.
> >
> >> On Tue, Sep 1, 2020 at 8:41 PM Balaji Varadarajan wrote:
> >>
> >>
> >> +1 on the process.
> >> Balaji.V
> >>
> >> On Tuesday, September 1, 2020, 04:56:55 PM PDT, Gary Li <garyli1...@outlook.com> wrote:
> >>
> >> +1
> >> Gary Li
> >>
> >> From: Bhavani Sudha
> >> Sent: Wednesday, September 2, 2020 3:11:06 AM
> >> To: us...@hudi.apache.org
> >> Cc: dev@hudi.apache.org
> >> Subject: Re: [DISCUSS] Formalizing the release process
> >>
> >> +1 on the release process formalization.
> >>
> >>> On Tue, Sep 1, 2020 at 10:21 AM Vinoth Chandar wrote:
> >>>
> >>> Hi all,
> >>>
> >>> Love to start a discussion around how we can formalize the release
> >>> process and timelines more, so that we can ensure timely and quality
> >>> releases.
> >>>
> >>> Below is an outline of an idea that was discussed in the last community
> >>> sync (also in the weekly sync notes).
> >>>
> >>> - We will do a "feature driven" major version release every 3 months or
> >>> so, i.e. going from version x.y to x.y+1. The idea here is that this ships
> >>> once all the committed features are code complete, tested and verified.
> >>> - We keep doing patches, bug fixes and usability improvements to the
> >>> project always. So, we will also do a "time driven" minor version release
> >>> x.y.z → x.y.z+1 every month or so.
> >>> - We will always be releasing from master, and thus major release features
> >>> need to be guarded by flags on minor versions.
> >>> - We will try to avoid patch releases, i.e. cherry-picking a few commits
> >>> onto an earlier release version (during 0.5.3 we actually found the
> >>> cherry-picking of master onto 0.5.2 pretty tricky and even error-prone).
> >>> In some cases, we may have to just make patch releases, but only in
> >>> extenuating circumstances. Over time, with better tooling and a larger
> >>> community, we might be able to do this.
> >>>
> >>> As for the major release planning process:
> >>>
> >>>   - PMC/Committers can come up with an initial list sourced based on
> >>>   user asks, support issues
> >>>   - The list is shared with the community for feedback. The community can
> >>>   suggest new items, re-prioritizations
> >>>   - Contributors are welcome to commit more features/asks (with due
> >>>   process)
> >>>
> >>> I would love to hear +1s, -1s and also any new, completely different
> >>> ideas as well. Let's use this thread to align ourselves.
> >>>
> >>> Once we align ourselves, there are some release certification tools that
> >>> need to be built out. Hopefully, we can do this together. :)
> >>>
> >>>
> >>> Thanks
> >>> Vinoth
> >>>
> >>
> >
> >
> >
> > --
> > Regards,
> > -Sivabalan
>



Re: [DISCUSS] New Community Weekly Sync up Time

2020-09-08 Thread Mehrotra, Udit
I am okay with this too.

On 9/8/20, 5:33 PM, "Raymond Xu"  wrote:

I'm ok with 1 hr earlier.

On Tue, Sep 8, 2020, 5:09 PM Vinoth Chandar  wrote:

> Anyone else wants to chime in for a new time, that works for everyone?
>
> Personally, I can do this time.
>
>  love to hear more inputs.
>
> On Wed, Sep 2, 2020 at 10:16 AM Pratyaksh Sharma 
> wrote:
>
> > Hi everyone,
> >
> > Currently we have weekly sync-ups between 9 PM and 10 PM PST on
> > Tuesdays. Since I switched jobs the month before last (in India), this
> > time clashes exactly with the daily standup at my current org.
> > This is the reason I have not been able to attend the sync-ups for quite
> > some time.
> >
> > Hence I just wanted to check with everyone whether we could move the
> > sync-up time 1 hour earlier, i.e. have it from 8 PM to 9 PM every
> > Tuesday? Please let me know if this is suitable.
> >
>



Re: [DISCUSS] New Community Weekly Sync up Time

2020-09-08 Thread Raymond Xu
I'm ok with 1 hr earlier.

On Tue, Sep 8, 2020, 5:09 PM Vinoth Chandar  wrote:

> Anyone else wants to chime in for a new time, that works for everyone?
>
> Personally, I can do this time.
>
>  love to hear more inputs.
>
> On Wed, Sep 2, 2020 at 10:16 AM Pratyaksh Sharma 
> wrote:
>
> > Hi everyone,
> >
> > Currently we have weekly sync-ups between 9 PM and 10 PM PST on
> > Tuesdays. Since I switched jobs the month before last (in India), this
> > time clashes exactly with the daily standup at my current org.
> > This is the reason I have not been able to attend the sync-ups for quite
> > some time.
> >
> > Hence I just wanted to check with everyone whether we could move the
> > sync-up time 1 hour earlier, i.e. have it from 8 PM to 9 PM every
> > Tuesday? Please let me know if this is suitable.
> >
>


Re: [DISCUSS] Formalizing the release process

2020-09-08 Thread Vinoth Chandar
>, bit skeptical on minor version releases every month, but nvm. guess its
just a rough estimate.

That's an aspirational goal that we should try to hit. We have all worked
on teams/projects that shipped at that cadence regularly.
It's a matter of getting our test infrastructure and processes streamlined
IMO :)

On Fri, Sep 4, 2020 at 8:29 AM Nishith  wrote:

> +1 on the process
>
> Sent from my iPhone
>
> > On Sep 3, 2020, at 8:14 AM, Sivabalan  wrote:
> >
> > +1 on the general release policy. Realistically speaking, bit skeptical on
> > minor version releases every month, but nvm. guess its just a rough
> > estimate.
> >
> >> On Tue, Sep 1, 2020 at 8:41 PM Balaji Varadarajan wrote:
> >>
> >>
> >> +1 on the process.
> >> Balaji.V
> >>
> >> On Tuesday, September 1, 2020, 04:56:55 PM PDT, Gary Li <garyli1...@outlook.com> wrote:
> >>
> >> +1
> >> Gary Li
> >>
> >> From: Bhavani Sudha
> >> Sent: Wednesday, September 2, 2020 3:11:06 AM
> >> To: us...@hudi.apache.org
> >> Cc: dev@hudi.apache.org
> >> Subject: Re: [DISCUSS] Formalizing the release process
> >>
> >> +1 on the release process formalization.
> >>
> >>> On Tue, Sep 1, 2020 at 10:21 AM Vinoth Chandar wrote:
> >>>
> >>> Hi all,
> >>>
> >>> Love to start a discussion around how we can formalize the release
> >>> process and timelines more, so that we can ensure timely and quality
> >>> releases.
> >>>
> >>> Below is an outline of an idea that was discussed in the last community
> >>> sync (also in the weekly sync notes).
> >>>
> >>> - We will do a "feature driven" major version release every 3 months or
> >>> so, i.e. going from version x.y to x.y+1. The idea here is that this ships
> >>> once all the committed features are code complete, tested and verified.
> >>> - We keep doing patches, bug fixes and usability improvements to the
> >>> project always. So, we will also do a "time driven" minor version release
> >>> x.y.z → x.y.z+1 every month or so.
> >>> - We will always be releasing from master, and thus major release features
> >>> need to be guarded by flags on minor versions.
> >>> - We will try to avoid patch releases, i.e. cherry-picking a few commits
> >>> onto an earlier release version (during 0.5.3 we actually found the
> >>> cherry-picking of master onto 0.5.2 pretty tricky and even error-prone).
> >>> In some cases, we may have to just make patch releases, but only in
> >>> extenuating circumstances. Over time, with better tooling and a larger
> >>> community, we might be able to do this.
> >>>
> >>> As for the major release planning process:
> >>>
> >>>   - PMC/Committers can come up with an initial list sourced based on
> >>>   user asks, support issues
> >>>   - The list is shared with the community for feedback. The community can
> >>>   suggest new items, re-prioritizations
> >>>   - Contributors are welcome to commit more features/asks (with due
> >>>   process)
> >>>
> >>> I would love to hear +1s, -1s and also any new, completely different
> >>> ideas as well. Let's use this thread to align ourselves.
> >>>
> >>> Once we align ourselves, there are some release certification tools that
> >>> need to be built out. Hopefully, we can do this together. :)
> >>>
> >>>
> >>> Thanks
> >>> Vinoth
> >>>
> >>
> >
> >
> >
> > --
> > Regards,
> > -Sivabalan
>
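
The version scheme quoted above (feature-driven x.y to x.y+1, time-driven
x.y.z to x.y.z+1) can be sketched roughly as follows. This is only an
illustration: the helper names `parse`, `next_major` and `next_minor` are
hypothetical and not part of any Hudi tooling, and resetting z to 0 on a
major release is an assumption the thread does not spell out.

```python
def parse(version):
    """Split an 'x.y' or 'x.y.z' string into three integers (z defaults to 0)."""
    parts = [int(p) for p in version.split(".")]
    if len(parts) == 2:
        parts.append(0)
    if len(parts) != 3:
        raise ValueError(f"expected x.y or x.y.z, got {version!r}")
    return tuple(parts)


def next_major(version):
    """Feature-driven release in the thread's terms: x.y -> x.y+1.

    Assumption: the patch component z restarts at 0.
    """
    x, y, _ = parse(version)
    return f"{x}.{y + 1}.0"


def next_minor(version):
    """Time-driven release in the thread's terms: x.y.z -> x.y.z+1."""
    x, y, z = parse(version)
    return f"{x}.{y}.{z + 1}"


print(next_major("0.6.0"))  # 0.7.0
print(next_minor("0.6.0"))  # 0.6.1
```

Note that the thread calls the x.y bump a "major" release and the z bump a
"minor" release, which differs from the usual semantic-versioning names.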


Re: [DISCUSS] New Community Weekly Sync up Time

2020-09-08 Thread Vinoth Chandar
Does anyone else want to chime in on a new time that works for everyone?

Personally, I can do this time.

Would love to hear more input.

On Wed, Sep 2, 2020 at 10:16 AM Pratyaksh Sharma 
wrote:

> Hi everyone,
>
> Currently we have weekly sync-ups between 9 PM and 10 PM PST on
> Tuesdays. Since I switched jobs the month before last (in India), this
> time clashes exactly with the daily standup at my current org.
> This is the reason I have not been able to attend the sync-ups for quite
> some time.
>
> Hence I just wanted to check with everyone whether we could move the
> sync-up time 1 hour earlier, i.e. have it from 8 PM to 9 PM every
> Tuesday? Please let me know if this is suitable.
>


Re: [DISCUSS] enable cross AZ consistency and quality checks of hudi datasets

2020-09-08 Thread Satish Kotha
Hi folks,

Any thoughts on this? At a high level, we want to change high
watermark commit through a property to perform pre-commit and post-commit
hooks. Is this useful for anyone else?

On Thu, Sep 3, 2020 at 11:12 AM Sanjay Sundaresan  wrote:

> Hello folks,
>
> We have a use case where copies of the same Hudi dataset stored in
> different DCs (for high availability / disaster recovery) must be strongly
> consistent and must pass all quality checks before they can be consumed
> by users who query them. Currently, we have an offline service
> that runs quality checks and asynchronously syncs the Hudi datasets
> between different DCs/AZs, but until the sync happens, queries running in
> these different DCs see inconsistent results. For some of our most critical
> datasets, this inconsistency is causing many problems.
>
> We want to support the following use cases: 1) data consistency and
> 2) data quality checks after commit.
>
> Our flow looks like this
> 1) write new batch of data at t1
> 2) user queries will not see data at t1
> 3) data quality checks are done by setting a session property to include t1
> 4) optionally replicate t1 to other AZs and promote t1 so regular user
> queries will see data at t1
>
> We want to make the following changes to achieve this.
>
> 1. Change the HoodieParquetInputFormat to look for the
> 'last_replication_timestamp' property in the JobConf and use it to create
> a new ActiveTimeline that limits the commits seen to those less than or
> equal to this timestamp. This can be overridden by a session property that
> will allow us to make such data visible for quality checks.
>
> 2. We are storing this particular timestamp as a table property in
> HiveMetaStore. To make it easier to update, we want to extend the
> HiveSyncTool to also update this table property when syncing a Hudi dataset
> to the HMS. The extended tool will take in a list of HMSs to be updated
> and will try to update each of them one by one. (In the case of a global HMS
> across all DCs this is just one, but if there is a region-local HMS per DC,
> the update of all HMSs is not truly transactional, so there is a small
> window of time where queries can return inconsistent results.) If the tool
> can't update all the HMSs, it will roll back the updated ones (again, not
> applicable for a global HMS).
>
> We have made the above changes to our internal branch and we are
> successfully running it in production.
>
> Please let us know your feedback on this change.
>
> Sanjay
>
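
The timeline restriction described in change 1 above can be sketched in a few
lines. This is a simplified illustration, not the actual
HoodieParquetInputFormat change: the function `visible_instants` and the
`include_unreplicated` flag (standing in for the session override) are
hypothetical names, and instants are assumed to be fixed-width
yyyyMMddHHmmss strings like the ones shown in the CLI output elsewhere in
this archive.

```python
def visible_instants(instants, last_replication_timestamp=None,
                     include_unreplicated=False):
    """Return the commit instants a query should see, in timeline order.

    instants: commit timestamps as fixed-width digit strings.
    last_replication_timestamp: the high watermark; None means no limit is set.
    include_unreplicated: stand-in for the session override used by
        quality checks to see commits beyond the watermark.
    """
    if last_replication_timestamp is None or include_unreplicated:
        return sorted(instants)
    # Fixed-width digit strings order correctly under plain string comparison.
    return sorted(t for t in instants if t <= last_replication_timestamp)


timeline = ["20200901101500", "20200902101500", "20200903101500"]
watermark = "20200902101500"

# Regular queries stop at the watermark.
print(visible_instants(timeline, watermark))
# Quality checks opt in to the newest, not yet promoted commit.
print(visible_instants(timeline, watermark, include_unreplicated=True))
```

Promoting a commit then amounts to advancing the watermark once the quality
checks and any replication have finished.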


[Question] HoodieROTablePathFilter not accept dir path

2020-09-08 Thread Raymond Xu
https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L120-L121

As shown in the 2 lines above, it does not seem to work with directory
Path.
It should work for both `new Path("base/partition")` and `new
Path("base/partition/")`, but it only works for the former case. In the
latter case, `folder` will be "base/partition" and `path` will be
"base/partition/", which will always result in returning false.
A potential bug?
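
The mismatch described above can be reproduced with plain strings. The sketch
below does not use Hadoop's `Path` class or the real filter code; the function
names are hypothetical, and it only illustrates why comparing
"base/partition" with "base/partition/" fails unless trailing slashes are
normalized first.

```python
def same_folder_naive(folder, path):
    """Plain string equality, which fails when only one side has a trailing slash."""
    return folder == path


def normalize(path):
    """Drop trailing slashes, but keep a path made only of slashes as '/'."""
    stripped = path.rstrip("/")
    return stripped if stripped or not path else "/"


def same_folder(folder, path):
    """Compare two directory paths after normalizing trailing slashes."""
    return normalize(folder) == normalize(path)


print(same_folder_naive("base/partition", "base/partition/"))  # False
print(same_folder("base/partition", "base/partition/"))        # True
```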