Re: [VOTE] Release 0.14.0, release candidate #3

2023-09-22 Thread Balaji Varadarajan
 +1 (binding)
Ran validate staged release test:

Checking Checksum of Source Release
  Checksum Check of Source Release - [OK]

Checking Signature
  Signature Check - [OK]

Checking for binary files in the source files
  No Binary Files in the source files? - [OK]

Checking for DISCLAIMER
  DISCLAIMER file exists ? [OK]

Checking for LICENSE and NOTICE
  License file exists ? [OK]
  Notice file exists ? [OK]

Performing custom Licensing Check
  Licensing Check Passed [OK]

Running RAT Check
  RAT Check Passed [OK]

On Friday, September 22, 2023 at 11:33:54 AM PDT, Amrish Lal 
 wrote:  
 
 +1 (non-binding)

- Sanity tests using COW/MOR table to create, update, delete, and query
records.
- Tested use of RLI in snapshot, realtime, time-travel, and incremental
queries.
- Overall OK, except that use of RLI should be disabled for time-travel
(HUDI-6886) and snapshot queries (HUDI-6891)

On Fri, Sep 22, 2023 at 11:26 AM Y Ethan Guo  wrote:

> +1 (binding)
> - Ran validate_staged_release.sh [OK]
> - Hudi (Delta)streamer with error injection [OK]
> - Bundle validation https://github.com/apache/hudi/actions/runs/6277569953
> [OK]
>
> - Ethan
>
>
> On Fri, Sep 22, 2023 at 10:29 AM Jonathan Vexler  wrote:
>
> > +1 (non-binding)
> > - Tested Spark Datasource and Spark Sql core flow tests
> > - Tested reading from bootstrap tables
> >
> >
> > On Fri, Sep 22, 2023 at 12:39 PM sagar sumit  wrote:
> >
> > > +1 (non-binding)
> > >
> > > - Long-running deltastreamer [OK]
> > > - Hive metastore sync [OK]
> > > - Query using Presto and Trino [OK]
> > >
> > > Regards,
> > > Sagar
> > >
> > > On Fri, Sep 22, 2023 at 9:53 PM Aditya Goenka 
> > wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > - Tested Spark SQL workflows, Deltastreamer, Spark Structured Streaming
> > > > for both types of tables, with and without record key.
> > > > - Meta Sync tests
> > > > - Tests for data-skipping with both Column stats and RLI.
> > > >
> > > > On Fri, Sep 22, 2023 at 9:38 PM Vinoth Chandar 
> > > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > >
> > > > >    - Ran rc checks on RC2 only, but nothing seems to have changed.
> > > > >    - Tested Spark Datasource/SQL flows around new features like
> auto
> > > key
> > > > >    generation. This is a simpler SQL experience.
> > > > >
> > > > >    Thanks to all the contributors !
> > > > >
> > > > >
> > > > > On Tue, Sep 19, 2023 at 11:56 AM Prashant Wason
> > >  > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > Please review and vote on the *release candidate #3* for the
> > version
> > > > > > 0.14.0, as follows:
> > > > > >
> > > > > > [ ] +1, Approve the release
> > > > > >
> > > > > > [ ] -1, Do not approve the release (please provide specific
> > comments)
> > > > > >
> > > > > >
> > > > > >
> > > > > > The complete staging area is available for your review, which
> > > includes:
> > > > > >
> > > > > > * JIRA release notes [1],
> > > > > >
> > > > > > * the official Apache source release and binary convenience releases
> > > > > > to be deployed to dist.apache.org [2], which are signed with the key
> > > > > > with fingerprint 75C5744E9E5CD5C48E19C082C4D858D73B9DB1B8 [3],
> > > > > >
> > > > > > * all artifacts to be deployed to the Maven Central Repository
> [4],
> > > > > >
> > > > > > * source code tag "0.14.0-rc3" [5],
> > > > > >
> > > > > >
> > > > > >
> > > > > > The vote will be open for at least 72 hours. It is adopted by
> > > majority
> > > > > > approval, with at least 3 PMC affirmative votes.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Prashant Wason
> > > > > >
> > > > > >
> > > > > >
> > > > > > [1]
> > > > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12352700
> > > > > >
> > > > > > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.14.0-rc3/
> > > > > >
> > > > > > [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
> > > > > >
> > > > > > [4]
> > > > > > https://repository.apache.org/content/repositories/orgapachehudi-1127/
> > > > > >
> > > > > > [5] https://github.com/apache/hudi/releases/tag/release-0.14.0-rc3
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
  

Re: [VOTE] Release 0.12.2, release candidate #1

2022-12-24 Thread Balaji Varadarajan


+1 (binding)

Ran release validation script.

(⎈|dev-core-0:N/A)balaji-varadarajan--NR26725P2G:scripts balaji.varadarajan$ 
./release/validate_staged_release.sh --release=0.12.2 --rc_num=1
/tmp/validation_scratch_dir_001 ~/code/oss/hudi/scripts
Downloading from svn co https://dist.apache.org/repos/dist/dev/hudi
Validating hudi-0.12.2-rc1 with release type "dev"
Checking Checksum of Source Release
Checksum Check of Source Release - [OK]

  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 69274  100 692740 0   131k  0 --:--:-- --:--:-- --:--:--  132k
Checking Signature
Signature Check - [OK]

Checking for binary files in source release
No Binary Files in Source Release? - [OK]

Checking for DISCLAIMER
DISCLAIMER file exists ? [OK]

Checking for LICENSE and NOTICE
License file exists ? [OK]
Notice file exists ? [OK]

Performing custom Licensing Check 
Licensing Check Passed [OK]

Running RAT Check
RAT Check Passed [OK]

~/code/oss/hudi/scripts
(⎈|dev-core-0:N/A)balaji-varadarajan--NR26725P2G:scripts balaji.varadarajan$ 
echo $?
0


On 2022/12/24 17:51:22 Shiyan Xu wrote:
> +1 (binding)
> 
> Validated bundle jars directly by running sanity tests.
> 
> On Sat, Dec 24, 2022 at 4:21 AM Alexey Kudinkin 
> wrote:
> 
> > +1 (non-binding)
> >
> > [OK] Built successfully for Spark 2.4, 3.x
> > [OK] Run Spark SQL tests
> >
> > On Fri, Dec 23, 2022 at 12:19 PM Y Ethan Guo  wrote:
> >
> > > +1 non-binding
> > >
> > > [OK] checksums and signatures
> > > [OK] ran release validation script
> > > [OK] built successfully (Spark 2.4, 3.3)
> > > [OK] Spark 3.3.1 quickstart guide
> > >
> > > On Fri, Dec 23, 2022 at 1:30 AM Bhavani Sudha 
> > > wrote:
> > >
> > > > +1 binding
> > > >
> > > > [OK] Build successfully multiple supported spark versions
> > > >
> > > > [OK] Ran validation script
> > > >
> > > > [OK] Ran QuickStart on spark 3.2
> > > >
> > > >
> > > > ./release/validate_staged_release.sh --release=0.12.2 --rc_num=1
> > > >
> > > > /tmp/validation_scratch_dir_001 ~/hudi/scripts
> > > >
> > > > Downloading from svn co https://dist.apache.org/repos/dist/dev/hudi
> > > >
> > > > Validating hudi-0.12.2-rc1 with release type "dev"
> > > >
> > > > Checking Checksum of Source Release
> > > >
> > > > Checksum Check of Source Release - [OK]
> > > >
> > > >
> > > >   % Total% Received % Xferd  Average Speed   TimeTime Time
> > > > Current
> > > >
> > > >  Dload  Upload   Total   SpentLeft
> > > > Speed
> > > >
> > > > 100 69274  100 692740 0  97810  0 --:--:-- --:--:--
> > --:--:--
> > > > 98962
> > > >
> > > > Checking Signature
> > > >
> > > > Signature Check - [OK]
> > > >
> > > >
> > > > Checking for binary files in source release
> > > >
> > > > No Binary Files in Source Release? - [OK]
> > > >
> > > >
> > > > Checking for DISCLAIMER
> > > >
> > > > DISCLAIMER file exists ? [OK]
> > > >
> > > >
> > > > Checking for LICENSE and NOTICE
> > > >
> > > > License file exists ? [OK]
> > > >
> > > > Notice file exists ? [OK]
> > > >
> > > >
> > > > Performing custom Licensing Check
> > > >
> > > > Licensing Check Passed [OK]
> > > >
> > > >
> > > > Running RAT Check
> > > >
> > > > RAT Check Passed [OK]
> > > >
> > > >
> > > > ~/hudi/scripts
> > > >
> > > >
> > > >
> > > > On Thu, Dec 22, 2022 at 8:18 PM sagar sumit  wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > Ran long-running deltastreamer.
> > > > > Validated meta sync and queried tables through Presto/Trino.
> > > > >
> > > > > On Fri, Dec 23, 2022 at 5:14 AM Sivabalan 
> > wrote:
> > > > >
> > > > > > +1 binding.

Re: [VOTE] Release 0.12.0, release candidate #2

2022-08-15 Thread Balaji Varadarajan
 +1 (binding)
On Monday, August 15, 2022 at 08:42:08 AM PDT, Rahil C 
 wrote:  
 
 +1

-Rahil C
On Mon, Aug 15, 2022 at 8:07 AM Nishith  wrote:

> +1 (binding)
>
> -Nishith
>
> > On Aug 15, 2022, at 12:20 AM, Shiyan Xu 
> wrote:
> >
> > +1 (binding)
> >
> > Manually ran deltastreamer job with different spark versions and passed.
> > Azure CI passed:
> >
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10726&view=results
> > GitHub Actions CI passed:
> https://github.com/apache/hudi/tree/release-0.12.0
> >
> >
> >> On Mon, Aug 15, 2022 at 12:45 AM Sivabalan  wrote:
> >>
> >> +1 binding.
> >>
> >> [OK] Built successfully
> >> [OK] Ran validation script
> >> [OK] Verified checksum
> >> [OK] Ran quickstart tests
> >> [OK] Ran deltastreamer job and ran some validations.
> >>
> >>
> >>> On Sun, 14 Aug 2022 at 22:32, Vinoth Chandar 
> wrote:
> >>>
> >>> +1 (binding)
> >>>
> >>> On Sun, Aug 14, 2022 at 14:50 Bhavani Sudha 
> >>> wrote:
> >>>
>  +1 (binding)
> 
> 
>  [OK] Build successfully all supported spark version
> 
>  [OK] Ran validation script
> 
>  [OK] Ran quickstart tests with spark 2.4
> 
>  [OK] Ran some IDE tests
> 
> 
>  sudha[9:33:26] scripts % ./release/validate_staged_release.sh
>  --release=0.12.0 --rc_num=2
> 
>  /tmp/validation_scratch_dir_001 ~/hudi/scripts
> 
>  Downloading from svn co https://dist.apache.org/repos/dist/dev/hudi
> 
>  Validating hudi-0.12.0-rc2 with release type "dev"
> 
>  Checking Checksum of Source Release
> 
>  Checksum Check of Source Release - [OK]
> 
> 
>   % Total    % Received % Xferd  Average Speed  Time    Time    Time
>  Current
> 
>                                 Dload  Upload  Total  Spent    Left
>  Speed
> 
>  100 62287  100 62287    0    0  39174      0  0:00:01  0:00:01
> >> --:--:--
>  39149
> 
>  Checking Signature
> 
>  Signature Check - [OK]
> 
> 
>  Checking for binary files in source release
> 
>  No Binary Files in Source Release? - [OK]
> 
> 
>  Checking for DISCLAIMER
> 
>  DISCLAIMER file exists ? [OK]
> 
> 
>  Checking for LICENSE and NOTICE
> 
>  License file exists ? [OK]
> 
>  Notice file exists ? [OK]
> 
> 
>  Performing custom Licensing Check
> 
>  Licensing Check Passed [OK]
> 
> 
>  Running RAT Check
> 
>  RAT Check Passed [OK]
> 
> 
>  Thanks,
> 
>  Sudha
> 
>  On Sun, Aug 14, 2022 at 11:16 AM Y Ethan Guo 
> wrote:
> 
> > +1 (non-binding)
> >
> > - [OK] checksums and signatures
> > - [OK] ran release validation script
> > - [OK] built successfully (Spark 2.4, 3.2, 3.3)
> > - [OK] ran Spark quickstart with Spark 3.3.0
> > - [OK] ran a few tests on schema evolution
> > - [OK] Presto connector performance
> >
> > Best,
> > - Ethan
> >
> > On Thu, Aug 11, 2022 at 5:22 AM sagar sumit 
> >> wrote:
> >
> >> Hi everyone,
> >>
> >> Please review and vote on the release candidate #2 for the version
> > 0.12.0,
> >> as follows:
> >>
> >> [ ] +1, Approve the release
> >> [ ] -1, Do not approve the release (please provide specific
> >> comments)
> >>
> >> The complete staging area is available for your review, which
> >>> includes:
> >>
> >> * JIRA release notes [1],
> >> * the official Apache source release and binary convenience
> >> releases
> >>> to
> > be
> >> deployed to dist.apache.org [2], which are signed with the key
> >> with
> >> fingerprint FD215342E3199419ADFBF41DD4623E3AA16D75B0 [3],
> >> * all artifacts to be deployed to the Maven Central Repository [4],
> >> * source code tag "release-0.12.0-rc2" [5],
> >>
> >> The vote will be open for at least 72 hours. It is adopted by
> >>> majority
> >> approval, with at least 3 PMC affirmative votes.
> >>
> >> Thanks,
> >> Release Manager
> >>
> >> [1]
> >>
> >>
> >
> 
> >>>
> >>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12351209
> >> [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.12.0-rc2/
> >> [3] https://dist.apache.org/repos/dist/dev/hudi/KEYS
> >> [4]
> >
> >> https://repository.apache.org/content/repositories/orgapachehudi-1090/
> >> [5] https://github.com/apache/hudi/releases/tag/release-0.12.0-rc2
> >>
> >
> 
> >>>
> >>
> >>
> >> --
> >> Regards,
> >> -Sivabalan
> >>
> >
> >
> > --
> > Best,
> > Shiyan
>
  

Re: Next stop : Minor Or Major release?

2022-02-18 Thread Balaji Varadarajan
+1 on option B.

Balaji.V

On Thu, Feb 17, 2022 at 11:20 PM Nishith  wrote:

> +1 to B for the same reasons
>
> -Nishith
>
> > On Feb 17, 2022, at 9:22 PM, Vinoth Chandar  wrote:
> >
> > +1 on B as well. same rationale as Raymond's. I think we have all major
> > chunks landed or PRs up.
> > Love to provide integration testing before the release.
> >
> >> On Thu, Feb 17, 2022 at 4:25 PM Raymond Xu  >
> >> wrote:
> >>
> >> I'm +1 to B. There are really awesome features planned for 0.11.0.
> Hoping
> >> to see these more thoroughly tested in the major release.
> >>
> >>
> >> --
> >> Best,
> >> Raymond
> >>
> >>
> >>> On Wed, Feb 16, 2022 at 5:13 AM Sivabalan  wrote:
> >>>
> >>> Hi folks,
> >>>   As Hudi community has been very active and is used by many across the
> >>> globe, we would like to have a continuous train of releases. Every 2
> to 3
> >>> months a major release and immediately following the major release, a
> >> minor
> >>> bug fix release(which we agreed upon as a community). If we look at the
> >>> roadmap laid out here , we may not be
> >>> able
> >>> to meet the deadline if we plan for a major release by Feb end. Even if
> >> not
> >>> all, we are looking to complete a sizable features. We might need
> >> atleast 2
> >>> weeks for proper integration testing.
> >>>
> >>> Having said that we did a minor bug fix release of 0.10.1 by Jan 26th,
> we
> >>> have two options with us.
> >>>
> >>> Option A: Do another minor bug fix release by end of Feb. And do 0.11
> by
> >>> end of March.
> >>> Option B: We can go for 0.11 by end of march w/o needing another bug
> fix
> >>> release in between since we just had a release 2 weeks back.
> >>>
> >>> Do remember that, if we plan to go w/ 0.10.2, it might be yet another
> bug
> >>> fix release. and so we may not get any features as such which went in
> >> after
> >>> 0.10.0.
> >>>
> >>> Wanted to hear your thoughts and opinions.
> >>>
> >>>
> >>> --
> >>> Regards,
> >>> -Sivabalan
> >>>
> >>
>
> -
> To unsubscribe, e-mail: users-unsubscr...@hudi.apache.org
> For additional commands, e-mail: users-h...@hudi.apache.org
>
>


Re: [VOTE] Release 0.10.1, release candidate #2

2022-01-24 Thread Balaji Varadarajan
 +1 binding. RC passed.
Balaji.V

On Monday, January 24, 2022, 10:28:58 AM PST, Bhavani Sudha 
 wrote:  
 
 +1 binding

Ran RC check, quickstart and some IDE tests.

Thanks,
Sudha

On Mon, Jan 24, 2022 at 9:23 AM sagar sumit  wrote:

> +1
>
> - Builds for Spark2/3 [OK]
> - Spark quickstart [OK]
> - Docker Demo (Hive/Presto querying) [OK]
> - Long-running deltastreamer continuous mode with async
> compaction/clustering [OK]
>
> Regards,
> Sagar
>
> On Mon, Jan 24, 2022 at 10:23 PM Sivabalan  wrote:
>
>> Hey folks,
>>      Can we get some attention on this? I expect participation from PMCs
>> and committers at least. Would appreciate it if you folks can spare some time
>> on RC testing and voting.
>>
>>
>> On Mon, 24 Jan 2022 at 07:54, Pratyaksh Sharma 
>> wrote:
>>
>> > +1
>> >
>> > - Compilation OK
>> > - Validation script OK
>> >
>> > On Sun, Jan 23, 2022 at 8:09 PM Nishith  wrote:
>> >
>> > > +1 binding
>> > >
>> > > -Nishith
>> > >
>> > > > On Jan 22, 2022, at 7:49 PM, Vinoth Chandar 
>> wrote:
>> > > >
>> > > > +1 (binding)
>> > > >
>> > > > Ran my rc checks on updated link and changing my vote to a +1
>> > > >
>> > > >> On Sat, Jan 22, 2022 at 4:10 AM Sivabalan 
>> wrote:
>> > > >>
>> > > >> my bad, the link([2]) was wrong. It is
>> > > >> https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.1-rc2/.
>> > > >> Can you take a look please?
>> > > >>
>> > > >>> On Sat, 22 Jan 2022 at 00:08, Vinoth Chandar 
>> > > wrote:
>> > > >>>
>> > > >>> -1
>> > > >>>
>> > > >>> The artifact version is wrong! It should be 0.10.*1*
>> > > >>>
>> > > >>>
>> > > >>>  - hudi-0.10.0-rc2.src.tgz
>> > > >>>    <https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.0-rc2/hudi-0.10.0-rc2.src.tgz>
>> > > >>>  - hudi-0.10.0-rc2.src.tgz.asc
>> > > >>>    <https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.0-rc2/hudi-0.10.0-rc2.src.tgz.asc>
>> > > >>>  - hudi-0.10.0-rc2.src.tgz.sha512
>> > > >>>    <https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.0-rc2/hudi-0.10.0-rc2.src.tgz.sha512>
>> > > >>>
>> > > >>> grep version hudi-0.10.0-rc2/pom.xml | grep rc2
>> > > >>>   0.10.0-rc2
>> > > >>>
>> > > >>>
>> > > >>> Why are all the arc
>> > > >>>
>> > >  On Thu, Jan 20, 2022 at 3:53 AM Sivabalan 
>> > wrote:
>> > > >>>
>> > >  Hi everyone,
>> > > 
>> > >  Please review and vote on the release candidate #2 for the
>> version
>> > > >>> 0.10.1,
>> > >  as follows:
>> > > 
>> > >  [ ] +1, Approve the release
>> > > 
>> > >  [ ] -1, Do not approve the release (please provide specific
>> > comments)
>> > > 
>> > > 
>> > >  The complete staging area is available for your review, which
>> > > includes:
>> > > 
>> > >  * JIRA release notes [1],
>> > > 
>> > >  * the official Apache source release and binary convenience
>> releases
>> > > to
>> > > >>> be
>> > >  deployed to dist.apache.org [2], which are signed with the key
>> with
>> > >  fingerprint ACD52A06633DB3B2C7D0EA5642CA2D3ED5895122 [3],
>> > > 
>> > >  * all artifacts to be deployed to the Maven Central Repository
>> [4],
>> > > 
>> > >  * source code tag "release-0.10.1-rc2" [5],
>> > > 
>> > > 
>> > >  The vote will be open for at least 72 hours. It is adopted by
>> > majority
>> > >  approval, with at least 3 PMC affirmative votes.
>> > > 
>> > > 
>> > >  Thanks,
>> > >  Release Manager
>> > > 
>> > > 
>> > >  [1]
>> > > 
>> > > 
>> > > >>>
>> > > >>
>> > >
>> >
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12351135
>> > > 
>> > >  [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.0-rc2
>> > > 
>> > >  [3] https://dist.apache.org/repos/dist/dev/hudi/KEYS
>> > > 
>> > >  [4]
>> > > 
>> > > 
>> > > >>>
>> > > >>
>> > >
>> >
>> https://repository.apache.org/content/repositories/orgapachehudi-1052/org/apache/hudi/
>> > > 
>> > >  [5] https://github.com/apache/hudi/tree/release-0.10.1-rc2
>> > > 
>> > >  --
>> > >  Regards,
>> > >  -Sivabalan
>> > > 
>> > > >>>
>> > > >>
>> > > >>
>> > > >> --
>> > > >> Regards,
>> > > >> -Sivabalan
>> > > >>
>> > >
>> >
>>
>>
>> --
>> Regards,
>> -Sivabalan
>>
>
  

Re: [VOTE] Release 0.10.0, release candidate #3

2021-12-06 Thread Balaji Varadarajan
 +1 (binding)
- Package Build successful
- Overnight staging test - Data Validation successful for COW upsert workload.


On Monday, December 6, 2021, 06:40:32 AM PST, vino yang 
 wrote:  
 
 +1 (binding)

- build successfully
- ran spark quickstart
- verified checksum

Best,
Vino

Y Ethan Guo  于2021年12月6日周一 14:25写道:

> +1 (non-binding)
>
> - [OK] Ran release validation script [1]
> - [OK] Built the source (Spark 2/3)
> - [OK] Ran Spark Guide in Quick Start using Spark 3.1.2
>
> [1] https://gist.github.com/yihua/39ef5b07a08ed5780fa9c43819b326cb
>
> Best,
> - Ethan
>
> On Sat, Dec 4, 2021 at 1:27 PM Bhavani Sudha 
> wrote:
>
> > +1 (binding)
> >
> > - [OK] checksums and signatures
> > - [OK] ran validation script
> > - [OK] built successfully
> > - [OK] ran spark quickstart
> > - [OK] Ran few tests in IDE
> >
> >
> >
> > bsaktheeswaran@Bhavanis-MacBook-Pro scripts %
> > ./release/validate_staged_release.sh --release=0.10.0 --rc_num=3
> > /tmp/validation_scratch_dir_001 ~/Sudha/hudi/scripts
> > Downloading from svn co https://dist.apache.org/repos/dist//dev/hudi
> > Validating hudi-0.10.0-rc3 with release type "dev"
> > Checking Checksum of Source Release
> > Checksum Check of Source Release - [OK]
> >
> >  % Total    % Received % Xferd  Average Speed  Time    Time    Time
> >  Current
> >                                  Dload  Upload  Total  Spent    Left
> >  Speed
> > 100 45904  100 45904    0    0  85323      0 --:--:-- --:--:-- --:--:--
> > 85165
> > Checking Signature
> > Signature Check - [OK]
> >
> > Checking for binary files in source release
> > No Binary Files in Source Release? - [OK]
> >
> > Checking for DISCLAIMER
> > DISCLAIMER file exists ? [OK]
> >
> > Checking for LICENSE and NOTICE
> > License file exists ? [OK]
> > Notice file exists ? [OK]
> >
> > Performing custom Licensing Check
> > Licensing Check Passed [OK]
> >
> > Running RAT Check
> > RAT Check Passed [OK]
> >
> > Thanks,
> > Sudha
> >
> > On Sat, Dec 4, 2021 at 6:59 AM Vinoth Chandar  wrote:
> >
> > > +1 (binding)
> > >
> > > Ran the RC checks in [1] . This is a huge release, thanks everyone for
> > all
> > > the hard work!
> > >
> > > [1]
> > https://gist.github.com/vinothchandar/68b34f3051e41752ebffd6a3edeb042b
> > >
> > > On Sat, Dec 4, 2021 at 5:20 AM Danny Chan 
> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Please review and vote on the release candidate #3 for the version
> > > 0.10.0,
> > > > as follows:
> > > >
> > > > [ ] +1, Approve the release
> > > >
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > > The complete staging area is available for your review, which
> includes:
> > > >
> > > > * JIRA release notes [1],
> > > >
> > > > * the official Apache source release and binary convenience releases
> to
> > > be
> > > > deployed to dist.apache.org [2], which are signed with the key with
> > > > fingerprint 9A48922F682AB05D1AE4A3E7C2931E4BDB03D5AE [3],
> > > >
> > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > >
> > > > * source code tag "release-0.10.0-rc3" [5],
> > > >
> > > > The vote will be open for at least 72 hours. It is adopted by
> majority
> > > > approval, with at least 3 PMC affirmative votes.
> > > >
> > > > Thanks,
> > > >
> > > > Release Manager
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12350285
> > > >
> > > > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.0-rc3/
> > > >
> > > > [3] https://dist.apache.org/repos/dist/dev/hudi/KEYS
> > > >
> > > > [4]
> > > >
> > > >
> > >
> >
> https://repository.apache.org/content/repositories/orgapachehudi-1048/org/apache/hudi/
> > > >
> > > > [5] https://github.com/apache/hudi/tree/release-0.10.0-rc3
> > > >
> > >
> >
>
  

Re: [VOTE] Release 0.9.0, release candidate #2

2021-08-23 Thread Balaji Varadarajan
 +1 (binding) 
$ ./release/validate_staged_release.sh --release=${RC_VERSION} --rc_num=2
...
Downloading from svn co https://dist.apache.org/repos/dist//dev/hudi
Validating hudi-0.9.0-rc2 with release type "dev"
Checking Checksum of Source Release
  Checksum Check of Source Release - [OK]

Checking Signature
  Signature Check - [OK]

Checking for binary files in source release
  No Binary Files in Source Release? - [OK]

Checking for DISCLAIMER
  DISCLAIMER file exists ? [OK]

Checking for LICENSE and NOTICE
  License file exists ? [OK]
  Notice file exists ? [OK]

Performing custom Licensing Check
  Licensing Check Passed [OK]

Running RAT Check
  RAT Check Passed [OK]

Balaji.V

On Monday, August 23, 2021, 03:34:37 PM PDT, Bhavani Sudha 
 wrote:  
 
 +1 (binding)

On Mon, Aug 23, 2021 at 3:23 PM Sivabalan  wrote:

> +1 (binding)
>
> 1. Release validation succeeded
> 2. Ran quick start for two variants (spark2, scala11 and spark3, scala12)
> for all operations.
> 3. Ran docker demo and verified all 3 query engines (spark sql, hive,
> presto) and 3 query types(snapshot, read optimized, incremental) across two
> tables.
>
> ./release/validate_staged_release.sh --release=0.9.0 --rc_num=2
> /tmp/validation_scratch_dir_001
> ~/Documents/personal/projects/a_hudi/hudi/scripts
> local dir local_svn_dir
> Downloading from svn co https://dist.apache.org/repos/dist//dev/hudi
> Validating hudi-0.9.0-rc2 with release type "dev"
> Checking Checksum of Source Release
>    Checksum Check of Source Release - [OK]
>
>  % Total    % Received % Xferd  Average Speed  Time    Time    Time
>  Current
>                                  Dload  Upload  Total  Spent    Left
>  Speed
> 100 42380  100 42380    0    0  156k      0 --:--:-- --:--:-- --:--:--
>  156k
> Checking Signature
>    Signature Check - [OK]
>
> Checking for binary files in source release
>    No Binary Files in Source Release? - [OK]
>
> Checking for DISCLAIMER
>    DISCLAIMER file exists ? [OK]
>
> Checking for LICENSE and NOTICE
>    License file exists ? [OK]
>    Notice file exists ? [OK]
>
> Performing custom Licensing Check
>    Licensing Check Passed [OK]
>
> Running RAT Check
>    RAT Check Passed [OK]
>
>
>
>
>
>
>
> On Sun, Aug 22, 2021 at 9:09 PM Vinoth Chandar  wrote:
>
> > +1 (binding)
> >
> > RC check [1] passed
> >
> > [1]
> https://gist.github.com/vinothchandar/68b34f3051e41752ebffd6a3edeb042b
> >
> >
> > On Sun, Aug 22, 2021 at 1:28 PM Sivabalan  wrote:
> >
> > > We can keep the specific discussion out of this voting thread. Have
> > started
> > > a new thread here
> > > <
> > >
> >
> https://lists.apache.org/thread.html/r3bae7622904b04c7d1fb2ddaf5226e37166d5fbb1721f403b1b04545%40%3Cdev.hudi.apache.org%3E
> > > >
> > > to
> > > continue this discussion. We can keep this thread just for voting.
> > Thanks.
> > >
> > > On Sun, Aug 22, 2021 at 2:13 AM Danny Chan 
> wrote:
> > >
> > > > It's not a surprise that 0.9 has a longer release process, the Spark
> > SQL
> > > > was added and many promotions from the Flink engine. We need more
> > > patience
> > > > for this release IMO.
> > > >
> > > > Having another minor release like 0.9.1 is a solution but not a good
> > one,
> > > > people have much more promise to the major release and it carries
> > > > many expectations. If people report the problems during the release
> > > > process, just accept it if it is not a big PR/fix, and there are
> only a
> > > few
> > > > ones up to now. I would not take too much time.
> > > >
> > > > I know that it has been about 4 months since the last release, but
> > people
> > > > want a complete release version not a defective one.
> > > >
> > > > Best,
> > > > Danny
> > > >
> > > > Sivabalan  于2021年8月22日周日 上午11:50写道:
> > > >
> > > > > I would like to share my thoughts on the release process in
> general.
> > I
> > > > will
> > > > > read more about what exactly qualifies for -1 and will look into
> what
> > > > Peng
> > > > > and Danny has put up. But some thoughts on the release in general.
> > > > >
> > > > > Every release process is very tedious and time consuming and RM
> does
> > > put
> > > > in
> > > > > non-trivial amount of work in getting the release out. To make the
> > > > process
> > > > > smooth, RM started an email thread by Aug 3, calling for any
> release
> > > > > blockers. Would like to understand, if these were surfaced in that
> > > > thread?
> > > > > What I am afraid of is, we might keep delaying our release by
> adding
> > > more
> > > > > patches/bug fixes with every candidate. For instance, if we
> consider
> > > > these
> > > > > and RM works on RC3 and puts up a vote in 5 days and what if
> someone
> > > else
> > > > > wants to add a couple of more fixes or improvements to the release?
> > If
> > > > it's
> > > > > a 

Re: [DISCUSS] Enable Github Discussions

2021-08-11 Thread Balaji Varadarajan
+1

Balaji.V

On Wed, Aug 11, 2021 at 7:12 PM Bhavani Sudha 
wrote:

> +1
>
> Thanks,
> Sudha
>
> On Wed, Aug 11, 2021 at 7:08 PM vino yang  wrote:
>
> > +1
> >
> > Best,
> > Vino
> >
> > Pratyaksh Sharma  于2021年8月12日周四 上午2:16写道:
> >
> > > +1
> > >
> > > I have never used it, but we can try this out. :)
> > >
> > > On Thu, Jul 15, 2021 at 9:43 AM Vinoth Chandar 
> > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would like to propose that we explore the use of github
> discussions.
> > > Few
> > > > other apache projects have also been trying this out.
> > > >
> > > > Please chime in
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> >
>


Re: please give me the contributor permission

2021-01-27 Thread Balaji Varadarajan
 Welcome to Apache Hudi Community !! 
I have given contributor permissions. Looking forward to your contributions !!
Balaji.V
On Monday, January 25, 2021, 06:23:57 PM PST, jiangjiguang719 
 wrote:  
 
 Hi,

I want to contribute to Apache Hudi.

Would you please give me the contributor permission?

My JIRA ID is jiangjiguang0719  

Re: [VOTE] Release 0.7.0, release candidate #1

2021-01-21 Thread Balaji Varadarajan
 +1 (binding)
1. Ran release validation script successfully.
2. Build successful.
3. Quickstart succeeded.

Checking Checksum of Source Release
  Checksum Check of Source Release - [OK]

Checking Signature
  Signature Check - [OK]

Checking for binary files in source release
  No Binary Files in Source Release? - [OK]

Checking for DISCLAIMER
  DISCLAIMER file exists ? [OK]

Checking for LICENSE and NOTICE
  License file exists ? [OK]
  Notice file exists ? [OK]

Performing custom Licensing Check
  Licensing Check Passed [OK]

Running RAT Check
  RAT Check Passed [OK]

On Thursday, January 21, 2021, 12:44:15 AM PST, Vinoth Chandar 
 wrote:  
 
 Hi everyone,

Please review and vote on the release candidate #1 for the version 0.7.0,
as follows:

[ ] +1, Approve the release

[ ] -1, Do not approve the release (please provide specific comments)



The complete staging area is available for your review, which includes:

* JIRA release notes [1],

* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint 7F2A3BEB922181B06ACB1AA45F7D09E581D2BCB6 [3],

* all artifacts to be deployed to the Maven Central Repository [4],

* source code tag "release-0.7.0-rc1" [5],



The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.



Thanks,

Release Manager



[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12348721


[2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.7.0-rc1/

[3] https://dist.apache.org/repos/dist/release/hudi/KEYS

[4] https://repository.apache.org/content/repositories/orgapachehudi-1027/

[5] https://github.com/apache/hudi/tree/release-0.7.0-rc1
  

Re: Congrats to our newest committers!

2020-12-03 Thread Balaji Varadarajan
 Very Well deserved !! Many congratulations to Satish and Prashant.
Balaji.V
On Thursday, December 3, 2020, 11:07:09 AM PST, Bhavani Sudha 
 wrote:  
 
 Congratulations Satish and Prashant!
On Thu, Dec 3, 2020 at 11:03 AM Pratyaksh Sharma  wrote:

Congratulations Satish and Prashant!

On Fri, Dec 4, 2020 at 12:22 AM Vinoth Chandar  wrote:

> Hi all,
>
> I am really happy to announce our newest set of committers.
>
> *Satish Kotha*: Satish has ramped up very quickly across our entire code
> base, contributed bug fixes, and also drove large, unique features like
> clustering and replace/overwrite, which are about to go out in the 0.7.0
> release. These efforts largely complete parts of our vision and could not
> have happened without Satish.
>
> *Prashant Wason*: In addition to a number of patches, Prashant has been
> shouldering massive responsibility on RFC-15, and thanks to his efforts we
> have a simplified design and a very solid implementation, which is being
> tested right now for the 0.7.0 release.
>
> Please join me in congratulating them on this great milestone!
>
> Thanks,
> Vinoth
>

  

Re: [DISCUSS] 0.7.0 release timelines

2020-12-02 Thread Balaji Varadarajan
 +1 for (2)
On Wednesday, December 2, 2020, 08:09:29 AM PST, vino yang 
 wrote:  
 
 +1 for option 2

Gary Li  于2020年12月2日周三 下午4:01写道:

> vote for option 2.
> 
> From: nishith agarwal 
> Sent: Wednesday, December 2, 2020 3:16 PM
> To: dev@hudi.apache.org 
> Subject: Re: [DISCUSS] 0.7.0 release timelines
>
> I vote for option 2 as well.
>
> -Nishith
>
> On Tue, Dec 1, 2020 at 10:05 PM Bhavani Sudha 
> wrote:
>
> > I vote for option 2 too.
> >
> > On Tue, Dec 1, 2020 at 7:36 PM Sivabalan  wrote:
> >
> > > I would vote for Option2 given that features are already being tested.
> if
> > > it's half way through development, may be would have given it a
> thought.
> > > But let's hear from the community.
> > >
> > >
> > > On Mon, Nov 30, 2020 at 8:15 PM Vinoth Chandar 
> > wrote:
> > >
> > > > Hello all,
> > > >
> > > > We still have a few features to land for the 0.7.0 release.
> > Specifically,
> > > > RFC-15 and Clustering have PRs, undergoing test/production validation
> > at
> > > > the moment.
> > > >
> > > > Based on the JIRAs, I see two options
> > > >
> > > > Option 1:  Cut RC by next week or so, and push out the larger
> features
> > > to a
> > > > (hopefully quick) 0.8.0. We already have a few large features in
> > > > master/pending PRs (spark3, flink, replace/overwrite etc..)
> > > > Option 2:  Wait till December end to cut RC, with all the originally
> > > > planned feature set.
> > > >
> > > > Please chime in with your thoughts.
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>  

Re: why not use spark datasource in DeltaStreamer

2020-12-01 Thread Balaji Varadarajan
 Regarding RDD vs DataFrame, the historical reason is that the RDD API provided 
the low-level control Hudi needed for managing various aspects of writing. 
On a related note, if you look at the current approach with Flink support, the 
input batch is being parameterized to support different processing engines.
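
To illustrate the idea (a rough sketch only; the trait and method names below are hypothetical, not Hudi's actual classes), the write path can be expressed against a generic input type so that each engine binds it to its own record container:

// Illustrative sketch only: a write client parameterized over the input batch type I.
// The point is that a Spark engine can bind I to an RDD of records while a Flink or
// Java engine binds I to a List of records, without changing the write-path logic.
trait EngineAgnosticWriteClient[I] {
  def upsert(records: I, instantTime: String): I
  def insert(records: I, instantTime: String): I
}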
On Tuesday, December 1, 2020, 02:08:05 AM PST, songj songj 
 wrote:  
 
 thanks for reply!
could you help to explain my 2 questions  above?

Trevor  于2020年12月1日周二 下午5:17写道:

> Hi,songj ,
>
> DeltaStreamer can be understood as a packaged Spark DataSource. You only
> need to set the required parameters, which makes it more convenient for
> data ingest.
>
> Best,
>
> Trevor
>
>
> wowtua...@gmail.com
>
> From: songj songj
> Date: 2020-12-01 16:48
> To: dev
> Subject: Re: why not use spark datasource in DeltaStreamer
> spark structured streaming consume kafka using kafka data source, and
> foreachbatch to do insert/upsert/... to hudi,
> is it similar with DeltaStreamer?
>
> songj songj  于2020年12月1日周二 下午4:28写道:
>
> > hi, I have some questions:
> >
> > 1. DeltaStreamer  has its own Source> to consume source
> > data,
> > such as Kafka, why not use spark datasource directly ?
> >
> > 2. Hudi has lots of logical which use RDD, why not use Spark DataFrame?
> >
> > I just want to know the background of the above implementation, thanks!
> >
>  

Re: Reg weekly sync meeting

2020-11-02 Thread Balaji Varadarajan
 +1
On Sunday, November 1, 2020, 09:13:44 PM PST, Gary Li 
 wrote:  
 
 +1 for biweekly meeting.
Gary Li

From: Vinoth Chandar 
Sent: Friday, October 30, 2020 2:01:22 PM
To: dev@hudi.apache.org ; us...@hudi.apache.org 
Subject: Re: Reg weekly sync meeting

+ users list as well.

On Thu, Oct 29, 2020 at 10:59 PM Bhavani Sudha 
wrote:

> Hello all,
> I was wondering if it would make sense to move the weekly sync meeting to
> bi-weekly to amortize time and be efficient, especially since people across
> different time zones attend. We could still retain the same time but change
> the cadence to one in two weeks instead. What do you think?
>
> Thanks,
> Sudha
>
  

Re: Hudi-1365

2020-11-02 Thread Balaji Varadarajan
 Hi Selvaraj, I have replied in the jira.

Thanks,
Balaji.V

On Sunday, November 1, 2020, 01:17:05 AM PST, selvaraj periyasamy  wrote:  
 
 Team,

Could you look into HUDI-1365? Performance is really heavily impacted for
some reason.

Thanks,
Selva
  

Re: I want to contribute to Apache Hudi

2020-10-29 Thread Balaji Varadarajan
 Welcome to Apache Hudi community. I have added you as a contributor in Jira.
Balaji.V
On Wednesday, October 28, 2020, 08:11:00 PM PDT, jack_zhangsj 
 wrote:  
 
 Hi,

I want to contribute to Apache Hudi. Would you please give me the contributor 
permission? My JIRA ID is  jack_zhangsj .
Thanks !
 
jack  

Re: [EXT] Re: Bucketing in Hudi

2020-10-26 Thread Balaji Varadarajan
 

On Monday, October 26, 2020, 10:01:44 AM PDT, Roopa Murthy 
 wrote:  
 
Hi Balaji,

Surely that will work.

However, we would like to discuss with you and analyze the efforts as well as 
estimate the timelines to get all the relevant changes in. We are evaluating 
other tools as well and our choice would be based on ease of use and amount of 
changes.

When would be a good time to chat today or tomorrow? 

Thanks,
Roopa

From: Balaji Varadarajan 
Date: Thursday, October 22, 2020 at 9:24 PM
To: "dev@hudi.apache.org" 
Cc: DL-AIE 
Subject: Re: [EXT] Re: Bucketing in Hudi

Hi Roopa,

Bucketing is a more general concept. I think what you are referring to is how 
to integrate with spark sql bucketing syntax. I was proposing a Hudi native 
solution where we can implement Bucket indexing which gives the same end result 
of writing compacted (parquet) files with keys hashed to get bucket-id. You can 
then use the Hudi's Spark data source integration to write to this table and 
get bucketized organization.

Let me know if this makes sense. 

Thanks,
Balaji.V

On Thursday, October 22, 2020, 05:23:11 PM PDT, Roopa Murthy 
 wrote:

Hi Balaji,


Thanks for your response. I went through HoodieIndex in source code but I am 
not sure how indexing alone could help with bucketing.

Spark Bucketing would involve writing the compacted files in bucketed/clustered 
fashion such that when a spark sql query has a certain id, only the 
bucket(file) which hashes to that id would be scanned for matching records. 
This means, data during compaction has to be written using Spark’s saveAsTable 
API with bucketBy set to the desired number of buckets. 
Refer:https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-bucketing.html.
 This will create a spark bucketed table having metadata different from Hive 
bucketed tables as Spark cannot understand Hive’s hashing algorithm.

Is this something that Hudi might support?

Thanks,
Roopa

From: Balaji Varadarajan 
Date: Wednesday, October 21, 2020 at 9:01 PM
To: "dev@hudi.apache.org" 
Cc: DL-AIE 
Subject: [EXT] Re: Bucketing in Hudi

Hudi supports pluggable indexing (HoodieIndex) and the phases of index lookup 
are nicely abstracted out. We have a Jira for supporting Bucket Indexing: 
https://issues.apache.org/jira/browse/HUDI-55
 


You can get bucket indexing done by implementing that interface along with 
additional changes for handling initial writes to the partition and for 
bucketing information which IMO is not significant. If you are interested in 
contributing, we would be happy to help you in guiding and landing the change.

Thanks,
Balaji.V




On Wednesday, October 21, 2020, 07:51:07 PM PDT, Roopa Murthy 
 wrote:


Hello Hudi team,

We have a requirement to compact data on s3 but we need bucketing on top of 
compaction so that during query time, only the files relevant to the "id" in 
query would be scanned. We are told that bucketing is not currently supported 
in Hudi. Is it possible to extend Hudi to support it? What does it take to 
extend the framework in order to do this?

We are trying to analyze from timelines perspective whether this is an option 
to consider and need your help in analyzing and planning for it.

Thanks,
Roopa



   

Re: [EXT] Re: Bucketing in Hudi

2020-10-26 Thread Balaji Varadarajan
 Hi Roopa,
Kindly ping me in the Hudi Slack to work out the time within the next couple of days. 
I would also like to understand your use-case better.

Thanks,
Balaji.V


On Monday, October 26, 2020, 10:01:44 AM PDT, Roopa Murthy 
 wrote:  
 
Hi Balaji,

Surely that will work.

However, we would like to discuss with you and analyze the efforts as well as 
estimate the timelines to get all the relevant changes in. We are evaluating 
other tools as well and our choice would be based on ease of use and amount of 
changes.

When would be a good time to chat today or tomorrow? 

Thanks,
Roopa

From: Balaji Varadarajan 
Date: Thursday, October 22, 2020 at 9:24 PM
To: "dev@hudi.apache.org" 
Cc: DL-AIE 
Subject: Re: [EXT] Re: Bucketing in Hudi

Hi Roopa,

Bucketing is a more general concept. I think what you are referring to is how 
to integrate with spark sql bucketing syntax. I was proposing a Hudi native 
solution where we can implement Bucket indexing which gives the same end result 
of writing compacted (parquet) files with keys hashed to get bucket-id. You can 
then use the Hudi's Spark data source integration to write to this table and 
get bucketized organization.

Let me know if this makes sense. 

Thanks,
Balaji.V

On Thursday, October 22, 2020, 05:23:11 PM PDT, Roopa Murthy 
 wrote:

Hi Balaji,


Thanks for your response. I went through HoodieIndex in source code but I am 
not sure how indexing alone could help with bucketing.

Spark Bucketing would involve writing the compacted files in bucketed/clustered 
fashion such that when a spark sql query has a certain id, only the 
bucket(file) which hashes to that id would be scanned for matching records. 
This means, data during compaction has to be written using Spark’s saveAsTable 
API with bucketBy set to the desired number of buckets. 
Refer:https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-bucketing.html.
 This will create a spark bucketed table having metadata different from Hive 
bucketed tables as Spark cannot understand Hive’s hashing algorithm.

Is this something that Hudi might support?

Thanks,
Roopa

From: Balaji Varadarajan 
Date: Wednesday, October 21, 2020 at 9:01 PM
To: "dev@hudi.apache.org" 
Cc: DL-AIE 
Subject: [EXT] Re: Bucketing in Hudi

Hudi supports pluggable indexing (HoodieIndex) and the phases of index lookup 
are nicely abstracted out. We have a Jira for supporting Bucket Indexing: 
https://issues.apache.org/jira/browse/HUDI-55
 


You can get bucket indexing done by implementing that interface along with 
additional changes for handling initial writes to the partition and for 
bucketing information which IMO is not significant. If you are interested in 
contributing, we would be happy to help you in guiding and landing the change.

Thanks,
Balaji.V




On Wednesday, October 21, 2020, 07:51:07 PM PDT, Roopa Murthy 
 wrote:


Hello Hudi team,

We have a requirement to compact data on s3 but we need bucketing on top of 
compaction so that during query time, only the files relevant to the "id" in 
query would be scanned. We are told that bucketing is not currently supported 
in Hudi. Is it possible to extend Hudi to support it? What does it take to 
extend the framework in order to do this?

We are trying to analyze from timelines perspective whether this is an option 
to consider and need your help in analyzing and planning for it.

Thanks,
Roopa



   

Re: [EXT] Re: Bucketing in Hudi

2020-10-22 Thread Balaji Varadarajan
 Hi Roopa,
Bucketing is a more general concept. I think what you are referring to is how 
to integrate with spark sql bucketing syntax.  I was proposing a Hudi native 
solution where we can implement Bucket indexing which gives the same end result 
of writing compacted (parquet) files with keys hashed to get bucket-id. You can 
then use the Hudi's Spark data source integration to write to this table and 
get bucketized organization.
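
As a rough illustration (a sketch only, not Hudi's actual index code), bucket indexing boils down to hashing the record key to a fixed bucket id, so every write and lookup for a key goes to the same file group:

// Illustrative sketch (not Hudi's actual implementation): a record key always hashes
// to the same bucket id, so each key maps deterministically to one file group and an
// index lookup only needs to consult that bucket.
def bucketId(recordKey: String, numBuckets: Int): Int =
  (recordKey.hashCode & Integer.MAX_VALUE) % numBuckets

// Example: with 8 buckets, every write and lookup for key "user_42" lands in one bucket.
val bucket = bucketId("user_42", 8)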
Let me know if this makes sense. 

Thanks,
Balaji.V
On Thursday, October 22, 2020, 05:23:11 PM PDT, Roopa Murthy 
 wrote:  
 
 Hi Balaji,


Thanks for your response. I went through HoodieIndex in source code but I am 
not sure how indexing alone could help with bucketing.

Spark Bucketing would involve writing the compacted files in bucketed/clustered 
fashion such that when a spark sql query has a certain id, only the 
bucket(file) which hashes to that id would be scanned for matching records. 
This means, data during compaction has to be written using Spark’s saveAsTable 
API with bucketBy set to the desired number of buckets. Refer: 
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-bucketing.html.
This will create a spark bucketed table having metadata different from Hive 
bucketed tables as Spark cannot understand Hive’s hashing algorithm.
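
For reference, a minimal sketch of the Spark bucketed write described above (the input path, bucket count, column, and table name are made-up examples):

import org.apache.spark.sql.SparkSession

// Minimal sketch of a Spark SQL bucketed write via saveAsTable; Spark records its own
// bucketing metadata in the metastore, which Hive cannot reuse because the hashing differs.
val spark = SparkSession.builder().appName("bucketing-sketch").enableHiveSupport().getOrCreate()
val df = spark.read.parquet("/data/events")   // hypothetical compacted input

df.write
  .format("parquet")
  .bucketBy(32, "id")        // hash the "id" column into 32 buckets
  .sortBy("id")
  .mode("overwrite")
  .saveAsTable("events_bucketed")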

Is this something that Hudi might support?

Thanks,
Roopa

From: Balaji Varadarajan 
Date: Wednesday, October 21, 2020 at 9:01 PM
To: "dev@hudi.apache.org" 
Cc: DL-AIE 
Subject: [EXT] Re: Bucketing in Hudi

Hudi supports pluggable indexing (HoodieIndex) and the phases of index lookup 
is nicely abstracted out. We have a Jira for supporting Bucket Indexing : 
https://issues.apache.org/jira/browse/HUDI-55<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FHUDI-55=04%7C01%7CRoopa.Murthy%40nortonlifelock.com%7C2ce010453bdf4b0dc4f408d8763f1852%7C94986b1d466f4fc0ab4b5c725603deab%7C0%7C1%7C637389360660893281%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000=BD1ahx8qXtu9S2do74OPOXIWtxmfdAqNT%2F3X64g19Rw%3D=0>

You can get bucket indexing done by implementing that interface along with 
additional changes for handling initial writes to the partition and for 
bucketing information which IMO is not significant. If you are interested in 
contributing, we would be happy to help you in guiding and landing the change.

Thanks,
Balaji.V




On Wednesday, October 21, 2020, 07:51:07 PM PDT, Roopa Murthy 
 wrote:


Hello Hudi team,

We have a requirement to compact data on s3 but we need bucketing on top of 
compaction so that during query time, only the files relevant to the "id" in 
query would be scanned. We are told that bucketing is not currently supported 
in Hudi. Is it possible to extend Hudi to support it? What does it take to 
extend the framework in order to do this?

We are trying to analyze from timelines perspective whether this is an option 
to consider and need your help in analyzing and planning for it.

Thanks,
Roopa



  

Re: Bucketing in Hudi

2020-10-21 Thread Balaji Varadarajan
 Hudi supports pluggable indexing (HoodieIndex) and the phases of index lookup 
are nicely abstracted out. We have a Jira for supporting Bucket Indexing: 
https://issues.apache.org/jira/browse/HUDI-55 
You can get bucket indexing done by implementing that interface, along with 
additional changes for handling initial writes to the partition and for 
bucketing information, which IMO are not significant. If you are interested in 
contributing, we would be happy to guide you and help land the change.
Thanks,
Balaji.V



On Wednesday, October 21, 2020, 07:51:07 PM PDT, Roopa Murthy 
 wrote:  
 
 Hello Hudi team,

We have a requirement to compact data on s3 but we need bucketing on top of 
compaction so that during query time, only the files relevant to the "id" in 
query would be scanned. We are told that bucketing is not currently supported 
in Hudi. Is it possible to extend Hudi to support it? What does it take to 
extend the framework in order to do this?

We are trying to analyze from timelines perspective whether this is an option 
to consider and need your help in analyzing and planning for it.

Thanks,
Roopa



  

Re: Deleting Hudi Partitons

2020-10-21 Thread Balaji Varadarajan
 
Fixing Satish's incorrect email address.

On Wednesday, October 21, 2020, 06:19:43 PM PDT, Balaji Varadarajan  wrote:  
 
  cc Satish who implemented Insert Overwrite support.
We have recently landed Insert Overwrite support in Hudi. Partition level 
deletion is a logical extension of this feature but not currently available 
yet.  I have added a jira to track this : 
https://issues.apache.org/jira/browse/HUDI-1350
Meanwhile, using master branch, you can do this in 2 steps. You can generate a 
record for each partition you want to delete and commit the batch. This would 
essentially truncate the partition to 1 record. You can then issue a hard 
delete on that record.  By keeping cleaner retention to 1, you can essentially 
cleanup the files in the directory. Satish - Can you chime in and see if this 
makes sense and if you are seeing any issues with this ?
Thanks,Balaji.V 
    On Tuesday, October 20, 2020, 11:31:45 PM PDT, selvaraj periyasamy 
 wrote:  
 
 Team ,

I have a COW table which has sub-partition columns
Date/Hour. For some of the use cases, I need to totally remove a few
partitions (removing a few hours alone). Hudi maintains metadata info.
Manually removing folders, as well as the entries in the Hive metastore, may
mess up Hudi metadata. What is the best way to do this?


Thanks,
Selva
    

Re: Deleting Hudi Partitons

2020-10-21 Thread Balaji Varadarajan
 cc Satish who implemented Insert Overwrite support.
We have recently landed Insert Overwrite support in Hudi. Partition level 
deletion is a logical extension of this feature but not currently available 
yet.  I have added a jira to track this : 
https://issues.apache.org/jira/browse/HUDI-1350
Meanwhile, using master branch, you can do this in 2 steps. You can generate a 
record for each partition you want to delete and commit the batch. This would 
essentially truncate the partition to 1 record. You can then issue a hard 
delete on that record.  By keeping cleaner retention to 1, you can essentially 
cleanup the files in the directory. Satish - Can you chime in and see if this 
makes sense and if you are seeing any issues with this ?
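
A rough sketch of those two steps through the Spark datasource (assuming the insert_overwrite operation available on master; the table name, base path, fields, and values below are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-truncate-sketch").getOrCreate()
import spark.implicits._

val basePath = "/tables/my_cow_table"   // hypothetical table base path
// One placeholder record for the partition being emptied (schema is illustrative).
val placeholder = Seq(("tombstone-key", "2020/10/20/00")).toDF("uuid", "partitionpath")

// Step 1: insert_overwrite replaces all file groups in that partition with this record.
placeholder.write.format("hudi")
  .option("hoodie.table.name", "my_cow_table")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
  .option("hoodie.datasource.write.operation", "insert_overwrite")
  .option("hoodie.cleaner.commits.retained", "1")
  .mode("append")
  .save(basePath)

// Step 2: hard-delete the placeholder record, leaving the partition empty.
placeholder.write.format("hudi")
  .option("hoodie.table.name", "my_cow_table")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
  .option("hoodie.datasource.write.operation", "delete")
  .mode("append")
  .save(basePath)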
Thanks,
Balaji.V
On Tuesday, October 20, 2020, 11:31:45 PM PDT, selvaraj periyasamy 
 wrote:  
 
 Team ,

I have a COW table which has sub-partition columns
Date/Hour. For some of the use cases, I need to totally remove a few
partitions (removing a few hours alone). Hudi maintains metadata info.
Manually removing folders, as well as the entries in the Hive metastore, may
mess up Hudi metadata. What is the best way to do this?


Thanks,
Selva
  

Re: Hudi - Concurrent Writes

2020-10-19 Thread Balaji Varadarajan
 
We are planning to add parallel writing to Hudi (at different partition levels) 
in the next release.
Balaji.V

On Friday, October 16, 2020, 11:54:51 PM PDT, tanu dua 
 wrote:  
 
 Hi,
Do we have a support of concurrent writes in 0.6 as I got a similar
requirement to ingest parallely from multiple jobs ? I am ok even if
parallel writes are supported with different partitions.

On Thu, 9 Jul 2020 at 9:22 AM, Vinoth Chandar  wrote:

> We are looking into adding support for parallel writers in 0.6.0. So that
> should help.
>
> I am curious to understand though why you prefer to have 1000 different
> writer jobs, as opposed to having just one writer. Typical use cases for
> parallel writing I have seen are related to backfills and such.
>
> +1 to Mario’s comment. Can’t think of anything else if your users are happy
> querying 1000 tables.
>
> On Wed, Jul 8, 2020 at 7:28 AM Mario de Sá Vera 
> wrote:
>
> > hey Shayan,
> >
> > that seems actually a very good approach ... just curious with the glue
> > metastore you mentioned. Would it be an external metastore for spark to
> > query over ??? external in terms of not managed by Hudi ???
> >
> > that would be my only concern ... how to maintain the sync between all
> > metadata partitions but , again, a very promising approach !
> >
> > regards,
> >
> > Mario.
> >
> > Em qua., 8 de jul. de 2020 às 15:20, Shayan Hati 
> > escreveu:
> >
> > > Hi folks,
> > >
> > > We have a use-case where we want to ingest data concurrently for
> > different
> > > partitions. Currently Hudi doesn't support concurrent writes on the
> same
> > > Hudi table.
> > >
> > > One of the approaches we were thinking was to use one hudi table per
> > > partition of data. So let us say we have 1000 partitions, we will have
> > 1000
> > > Hudi tables which will enable us to write concurrently on each
> partition.
> > > And the metadata for each partition will be synced to a single
> metastore
> > > table (Assumption here is schema is same for all partitions). So this
> > > single metastore table can be used for all the spark, hive queries when
> > > querying data. Basically this metastore glues all the different hudi
> > table
> > > data together in a single table.
> > >
> > > We already tested this approach and its working fine and each partition
> > > will have its own timeline and hudi table.
> > >
> > > We wanted to know if there are some gotchas or any other issues with
> this
> > > approach to enable concurrent writes? Or if there are any other
> > approaches
> > > we can take?
> > >
> > > Thanks,
> > > Shayan
> > >
> >
>  

Re: Hudi Query Latest Records

2020-10-09 Thread Balaji Varadarajan
 The table description looks OK. Are you seeing an exception or incorrect data? 
This might require some debugging. Please open a support GitHub ticket and we 
will look at it. Please provide the same query output in Hive and Spark, along with 
file listings of your dataset and the .hoodie folder.
Thanks,
Balaji.V
On Friday, October 9, 2020, 01:25:58 AM PDT, Ranganath Tirumala 
 wrote:  
 
 Hi Balaji,

Here is the desc formatted

col_name    data_type    comment    
# col_name                data_type              comment                
    NULL    NULL    
_hoodie_commit_time    string        
_hoodie_commit_seqno    string        
_hoodie_record_key    string        
_hoodie_partition_path    string        
_hoodie_file_name    string        
ee_id    bigint        
er_id    bigint        
evnt_src    string        
evnt_typ    string        
evnt_confidence    string        
evnt_yr    string        
evnt_src_id    string        
evnt_amt    string        
evnt_prtn    string        
evnt_sys_dt    string        
evnt_bus_dt    string        
evnt_strt_dt    string        
evnt_end_dt    string        
evnt_id    string        
    NULL    NULL    
# Detailed Table Information    NULL    NULL    
Database:              default              NULL    
OwnerType:              USER                    NULL    
Owner:                  user999                  NULL    
CreateTime:            Wed Oct 07 22:17:42 AEDT 2020    NULL    
LastAccessTime:        UNKNOWN                NULL    
Retention:              0                      NULL    
Location:              hdfs://path-to-external-table    NULL    
Table Type:            EXTERNAL_TABLE          NULL    
Table Parameters:    NULL    NULL    
    EXTERNAL                TRUE                    
    last_commit_time_sync    20201009072526          
    numFiles                2619                    
    totalSize              51903292933            
    transient_lastDdlTime    1602069462              
    NULL    NULL    
# Storage Information    NULL    NULL    
SerDe Library:
    org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe    NULL
InputFormat:            org.apache.hudi.hadoop.HoodieParquetInputFormat    NULL 
   
OutputFormat:
    org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat    NULL
Compressed:            No                      NULL    
Num Buckets:            -1                      NULL    
Bucket Columns:        []                      NULL    
Sort Columns:          []                      NULL    
Storage Desc Params:    NULL    NULL    
    serialization.format    1


On Fri, 9 Oct 2020 at 19:07, Balaji Varadarajan 
wrote:

>  Can you paste the detailed hive table description. (desc formatted .)
> Balaji.V
>    On Friday, October 9, 2020, 12:37:19 AM PDT, Ranganath Tirumala <
> ranganath.tirum...@gmail.com> wrote:
>
>  Hi Balaji,
>
> I cannot get this to work on hive / hue.
> It works as expected using spark shell.
>
> Any idea how I can get this to work in hive / hue?
>
> Regards,
>
> Ranganath
>
> On Thu, 1 Oct 2020 at 09:45, Balaji Varadarajan  >
> wrote:
>
> >  Assuming commit1 happened before commit2, this is what you should expect
> > when running a standard query through query engines.
> > Balaji.V
> >
> >    On Tuesday, September 29, 2020, 03:04:17 PM PDT, Ranganath Tirumala <
> > ranganath.tirum...@gmail.com> wrote:
> >
> >  Hi,
> >
> > Is there a way we can query to get the latest record across commits?
> >
> > e.g.
> > commit-1
> > Record-1, Value A
> > Record-2, Value A
> >
> > commit-2
> > Record-1, Value B
> > Record-3, Value B
> >
> > desired output
> > Record-1, Value B
> > Record-2, Value A
> > Record-3, Value B
> >
> > --
> > Regards,
> >
> > Ranganath Tirumala
> >
>
>
>
> --
> Regards,
>
> Ranganath Tirumala
>



-- 
Regards,

Ranganath Tirumala
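
For reference, the behaviour Balaji describes -- a plain snapshot read returning one row per record key, with the values from the latest commit that touched that key -- looks roughly like this from spark-shell; the base path, glob depth and selected columns are placeholders.

// Snapshot read of a Hudi table: for the example above, Record-1 comes back with
// Value B (commit-2), Record-2 with Value A (commit-1), Record-3 with Value B.
val basePath = "hdfs://path-to-external-table"               // placeholder
val df = spark.read.format("hudi").load(basePath + "/*/*")   // glob depth depends on partitioning
df.select("_hoodie_record_key", "_hoodie_commit_time", "evnt_id").show(false)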
  

Re: Hudi Query Latest Records

2020-10-09 Thread Balaji Varadarajan
 Can you paste the detailed hive table description. (desc formatted .)
Balaji.V
On Friday, October 9, 2020, 12:37:19 AM PDT, Ranganath Tirumala 
 wrote:  
 
 Hi Balaji,

I cannot get this to work on hive / hue.
It works as expected using spark shell.

Any idea how I can get this to work in hive / hue?

Regards,

Ranganath

On Thu, 1 Oct 2020 at 09:45, Balaji Varadarajan 
wrote:

>  Assuming commit1 happened before commit2, this is what you should expect
> when running a standard query through query engines.
> Balaji.V
>
>    On Tuesday, September 29, 2020, 03:04:17 PM PDT, Ranganath Tirumala <
> ranganath.tirum...@gmail.com> wrote:
>
>  Hi,
>
> Is there a way we can query to get the latest record across commits?
>
> e.g.
> commit-1
> Record-1, Value A
> Record-2, Value A
>
> commit-2
> Record-1, Value B
> Record-3, Value B
>
> desired output
> Record-1, Value B
> Record-2, Value A
> Record-3, Value B
>
> --
> Regards,
>
> Ranganath Tirumala
>



-- 
Regards,

Ranganath Tirumala
  

Re: Hudi Query Latest Records

2020-09-30 Thread Balaji Varadarajan
 Assuming commit1 happened before commit2, this is what you should expect when 
running a standard query through query engines.
Balaji.V

On Tuesday, September 29, 2020, 03:04:17 PM PDT, Ranganath Tirumala 
 wrote:  
 
 Hi,

Is there a way we can query to get the latest record across commits?

e.g.
commit-1
Record-1, Value A
Record-2, Value A

commit-2
Record-1, Value B
Record-3, Value B

desired output
Record-1, Value B
Record-2, Value A
Record-3, Value B

-- 
Regards,

Ranganath Tirumala
  

Re: Apache Hudi Data Reconciliation

2020-09-12 Thread Balaji Varadarajan
 
Hi Jialun,
There is no outside documentation for this case except the Javadocs 
(https://issues.apache.org/jira/browse/HUDI-1277). The payload interface is 
itself a first-class citizen of Hudi ( 
https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java).
 
We will add generic support for this case 
(https://issues.apache.org/jira/browse/HUDI-1278). You can write a specific 
implementation for your case, or you can also contribute to HUDI-1278 
and I can work with you to get this landed.
Thanks,
Balaji.V





On Thursday, September 10, 2020, 11:05:44 AM PDT, Jialun Liu 
 wrote:  
 
 Hey Gray,

Thanks for replying so quickly!

Could you please point me to the documentation of this feature? I would
love to take a closer look at it, thanks!

Best regards,
Bill

On Thu, Sep 10, 2020 at 12:20 AM Gary Li  wrote:

> Hello.
> Yes this feature was supported by Hudi. You can write your own payload
> class to handle precombine(dedup within delta) and
> updateHistoryRecord(delta merge with history). The default payload is
> updateWithLatestRecord.
>
> Gary Li
> 
> From: Jialun Liu 
> Sent: Thursday, September 10, 2020 1:28:09 PM
> To: dev@hudi.apache.org 
> Subject: Apache Hudi Data Reconciliation
>
> Hey guys,
>
> I want to confirm if Apache Hudi has the capability of handling data
> reconciliation for use cases like late record, out of order records, retry
> etc.
>
> A simple example:
> @11:00
> RecordA, updatedAt = 11:00 (failed to update)
>
> @11:30
> RecordA, updatedAt = 11:30 (success)
>
> @12:00 (Retry the failed update)
> RecordA, updatedAt = 11:00 (should drop the record since it is stale)
>
> I know delta lake can update based on conditions so that I can use the
> updatedAt timestamp as the key. But how does Hudi do data reconciliation?
>
> Best regards,
> Bill
>
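
A minimal sketch of the knobs Gary refers to, on the Spark DataSource write path; the option keys are the standard write options, and the payload class name is a placeholder for a custom implementation of HoodieRecordPayload that compares updatedAt when merging against history.

// Within a single incoming batch, records sharing a key are deduplicated using the
// precombine field, so the updatedAt = 11:00 retry loses to 11:30 if both arrive together.
// Dropping a stale retry that arrives in a *later* commit needs a payload class that
// compares updatedAt against the stored record (the class named below is hypothetical).
df.write.format("hudi")
  .option("hoodie.table.name", "records")
  .option("hoodie.datasource.write.recordkey.field", "recordId")
  .option("hoodie.datasource.write.precombine.field", "updatedAt")
  .option("hoodie.datasource.write.payload.class", "com.example.KeepLatestByUpdatedAtPayload")
  .mode("append")
  .save("s3://bucket/records")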
  

Re: [Question] HoodieROTablePathFilter not accept dir path

2020-09-11 Thread Balaji Varadarajan
 Hi Raymond,

HoodieROPathFilter is supposed to return true only for files belonging to the 
latest version when the path refers to a Hudi partition, and to return true when 
the path refers to a non-Hudi partition or dataset.  
I looked at the test case you referred to. It only works because the path filter 
wrongly assumes it is a non-Hudi path. You can run this in debug mode to see 
the code path. From the usage perspective, this is used only from Spark 
(InMemoryFileIndex), where only the files are passed to this filter. So, I 
wouldn't classify this as a bug. But it makes sense to make it consistent for 
both cases.
Balaji.V

Thanks,
Balaji.V

On Wednesday, September 9, 2020, 07:45:09 AM PDT, Raymond Xu 
 wrote:  
 
 Hi Balaji, not sure if I fully get it.
I'm attempting to refer to this test case
https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/TestHoodieROTablePathFilter.java#L63-L65

where a partition path is supposed to be accepted.
If I change L64 to
Path partitionPath = new Path(Paths.get(basePath, "2017/01/01").toUri());

Then it resulted in not being accepted due to partitionPath ending with `/`
(a directory path). To me, this seems to be a corner case not being
covered. Could you kindly confirm the expectation please? Thanks.

On Tue, Sep 8, 2020 at 8:58 PM Balaji Varadarajan
 wrote:

>  Hi Raymond,
> IIRC, we need to give a blob path to make  HoodieROTablePathFilter to work
> correctly (e.g: "base/partition/*"). The path-cache is at partition level
> and not at table level so we need to extract the partition-path correctly
> to be used as look-up key. To extract partition-path, the challenge here is
> "Path" type does not have APIs to quickly figure if a path is a directory
> or not and we should avoid making RPC calls here.
> Thanks,Balaji.V
>    On Tuesday, September 8, 2020, 09:56:49 AM PDT, Raymond Xu <
> xu.shiyan.raym...@gmail.com> wrote:
>
>
> https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L120-L121
>
> As shown in the 2 lines above, it does not seem to work with directory
> Path.
> It should work for both `new Path("base/partition")` and `new
> Path("base/partition/")`, but it only works for the former case. In the
> latter case, `folder` will be "base/partition" and `path` will be
> "base/partition/", which will always result in returning false.
> A potential bug?
>
  

Re: Request to Add in Contributor list

2020-09-09 Thread Balaji Varadarajan
 Added. Welcome to the Hudi community. 
Balaji.V
On Tuesday, September 8, 2020, 09:31:37 PM PDT, Mani Jindal 
 wrote:  
 
 Hi team

Please guide me on how I can request contributor access for JIRA so
that I can assign some JIRA tickets to myself and contribute to the Hudi
community.

JIRA Username:  *manijndl77*
Email:  *manijn...@gmail.com *
Full Name : *Mani Jindal*

Thanks and Regards
Mani Jindal
  

Re: [Question] Redundant release tag?

2020-09-08 Thread Balaji Varadarajan
 
Deleted.
Thanks,
Balaji.V

On Tuesday, September 8, 2020, 08:51:36 PM PDT, Raymond Xu 
 wrote:  
 
 I think there is a mistakenly created version tag 0.60 in JIRA; the number
does not seem to follow the release format.
Anyone care to delete this?
https://issues.apache.org/jira/projects/HUDI/versions/12348551
  

Re: [Question] HoodieROTablePathFilter not accept dir path

2020-09-08 Thread Balaji Varadarajan
 Hi Raymond,
IIRC, we need to give a blob path to make  HoodieROTablePathFilter to work 
correctly (e.g: "base/partition/*"). The path-cache is at partition level and 
not at table level so we need to extract the partition-path correctly to be 
used as look-up key. To extract partition-path, the challenge here is "Path" 
type does not have APIs to quickly figure if a path is a directory or not and 
we should avoid making RPC calls here. 
Thanks,
Balaji.V
On Tuesday, September 8, 2020, 09:56:49 AM PDT, Raymond Xu 
 wrote:  
 
 
https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L120-L121

As shown in the 2 lines above, it does not seem to work with directory
Path.
It should work for both `new Path("base/partition")` and `new
Path("base/partition/")`, but it only works for the former case. In the
latter case, `folder` will be "base/partition" and `path` will be
"base/partition/", which will always result in returning false.
A potential bug?
  

Re: [DISCUSS] New Community Weekly Sync up Time

2020-09-08 Thread Balaji Varadarajan
 +1
On Tuesday, September 8, 2020, 05:54:52 PM PDT, Mehrotra, Udit 
 wrote:  
 
 I am okay with this too.

On 9/8/20, 5:33 PM, "Raymond Xu"  wrote:




    I'm ok with 1 hr earlier.

    On Tue, Sep 8, 2020, 5:09 PM Vinoth Chandar  wrote:

    > Anyone else wants to chime in for a new time, that works for everyone?
    >
    > Personally, I can do this time.
    >
    >  love to hear more inputs.
    >
    > On Wed, Sep 2, 2020 at 10:16 AM Pratyaksh Sharma 
    > wrote:
    >
    > > Hi everyone,
    > >
    > > Currently we are having weekly sync ups between 9 PM - 10 PM PST on
    > > tuesdays. Since I have switched my job last to last month (in India),
    > this
    > > time is exactly clashing with the daily standup time at my current org.
    > > This is the reason I have not been able to attend the syncups for quite
    > > some time.
    > >
    > > Hence just wanted to check with everyone if we could move the sync up
    > time
    > > to 1 hour before, i.e have it from 8 PM - 9 PM every tuesday? Please let
    > me
    > > know if this is suitable.
    > >
    >

  

Re: schema compatibility check and change column type

2020-09-07 Thread Balaji Varadarajan
 Hi Ji, 
Moving this discussion to https://github.com/apache/hudi/issues/2063 which you 
have opened. I have added a possible workaround in the comments. Please try it 
out and respond in the issue. 
Thanks,
Balaji.V

On Monday, September 7, 2020, 10:11:13 AM PDT, Jl Liu (cadl) 
 wrote:  
 
 Thanks~ 

I have another question about schema evolution. I didn't find documentation on the 
homepage or wiki. If I change a column type from INT to LONG, will Hudi rewrite all 
parquet files of the partition? 

I disabled the schema compatibility check and wrote LONG type data to an existing INT 
type Hudi table successfully, but got a “Parquet column cannot be converted in 
file xxx.parquet. Column: [xxx], Expected: int, Found: INT64” error on read. It 
seems that parquet files with different schemas are stored in the same 
directory, and I can't read them together.



> On Sep 8, 2020, at 12:30 AM, Sivabalan wrote:
> 
> Actually, I guess it is a bug in hudi. reader and writer schema arguments
> are called wrongly. (reader is sent for writer and writer is sent for
> reader). Will file a bug. Then, as you expect, INT should be evolvable to
> LONG, where as vice versa is incompatible.
> 
> 
> On Mon, Sep 7, 2020 at 12:17 PM Sivabalan  wrote:
> 
>> Hudi relies on avro's Schema compatability check. Looks like as per avro
>> SchemaCompatability, INT can't be evolved to a LONG, but LONG to INT is
>> allowed.
>> 
>> Check line no 339 here
>> 
>> .
>> Also, check their test case here
>> 
>>  at
>> line 44.
>> 
>> 
>> 
>> On Mon, Sep 7, 2020 at 12:02 PM Prashant Wason 
>> wrote:
>> 
>>> Yes, the schema change looks fine. That would mean it's an issue with the
>>> schema compatibility checker. There are explicit checks for such cases, so I
>>> can't say where the issue lies.
>>> 
>>> I am out on a vacation this week. I will look into this as soon as I am
>>> back.
>>> 
>>> Thanks
>>> Prashant
>>> 
>>> On Sun, Sep 6, 2020, 11:18 AM Vinoth Chandar  wrote:
>>> 
 That does sound like a backwards compatible change.
 @prashant , any ideas here? (since you have the best context on the
>>> schema
 validation checks)
 
 On Thu, Sep 3, 2020 at 8:12 PM cadl  wrote:
 
> Hi All,
> 
> I want to change the type of one column in my COW table, from int to
 long.
> When I set “hoodie.avro.schema.validate = true” and upsert new data
>>> with
> long type, I got a “Failed upsert schema compatibility check” error.
 Dose
> it break backwards compatibility? If I disable
 hoodie.avro.schema.validate,
> I can upsert and read normally.
> 
> 
> code demo:
>>> https://gist.github.com/cadl/be433079747aeea88c9c1f45321cc2eb
> 
> stacktrace:
> 
> 
> org.apache.hudi.exception.HoodieUpsertException: Failed upsert schema
> compatibility check.
>  at
> 
 
>>> org.apache.hudi.table.HoodieTable.validateUpsertSchema(HoodieTable.java:572)
>  at
> 
 
>>> org.apache.hudi.client.HoodieWriteClient.upsert(HoodieWriteClient.java:190)
>  at
> 
 
>>> org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:260)
>  at
> 
 
>>> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
>  at
 org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:125)
>  at
> 
 
>>> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>  at
> 
 
>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at
> 
 
>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at
> 
 
>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at
> 
 
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at
> 
 
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at
> 
 
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at
> 
 
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at
> 
 
>>> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>  at
 org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>  at
> 
 
>>> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at
> 
 
>>> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>  at
> 
 
>>> 
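
The Avro-level expectation Sivabalan mentions -- that an INT writer schema should be readable with a LONG reader schema, and not the other way around -- can be checked directly with Avro's SchemaCompatibility; a minimal sketch (record and field names are arbitrary):

import org.apache.avro.{Schema, SchemaCompatibility}

// int -> long is a legal type promotion for Avro readers, so a reader schema with a
// long field should be compatible with data that was written using an int field.
val writerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"r","fields":[{"name":"c","type":"int"}]}""")
val readerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"r","fields":[{"name":"c","type":"long"}]}""")

val result = SchemaCompatibility.checkReaderWriterCompatibility(readerSchema, writerSchema)
println(result.getType)   // expected COMPATIBLE; swapping reader and writer would not be

Note this only covers the Avro check; the Parquet read error reported above comes from older files in the table that still carry the INT physical type.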

Re: Congrats to our newest committers!

2020-09-03 Thread Balaji Varadarajan
 Udit, Gary, Raymond and Pratyaksh,
Many congratulations :) Well deserved. Looking forward to your continued 
contributions.
Balaji.V
On Thursday, September 3, 2020, 07:19:45 PM PDT, Sivabalan 
 wrote:  
 
 Congrats to all 3. Much deserved and really excited to see more committers


On Thu, Sep 3, 2020 at 9:23 PM leesf  wrote:

> Congrats everyone, well deserved !
>
>
>
> On Fri, Sep 4, 2020 at 5:05 AM, selvaraj periyasamy
>
> wrote:
>
>
>
> > Congrats everyone !
>
> >
>
> > On Thu, Sep 3, 2020 at 1:59 PM Vinoth Chandar  wrote:
>
> >
>
> > > Hi all,
>
> > >
>
> > > I am really excited to share the good news about our new committers on
>
> > the
>
> > > project!
>
> > >
>
> > > *Udit Mehrotra *: Udit has travelled with the project since sept/oct
> last
>
> > > year and immensely helped us making Hudi work well with the AWS
>
> > ecosystem.
>
> > > His most notable contributions are towards driving large parts of the
>
> > > implementation of RFC-12, Hive/Spark integration points. He has also
>
> > helped
>
> > > our users in various tricky issues.
>
> > >
>
> > > *Gary Li:* Gary is a great success story for the project, starting out
> as
>
> > > an early user and steadily grown into a strong contributor, who has
>
> > > demonstrated the ability to take up challenging implementations (e.g
>
> > Impala
>
> > > support, MOR snapshot query impl on Spark), as well as patiently
>
> > > iterate through feedback and evolve the design/code. He has also been
>
> > > helping users on Slack and mailing lists
>
> > >
>
> > > *Raymond Xu:* Raymond has also been a consistent feature on our mailing
>
> > > lists, slack and github. He has been proposing immensely valuable
>
> > > test/tooling improvements. He has contributed a great deal of code as
>
> > well,
>
> > > towards the same. Many many users thank Raymond for the generous help
> on
>
> > > Slack.
>
> > >
>
> > > *Pratyaksh Sharma:* This is yet another great example of user ->
>
> > > contributor -> committer. Pratyaksh has been a great champion for the
>
> > > project, over the past year or so, steadily contributing many
>
> > improvements
>
> > > around the Delta Streamer tool.
>
> > >
>
> > > Please join me in, congratulating them on this well deserved milestone!
>
> > >
>
> > > Onwards and upwards,
>
> > > Vinoth
>
> > >
>
> >
>
> --
Regards,
-Sivabalan  

Re: Coding guidelines

2020-09-02 Thread Balaji Varadarajan
 +1. All current and future contributors/committers need to read this.
Balaji.V
On Wednesday, September 2, 2020, 01:11:46 AM PDT, vino yang 
 wrote:  
 
 +1 to have the coding guidelines.

Left some comments.

Best,
Vino

On Wed, Sep 2, 2020 at 9:51 AM, Vinoth Chandar wrote:

> Hello all,
>
> Put together a list to formalize the things we follow in code review
> process today. Please chime in on the PR review, for comments.
>
> https://github.com/apache/hudi/pull/2061
>
>
> Thanks
> Vinoth
>  

Re: [DISCUSS] Formalizing the release process

2020-09-01 Thread Balaji Varadarajan
 
+1 on the process.
Balaji.V

On Tuesday, September 1, 2020, 04:56:55 PM PDT, Gary Li 
 wrote:  
 
 +1 
Gary Li

From: Bhavani Sudha 
Sent: Wednesday, September 2, 2020 3:11:06 AM
To: us...@hudi.apache.org 
Cc: dev@hudi.apache.org 
Subject: Re: [DISCUSS] Formalizing the release process

+1 on the release 
process formalization.

On Tue, Sep 1, 2020 at 10:21 AM Vinoth Chandar  wrote:

> Hi all,
>
> Love to start a discussion around how we can formalize the release
> process, timelines more so that we can ensure timely and quality releases.
>
> Below is an outline of an idea that was discussed in the last community
> sync (also in the weekly sync notes).
>
> - We will do a "feature driven" major version release, every 3 months or
> so. i.e going from version x.y to x.y+1. The idea here is this ships once
> all the committed features are code complete, tested and verified.
> - We keep doing patches, bug fixes and usability improvements to the
> project always. So, we will also do a "time driven" minor version release
> x.y.z → x.y.z+1 every month or so
> - We will always be releasing from master and thus major release features
> need to be guarded by flags, on minor versions.
> - We will try to avoid patch releases. i.e cherry-picking a few commits
> onto an earlier release version. (during 0.5.3 we actually found the
> cherry-picking of master onto 0.5.2 pretty tricky and even error-prone).
> Some cases, we may have to just make patch releases. But only extenuating
> circumstances. Over time, with better tooling and a larger community, we
> might be able to do this.
>
> As for the major release planning process.
>
>    - PMC/Committers can come up with an initial list sourced based on
>    user asks, support issue
>    - List is shared with the community, for feedback. community can
>    suggest new items, re-prioritizations
>    - Contributors are welcome to commit more features/asks, (with due
>    process)
>
> I would love to hear +1s, -1s and also any new, completely different ideas
> as well. Let's use this thread to align ourselves.
>
> Once we align ourselves, there are some release certification tools that
> need to be built out. Hopefully, we can do this together. :)
>
>
> Thanks
> Vinoth
>
  

Re: HUDI-1232

2020-09-01 Thread Balaji Varadarajan
 Depending on the ordering of the jars is messy, but if it works for you as a 
temporary measure, it should be OK :)
Balaji.V
On Tuesday, September 1, 2020, 12:44:23 AM PDT, selvaraj periyasamy 
 wrote:  
 
 Thanks Balaji. Since upgrading is not an immediate solution in a shared
cluster, I tried a workaround. I added the
org.apache.hudi.hadoop.HoodieROTablePathFilter
class to a common project module, added the caching logic, created a
jar, and then added common.jar before the hudi jar.  It is now able to use
the custom class and takes care of caching. I can manage with this until we
upgrade.

spark2-submit --jars
/home/selva/common.jar,/home/selva/hudi-spark-bundle-0.5.0-incubating.jar
--conf spark.sql.hive.convertMetastoreParquet=false --conf
'spark.serializer=org.apache.spark.serializer.KryoSerializer' --master yarn
--deploy-mode client --driver-memory 4g --executor-memory 10g
--num-executors 200 --executor-cores 1  --conf
spark.executor.memoryOverhead=4096 --conf
spark.shuffle.service.enabled=true  --class
com.test.cdp.reporting.trr.TRREngine
/home/seperiya/transformation-engine.jar

Thanks,
Selva

On Sat, Aug 29, 2020 at 12:55 PM Balaji Varadarajan
 wrote:

>  Hi Selvaraj,
> Yes, you are right. Sorry for the confusion. As mentioned in the release
> notes, Spark 2.4.4 runtime is needed although I dont remember what problem
> you will encounter with Spark 2.3.3. I think it will be a worthwhile
> exercise for you to upgrade to Spark 2.4.4 and Hudi latest versions as we
> had been and continuing to improve performance in Hudi :) For instance, the
> very next release will have consolidated metadata which would avoid file
> listing in the first place.
> Thanks,
> Balaji.V
>
> On Saturday, August 29, 2020, 11:09:25 AM PDT, selvaraj
> periyasamy  wrote:
>
>  Thanks Balaji,
>
> I am looking into the steps to upgrade to 0.6.0. I noticed the below
> content in 0.5.1 release notes here https://hudi.apache.org/releases.html.
> It says the runtime spark version must be 2.4+. Little confused now. Could
> you shed more light on this?
> Release HighlightsPermalink
> <https://hudi.apache.org/releases.html#release-highlights-3>
>
>  - Dependency Version Upgrades
>      - Upgrade from Spark 2.1.0 to Spark 2.4.4
>      - Upgrade from Avro 1.7.7 to Avro 1.8.2
>      - Upgrade from Parquet 1.8.1 to Parquet 1.10.1
>      - Upgrade from Kafka 0.8.2.1 to Kafka 2.0.0 as a result of updating
>      spark-streaming-kafka artifact from 0.8_2.11/2.12 to 0.10_2.11/2.12.
>  - *IMPORTANT* This version requires your runtime spark version to be
>  upgraded to 2.4+.
>
> Thanks,
> Selva
>
> On Sat, Aug 29, 2020 at 1:16 AM Balaji Varadarajan
>  wrote:
>
> >  From the hudiLogs.txt, I find only HoodieROTablePathFiler related logs
> > repeating which suggests this is the read side. So, we recommend you
> using
> > latest version. I tried 2.3.3 and ran quickstart without issues. Give it
> a
> > shot and let us know if there are any issues.
> > Balaji.V
> >    On Friday, August 28, 2020, 04:42:51 PM PDT, selvaraj periyasamy <
> > selvaraj.periyasamy1...@gmail.com> wrote:
> >
> >  Thanks Balaji. My hadoop environment is still running with spark 2.3.
> Can
> > I
> > run 0.6.0 on spark 2.3?
> >
> > For issue 1: I am able to manage it with spark glob read, instead of
> > hive read. With this approach, I am good with this approach.
> >  Issue 2: I see the performance issue while writing into the COW table.
> > This is purely write and no read involved.  Attached the write logs (
> > hudiLogs.txt) in the ticket . The more and more my target has
> partitions, I
> > am noticing a spike in write time.  The fix #1919 mentioned is applicable
> > for writing as well.
> >
> > On Fri, Aug 28, 2020 at 3:28 PM vbal...@apache.org 
> > wrote:
> >
> > >  Hi Selvaraj,
> > > We had fixed relevant perf issue in  0.6.0 ([HUDI-1144] Speedup spark
> > read
> > > queries by caching metaclient in HoodieROPathFilter (#1919)). Can you
> > > please try 0.6.0
> > > Balaji.V
> > >    On Friday, August 28, 2020, 01:31:42 PM PDT, selvaraj periyasamy <
> > > selvaraj.periyasamy1...@gmail.com> wrote:
> > >
> > >  I have created this https://issues.apache.org/jira/browse/HUDI-1232
> > > ticket
> > > for tracking a couple of issues.
> > >
> > > One of the concerns I have in my use cases is that, have a COW type
> table
> > > name called TRR.  I see below pasted logs rolling for all individual
> > > partitions even though my write is on only a couple of partitions  and
> it
> > > takes upto 4 to 5  mins. I pasted only a few of them alone. I am
> > wonder

Re: DevX, Test infra Rgdn

2020-08-31 Thread Balaji Varadarajan
 +1. This would be a great contribution as all developers will benefit from 
this work. 
On Monday, August 31, 2020, 08:07:08 AM PDT, Vinoth Chandar 
 wrote:  
 
 +1 this is a great way to also ramp on the code base

On Sun, Aug 30, 2020 at 8:00 AM Sivabalan  wrote:

> As Hudi matures as a project, we need to get our devX and test infra rock
> solid. Availability of test utils and base classes for ease of writing more
> tests, stable integration tests, ease of debuggability, micro benchmarks,
> performance test infra, automating checkstyle formatting, nightly snapshot
> builds and so on.
>
> We have identified and categorized these into different areas as below.
>
> - Test fixes and some clean up. // There are a lot of jira tickets
> lying around in this section.
> - Test refactoring. // For ease of development, and reduce clutter, we need
> to work on refactoring test infra like having more test utils, base classes
> etc.
> - More tests to improve coverage in some areas.
> - CI stability and ease of debugging integration tests.
> - Checkstyle, sl4j, warnings, spotless, etc.
> - Micro benchmarks. // add benchmarking framework to hudi. and then
> identify regressions on any key paths.
> - Long running test suite
> - Config clean ups in hudi client
> - Perf test environment
> - Nightly builds
>
> As we plan out work in each of these sections, we are looking for help from
> the community in getting these done. Plan is to put together a few umbrella
> tickets for each of these areas and will have a coordinator. Coordinator
> will be one who has expertise in the area of interest. Coordinator will
> plan out the work in their resp area and will help drive the initiative
> with help from the community depending on who volunteers to help out.
>
> I understand the list is huge. Some work areas will be well defined and
> should be able to get it done if we allocate enough time and resources. But
> some are exploratory in nature and need some initial push to get the ball
> rolling.
>
> Very likely some of the work items in these would be well defined and
> should be easy for new folks to contribute. We are not really having any
> target timeframe in mind(as we had 1 month for bug bash), but would like to
> get concrete work items done in decent time and have others ready by the
> next major release(for eg, perf test env) depending on resources.
>
> Let us know if you would be interested to help our community in this
> regard.
>
> --
> Regards,
> -Sivabalan
>
  

Re: Hudi Writer vs Spark Parquet Writer - Sync

2020-08-31 Thread Balaji Varadarajan
 Hi Felix, 
For read side performance, we are focussed on adding clustering support 
(https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance)
 and consolidated metadata 
(https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements)
 in the next release. The clustering support is much more generic and provides 
capability to dynamically organize the data to suit query performance. Please 
take a look at those RFCs. 
Balaji.V
On Sunday, August 30, 2020, 02:16:29 PM PDT, Kizhakkel Jose, Felix 
 wrote:  
 
 Hello All,

Hive has the bucketBy feature, and Spark is going to add Hive-style bucketBy 
support for data sources; once it's implemented, it is going to largely benefit 
read performance. Since Hudi takes a different path while writing parquet data, 
are we planning to add bucketBy functionality? Spark keeps adding writer features 
that improve read performance, so, given that Hudi has its own writer, are we 
keeping track of these new Spark features so that the Hudi writer does not 
greatly differ from the Spark parquet writer or end up lacking features?

Regards,
Felix K Jose



  

Re: HUDI-1232

2020-08-29 Thread Balaji Varadarajan
 Hi Selvaraj,
 Yes, you are right. Sorry for the confusion. As mentioned in the release notes, 
a Spark 2.4.4 runtime is needed, although I don't remember what problem you will 
encounter with Spark 2.3.3. I think it will be a worthwhile exercise for you to 
upgrade to Spark 2.4.4 and the latest Hudi version, as we have been and are continuing 
to improve performance in Hudi :) For instance, the very next release will have 
consolidated metadata which would avoid file listing in the first place. 
Thanks,
Balaji.V

On Saturday, August 29, 2020, 11:09:25 AM PDT, selvaraj 
periyasamy  wrote:  
 
 Thanks Balaji,

I am looking into the steps to upgrade to 0.6.0. I noticed the below
content in 0.5.1 release notes here https://hudi.apache.org/releases.html.
It says the runtime spark version must be 2.4+. Little confused now. Could
you shed more light on this?
Release HighlightsPermalink
<https://hudi.apache.org/releases.html#release-highlights-3>

  - Dependency Version Upgrades
      - Upgrade from Spark 2.1.0 to Spark 2.4.4
      - Upgrade from Avro 1.7.7 to Avro 1.8.2
      - Upgrade from Parquet 1.8.1 to Parquet 1.10.1
      - Upgrade from Kafka 0.8.2.1 to Kafka 2.0.0 as a result of updating
      spark-streaming-kafka artifact from 0.8_2.11/2.12 to 0.10_2.11/2.12.
  - *IMPORTANT* This version requires your runtime spark version to be
  upgraded to 2.4+.

Thanks,
Selva

On Sat, Aug 29, 2020 at 1:16 AM Balaji Varadarajan
 wrote:

>  From the hudiLogs.txt, I find only HoodieROTablePathFiler related logs
> repeating which suggests this is the read side. So, we recommend you using
> latest version. I tried 2.3.3 and ran quickstart without issues. Give it a
> shot and let us know if there are any issues.
> Balaji.V
>    On Friday, August 28, 2020, 04:42:51 PM PDT, selvaraj periyasamy <
> selvaraj.periyasamy1...@gmail.com> wrote:
>
>  Thanks Balaji. My hadoop environment is still running with spark 2.3. Can
> I
> run 0.6.0 on spark 2.3?
>
> For issue 1: I am able to manage it with spark glob read, instead of
> hive read. With this approach, I am good with this approach.
>  Issue 2: I see the performance issue while writing into the COW table.
> This is purely write and no read involved.  Attached the write logs (
> hudiLogs.txt) in the ticket . The more and more my target has partitions, I
> am noticing a spike in write time.  The fix #1919 mentioned is applicable
> for writing as well.
>
> On Fri, Aug 28, 2020 at 3:28 PM vbal...@apache.org 
> wrote:
>
> >  Hi Selvaraj,
> > We had fixed relevant perf issue in  0.6.0 ([HUDI-1144] Speedup spark
> read
> > queries by caching metaclient in HoodieROPathFilter (#1919)). Can you
> > please try 0.6.0
> > Balaji.V
> >    On Friday, August 28, 2020, 01:31:42 PM PDT, selvaraj periyasamy <
> > selvaraj.periyasamy1...@gmail.com> wrote:
> >
> >  I have created this https://issues.apache.org/jira/browse/HUDI-1232
> > ticket
> > for tracking a couple of issues.
> >
> > One of the concerns I have in my use cases is that, have a COW type table
> > name called TRR.  I see below pasted logs rolling for all individual
> > partitions even though my write is on only a couple of partitions  and it
> > takes upto 4 to 5  mins. I pasted only a few of them alone. I am
> wondering
> > , in the future , I will have 3 years worth of data, and writing will be
> > very slow every time I write into only a couple of partitions.
> >
> > 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
> >
> >
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
> > type COPY_ON_WRITE from
> > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > 20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
> > java.util.stream.ReferencePipeline$Head@fed0a8b
> > 20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
> > partition :20200714/01, #FileGroups=1
> > 20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
> > NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=1
> > 20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
> > from base path:
> > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1
> > files under
> > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/01
> > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading
> HoodieTableMetaClient
> > from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > 20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
> > [hdfs://oprhqanameservice], Config:[Configuration: core-default.

Re: HUDI-1232

2020-08-29 Thread Balaji Varadarajan
 From the hudiLogs.txt, I find only HoodieROTablePathFilter related logs 
repeating, which suggests this is the read side. So, we recommend using the 
latest version. I tried 2.3.3 and ran the quickstart without issues. Give it a shot 
and let us know if there are any issues.
Balaji.V
On Friday, August 28, 2020, 04:42:51 PM PDT, selvaraj periyasamy 
 wrote:  
 
 Thanks Balaji. My hadoop environment is still running with spark 2.3. Can I
run 0.6.0 on spark 2.3?

For issue 1: I am able to manage it with spark glob read, instead of
hive read. With this approach, I am good with this approach.
 Issue 2: I see the performance issue while writing into the COW table.
This is purely write and no read involved.  Attached the write logs (
hudiLogs.txt) in the ticket . The more and more my target has partitions, I
am noticing a spike in write time.  The fix #1919 mentioned is applicable
for writing as well.

On Fri, Aug 28, 2020 at 3:28 PM vbal...@apache.org 
wrote:

>  Hi Selvaraj,
> We had fixed relevant perf issue in  0.6.0 ([HUDI-1144] Speedup spark read
> queries by caching metaclient in HoodieROPathFilter (#1919)). Can you
> please try 0.6.0
> Balaji.V
>    On Friday, August 28, 2020, 01:31:42 PM PDT, selvaraj periyasamy <
> selvaraj.periyasamy1...@gmail.com> wrote:
>
>  I have created this https://issues.apache.org/jira/browse/HUDI-1232
> ticket
> for tracking a couple of issues.
>
> One of the concerns I have in my use cases is that, have a COW type table
> name called TRR.  I see below pasted logs rolling for all individual
> partitions even though my write is on only a couple of partitions  and it
> takes upto 4 to 5  mins. I pasted only a few of them alone. I am wondering
> , in the future , I will have 3 years worth of data, and writing will be
> very slow every time I write into only a couple of partitions.
>
> 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
>
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> 20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
> type COPY_ON_WRITE from
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> 20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
> java.util.stream.ReferencePipeline$Head@fed0a8b
> 20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
> partition :20200714/01, #FileGroups=1
> 20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
> NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=1
> 20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
> from base path:
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1
> files under
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/01
> 20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
> from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> 20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
> [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
> core-site.xml, mapred-default.xml, m
> apred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml,
> hdfs-site.xml], FileSystem:
> [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
> ugi=svchdc36q@V
> ISA.COM (auth:KERBEROS)]]]
> 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
>
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> 20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
> type COPY_ON_WRITE from
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> 20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
> java.util.stream.ReferencePipeline$Head@285c67a9
> 20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
> partition :20200714/02, #FileGroups=1
> 20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
> NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
> from base path:
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1
> files under
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/02
> 20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
> from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> 20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
> [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
> core-site.xml, mapred-default.xml, m
> apred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml,
> hdfs-site.xml], FileSystem:
> [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
> ugi=svchdc36q@V
> ISA.COM (auth:KERBEROS)]]]
> 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
>
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> 20/08/27 02:08:22 INFO 

Re: Null-value for required field Error

2020-08-23 Thread Balaji Varadarajan
Can you open an issue and we will look into this there. To confirm the
theory, can you enable INFO logging and paste the output with the line:

"Registered avro schema : ..."

Can you also print the schema using inputDF.printSchema()

Thanks,
Balaji.V

On Fri, Aug 21, 2020 at 12:53 PM selvaraj periyasamy <
selvaraj.periyasamy1...@gmail.com> wrote:

> Thanks Balaji.
>
> could you please provide more info on how to get it done and pass it to
> hudi?
>
> Thanks,
> Selva
>
> On Fri, Aug 21, 2020 at 12:33 PM Balaji Varadarajan
>  wrote:
>
> >  Hi Selvaraj,
> > Even though the incoming batch has non null values for the new column,
> > existing data do not have this column. So, you need to make sure the avro
> > schema has the new column to be nullable and be backwards compatible.
> > Balaji.V
> > On Friday, August 21, 2020, 10:06:40 AM PDT, selvaraj periyasamy <
> > selvaraj.periyasamy1...@gmail.com> wrote:
> >
> >  Hi,
> >
> > with 0.5.0 version of Hudi, I am using COW table type, which is
> > partitioned by mmdd format . We already have a table with
> Array
> > type columns and data populated. And then we are now trying to add a new
> > column ("rule_profile_id_list") in dataframes and while trying to write ,
> > getting below exception the below error message.  I am making sure that
> > DataFrame that I pass is having non null value as it is a non-nullable
> > column as per schema definition in dataframe.  I don't use "--conf
> > spark.sql.hive.convertMetastoreParquet=false" because I am already
> setting
> >  below code snippet handled in my code.
> >
> >
> >
> sparkSession.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
> > classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
> > classOf[org.apache.hadoop.fs.PathFilter]);
> >
> >
> > Could someone help me to resolve this error?
> >
> > 20/08/21 08:38:30 WARN TaskSetManager: Lost task 8.0 in stage 151.0 (TID
> > 31217, sl73caehdn0811.visa.com, executor 10):
> > org.apache.hudi.exception.HoodieUpsertException: Error upserting
> bucketType
> > UPDATE for partition :8
> > at
> >
> >
> org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:264)
> > at
> >
> >
> org.apache.hudi.HoodieWriteClient.lambda$upsertRecordsInternal$507693af$1(HoodieWriteClient.java:428)
> > at
> >
> >
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
> > at
> >
> >
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
> > at
> >
> >
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
> > at
> >
> >
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
> > at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> > at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> > at
> >
> >
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1109)
> > at
> >
> >
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1083)
> > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1018)
> > at
> >
> >
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1083)
> > at
> >
> >
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:809)
> > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
> > at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> > at org.apache.spark.scheduler.Task.run(Task.scala:109)
> > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> > at
> >
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker

Re: [VOTE] Release 0.6.0, release candidate #1

2020-08-22 Thread Balaji Varadarajan
+1(binding)
1. Ran long running structured streaming writes on fake data and verified
compactions and ingestion is happening without errors.
2. Ran both scala and python based quickstart without any errors. There was
an issue in the documented quickstart steps (not in hudi) for python
example. Will send a doc PR shortly.
3. Release Validation script passed locally.

```
MacBook-Pro:scripts balaji.varadarajan$
./release/validate_staged_release.sh --release=0.6.0 --rc_num=1
/tmp/validation_scratch_dir_001 ~/code/oss/upstream_hudi/scripts
Checking Checksum of Source Release
Checksum Check of Source Release - [OK]

  % Total% Received % Xferd  Average Speed   TimeTime Time
 Current
 Dload  Upload   Total   SpentLeft
 Speed
100 30225  100 302250 0  46215  0 --:--:-- --:--:-- --:--:--
46145
Checking Signature
Signature Check - [OK]

Checking for binary files in source release
No Binary Files in Source Release? - [OK]

Checking for DISCLAIMER
DISCLAIMER file exists ? [OK]

Checking for LICENSE and NOTICE
License file exists ? [OK]
Notice file exists ? [OK]

Performing custom Licensing Check
Licensing Check Passed [OK]

Running RAT Check
RAT Check Passed [OK]

~/code/oss/upstream_hudi/scripts
MacBook-Pro:scripts balaji.varadarajan$ echo $?
0
MacBook-Pro:scripts balaji.varadarajan$
```


On Sat, Aug 22, 2020 at 8:27 AM Vinoth Chandar  wrote:

> +1 (binding)
>
> - Ran the rc checks, I typically do
> - Tested a smoke test on both cow, mor tables
> - by running lot commits over longer period of time,
> - verifying the state of the dataset
>- count validation match.
>
> On Sat, Aug 22, 2020 at 6:08 AM leesf  wrote:
>
> > +1 (binding)
> > - mvn clean package -DskipTests OK
> > - ran quickstart guide OK (still get the exception ERROR
> > view.PriorityBasedFileSystemView: Got error running preferred function.
> > Trying secondary
> > org.apache.hudi.exception.HoodieRemoteException: 192.168.1.102:56544
> > failed
> > to respond
> > at
> >
> >
> org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestBaseFile(RemoteHoodieTableFileSystemView.java:426)
> > at
> >
> >
> org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:96)
> > at
> >
> >
> org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestBaseFile(PriorityBasedFileSystemView.java:139),
> > but still ran successfully)
> > - writing demos to sync to hive & dla OK
> >
> > Sivabalan  于2020年8月22日周六 上午5:29写道:
> >
> > > +1 (non binding)
> > > - Compilation successful
> > > - Ran validation script which verifies checksum, keys, license, etc.
> > > - Ran quick start
> > > - Ran some tests from intellij.
> > >
> > > JFYI: when I ran mvn test, encountered some test failures due to
> multiple
> > > spark contexts. Have raised a ticket here
> > > . But all tests are
> > > succeeding in CI and I could run from within intellij. So, not blocking
> > the
> > > RC.
> > >
> > > Checking Checksum of Source Release-e Checksum Check of Source Release
> -
> > > [OK]
> > >   % Total% Received % Xferd  Average Speed   TimeTime Time
> > > Current
> > >Dload  Upload
> > > Total   SpentLeft  Speed
> > > 100 30225  100 302250 0   106k  0 --:--:-- --:--:--
> --:--:--
> > > 106k
> > > Checking Signature
> > > -e Signature Check - [OK]
> > > Checking for binary files in source release
> > > -e No Binary Files in Source Release? - [OK]
> > > Checking for DISCLAIMER
> > > -e DISCLAIMER file exists ? [OK]
> > > Checking for LICENSE and NOTICE
> > > -e License file exists ? [OK]-
> > > e Notice file exists ? [OK]
> > > Performing custom Licensing Check
> > > -e Licensing Check Passed [OK]
> > > Running RAT Check
> > > -e RAT Check Passed [OK]
> > >
> > >
> > >
> > > On Fri, Aug 21, 2020 at 12:37 PM Bhavani Sudha <
> bhavanisud...@gmail.com>
> > > wrote:
> > >
> > > > Vino yang,
> > > >
> > > > I am working on the release blog. While the RC is in progress, the
> doc
> > > and
> > > > site updates are happening this week.
> > > >
> > > > Thanks,
> > > > Sudha
> > > >
> > > > On Fri, Aug 21, 2020 at 4:23 AM vino yang 
> > wrote:
> > > >
> > > > > +1 from my side
> > > > >
> > > > > I checked:
> > > > >
> > > > > - ran `mvn clean package` [OK]
> > > > > - ran `mvn test` in my local [OK]
> > > > > - signature [OK]
> > > > >
> > > > > BTW, where is like of the release blog?
> > > > >
> > > > > Best,
> > > > > Vino
> > > > >
> > > > > > On Thu, Aug 20, 2020 at 12:03 PM, Bhavani Sudha wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > > Please review and vote on the release candidate #1 for the
> version
> > > > 0.6.0,
> > > > > > as follows:
> > > > > > [ ] +1, Approve the release
> > > > > > [ ] -1, Do not approve the release (please provide specific
> > comments)
> > > > > >
> > > > > > The complete staging area is available for your 

Re: Incremental query on partition column

2020-08-21 Thread Balaji Varadarajan
 Thanks for the detailed email, David. We discussed this in last week's 
community meeting, and Vinoth had ideas on how to implement this. This is 
something that can be supported by the timeline layout that Hudi has. It would 
be a new feature (new write operation) that basically appends the delete marker 
to all versions of the data instead of just the latest. 
Opened a Jira : https://issues.apache.org/jira/browse/HUDI-1212
Balaji.V



On Friday, August 14, 2020, 06:12:26 AM PDT, David Rosalia 
 wrote:  
 
 Hello,

I am Siva's colleague and I am working on the problem below as well.

I would like to describe what we are trying to achieve with Hudi as well as our 
current way of working and our GDPR and "Right To Be Forgotten " compliance 
policies.

Our requirements :
- We wish to apply a strict interpretation of the RTBF.  In other words, when 
we remove a person's data, it should be throughout the historical data and not 
just the latest snapshot.
- We wish to use Hudi to reduce our storage requirements using upserts and 
don't want to have duplicates between commits.
- We wish to retain history for persons who have not requested to be forgotten 
and therefore we do not want to delete commit files from the history as some 
have proposed.

We have tried a couple of solutions, but so far without success :
- replay the data omitting the data of the persons who have requested to be 
forgotten.  We wanted to manipulate the commit times to rebuild the history.
We found that we couldn't manipulate the commit times and retain the history.

- replay the data omitting the data of the persons who have requested to be 
forgotten, but writing to a date-based partition folder using the 
"partitionpath" parameter.
We found that commits using upserts between the partitionpath folders, do not 
ignore data that is unchanged between 2 commit dates as when using the default 
commit file system, so we will not save on our storage or speed up our  
processing using this technique.

So basically we would like to find a way to apply a strict RTBF, GDPR, maintain 
history and time-travel (large history) and save storage space using Hudi.

Can anyone see a way to achieve this?

Kind Regards,
David Rosalia


Get Outlook for Android


From: Vinoth Chandar 
Sent: Friday, August 14, 2020 8:26:22 AM
To: dev@hudi.apache.org 
Subject: Re: Incremental query on partition column

Hi,

On re-ingesting, do you mean to say you want to overwrite the table, while
not getting the changes in the incremental query?  This has not come up
before.
As you can imagine, it'd tricky scenario, where we need some special
handling/action type introduced.

yes, yes on the next two questions.
Commit. time can be controlled if using the HoodieWriteClient API, not on
datasource/deltastreamer atm

On Thu, Aug 13, 2020 at 12:13 AM Sivaprakash 
wrote:

> Hi,
>
>
> What is the design that can be used/implemented when we re-ingest the data
> without affecting incremental query?
>
>
>
>    - Is it possible to maintain a delta dataset across partitions (
>    hoodie.datasource.write.partitionpath.field) ? In my case it is a date.
>    - Can I do a snapshot query on across and specific partitions?
>    - Or, possible to control Hudi's commit time?
>
>
> Thanks
>  
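
For reference, an incremental pull (only records changed after a given commit instant) and a snapshot read restricted to specific partitions are expressed roughly as below on the Spark DataSource; the option names follow 0.6.0-era releases, and the instant time, paths and partition glob are placeholders.

// Incremental query: return only records written after beginInstant.
val beginInstant = "20200801000000"                    // placeholder commit instant from the timeline
val incDf = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", beginInstant)
  .load("s3://bucket/table")

// Snapshot query limited to specific partitions via the load path glob.
val snapDf = spark.read.format("hudi").load("s3://bucket/table/2020-07-*")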

Re: Null-value for required field Error

2020-08-21 Thread Balaji Varadarajan
 Hi Selvaraj,
Even though the incoming batch has non-null values for the new column, existing 
data does not have this column. So, you need to make sure the new column is 
nullable in the Avro schema so that it stays backwards compatible.
Balaji.V
On Friday, August 21, 2020, 10:06:40 AM PDT, selvaraj periyasamy 
 wrote:  
 
 Hi,

with Hudi version 0.5.0, I am using the COW table type, which is
partitioned by mmdd format. We already have a table with Array
type columns and data populated. We are now trying to add a new
column ("rule_profile_id_list") to the dataframes, and while trying to write we are
getting the error message below.  I am making sure that the
DataFrame I pass has a non-null value, as it is a non-nullable
column per the schema definition in the dataframe.  I don't use "--conf
spark.sql.hive.convertMetastoreParquet=false" because the
 below code snippet is already set in my code.

sparkSession.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
classOf[org.apache.hadoop.fs.PathFilter]);


Could someone help me to resolve this error?

20/08/21 08:38:30 WARN TaskSetManager: Lost task 8.0 in stage 151.0 (TID
31217, sl73caehdn0811.visa.com, executor 10):
org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType
UPDATE for partition :8
at
org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:264)
at
org.apache.hudi.HoodieWriteClient.lambda$upsertRecordsInternal$507693af$1(HoodieWriteClient.java:428)
at
org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
at
org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
at
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1109)
at
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1083)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1018)
at
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1083)
at
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:809)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieException:
org.apache.hudi.exception.HoodieException:
java.util.concurrent.ExecutionException: java.lang.RuntimeException:
Null-value for required field: rule_profile_id_list
at
org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:202)
at
org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:178)
at
org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:257)
... 28 more
Caused by: org.apache.hudi.exception.HoodieException:
java.util.concurrent.ExecutionException: java.lang.RuntimeException:
Null-value for required field: rule_profile_id_list
at
org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:142)
at
org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:200)
... 30 more
Caused by: java.util.concurrent.ExecutionException:
java.lang.RuntimeException: Null-value for required field:
rule_profile_id_list
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at
org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:140)
... 31 more
Caused by: java.lang.RuntimeException: Null-value for required field:
rule_profile_id_list
at

Re: I want to contribute to Apache Hudi.

2020-08-20 Thread Balaji Varadarajan
 Welcome Trevor to Hudi community. It looks like you have been added to the 
contributor role.
Balaji.V
On Thursday, August 20, 2020, 11:07:47 AM PDT, wowtua...@gmail.com 
 wrote:  
 
 
I want to contribute to Apache Hudi.

Would you please give me the permission as a contributor ?

My JIRA username is Trevorzhang.


wowtua...@gmail.com
  

Re: [DISCUSS] Support Spark Structured Streaming read from Hudi table

2020-08-20 Thread Balaji Varadarajan
 Hi linshan,
Sorry for the delay in responding. It is better to discuss code changes over a 
draft PR. Can you open one and tag us there? At a high level, it looks like you 
are using the Spark Datasource v2 APIs, while currently the structured streaming 
write is implemented using the V1 API. Let's discuss this over a PR. We have a few 
folks (Gary, Udit) who know this part better than me; they can help you 
out here.
Balaji.V

On Tuesday, August 18, 2020, 08:03:01 PM PDT, linshan  
wrote:  
 
 Hi team,
I need help. After a few days of thinking, trial and error, I have no ideas left. 
I wrote the relevant information on this page; please follow this 
link (https://issues.apache.org/jira/browse/HUDI-1126).
  
Best,
linshan-ma  

Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

2020-08-20 Thread Balaji Varadarajan
 +1. This should be good to have as an option. If everybody agrees, please go 
ahead with RFC and we can discuss details there.
Balaji.V
On Tuesday, August 18, 2020, 04:37:18 PM PDT, Abhishek Modi 
 wrote:  
 
 Hi everyone!

I was hoping to discuss adding support for making `_hoodie_record_key` a
virtual column :)

Context:
Currently, _hoodie_record_key is written to DFS, as a column in the Parquet
file. In our production systems at Uber however, _hoodie_record_key
contains data that can be found in a different column (or set of columns).
This means that we are storing duplicated data.

Proposal:
In the interest of improving storage efficiency, we could add confs /
abstract classes that can construct the _hoodie_record_key given other
columns. That way we do not have to store duplicated data on DFS.
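
As a rough illustration of the proposal (hypothetical class and method names, not an existing Hudi API), the conf/abstract-class idea could look roughly like this:

// Hypothetical sketch only: construct the record key from other columns at read
// time instead of persisting a separate _hoodie_record_key column on DFS.
abstract class VirtualRecordKeyConstructor extends Serializable {
  // Build the record key from the configured source columns of a single row.
  def constructKey(columnValues: Map[String, Any]): String
}

class ConcatenatingKeyConstructor(fields: Seq[String]) extends VirtualRecordKeyConstructor {
  override def constructKey(columnValues: Map[String, Any]): String =
    fields.map(f => s"$f:${columnValues.getOrElse(f, "__null__")}").mkString(",")
}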

Any thoughts on this?

Best,
Modi
  

Re: Kafka Hudi pipeline design

2020-07-21 Thread Balaji Varadarajan
 Please see answers inline...

On Sunday, July 19, 2020, 10:08:09 PM PDT, Lian Jiang 
 wrote:  
 
 Hi,
I have a Kafka topic using a Kafka S3 connector to dump data into S3 hourly in 
Parquet format. These Parquet files are partitioned by ingestion time, and each 
record has fields which are deeply nested JSONs. Each record is a monolithic blob 
containing multiple events, each with its own event time. This causes two 
issues: 1. slow queries by event time; 2. hard to use due to many levels of 
exploding. I plan to use the design below to solve these problems.

In this design, I still use the S3 Parquet data dumped by the Kafka S3 connector as 
a backfill for the Hudi pipeline. This is because the S3 connector pipeline is 
easier than the Hudi pipeline to set up and will work before the Hudi pipeline 
is working. Also, the S3 connector pipeline may be more reliable than the Hudi 
pipeline due to potential bugs in the delta streamer. The delta streamer will 
decompose the monolithic Kafka record into multiple event streams. Each event 
stream is written into one Hudi dataset partition and sorted by its 
corresponding event time. Such Hudi datasets are synced with Hive, which is 
exposed for user queries so that users don't need to care whether the underlying 
table format is Parquet or Hudi. Hopefully, such a design improves query 
performance, because the dataset is partitioned and sorted by event times as 
opposed to Kafka ingest time. The user experience is also improved by querying 
the extracted events.

Let us know if there are any issues with deltastreamer that prevent it from being used in 
the first stage. If you want to faithfully append event stream logs to S3 
before you materialize them in a different order, you can try the "insert" mode in 
Hudi, which would give you small-file-size handling. 
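
A minimal sketch of the "insert" write mode via the Spark datasource (the table path and field names are assumptions, and eventsDf stands in for one decomposed event stream):

// Sketch only: append-style writes with the "insert" operation, letting Hudi
// handle small-file sizing instead of merging against existing records.
eventsDf.write
  .format("org.apache.hudi")
  .option("hoodie.datasource.write.operation", "insert")
  .option("hoodie.datasource.write.recordkey.field", "event_id")        // assumed
  .option("hoodie.datasource.write.partitionpath.field", "event_type")  // assumed
  .mode("append")
  .save("s3a://bucket/hudi/events")                                     // assumed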

Questions:

1. Do you see any issue with the delta streamer handling both streaming and 
backfill at the same time? I know a Hudi dataset cannot be written by multiple 
writing clients simultaneously. Also, I don't want the delta streamer to stop 
handling the streaming data while doing backfill. The delta streamer will use 
dynamic allocation. Assuming the cluster has enough capacity, the load caused by 
backfill should not be an issue.

With 0.6, we are planning to allow multiple writers as long as there is a 
guarantee that writers will be writing to different partitions. I think this 
will fit your requirement and also keep one timeline. 

2. If I want to time travel to a previous day (e.g. 11:00:00 AM PST on the first 
day of last month), how can I keep hudi 1 and hudi 2 (... hudi n) in sync? 
AFAIK, Hudi time travel is done by commit instead of timestamp. Should I do the following:
 a. list the commits of these Hudi datasets,
 b. find the commits that are close to each other and closest to the desired timestamp,
 c. apply time travel for each Hudi dataset.
Is there an easier and more accurate way? Will Hudi support time travel by 
timestamp in the future as Delta Lake does?


Commit time is like a timestamp, although in a specific format (second granularity). It should 
be straightforward to reformat a timestamp to a commit time and then use it in 
the WHERE clause. But I have opened a ticket, 
https://issues.apache.org/jira/browse/HUDI-1116, to track this request. My 
initial thinking is this should not be hard to support. 
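
As a rough sketch of that reformatting idea (the seconds-granularity commit-time pattern, paths, and the active SparkSession `spark` are assumptions; verify against your table's timeline):

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Sketch: turn a human-friendly timestamp into a commit-time-shaped string and
// filter on Hudi's _hoodie_commit_time metadata column in the WHERE clause.
val commitTimeFormat = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")  // assumed format
val asOfCommit = LocalDateTime.of(2020, 7, 1, 11, 0, 0).format(commitTimeFormat)

spark.read.format("org.apache.hudi")
  .load("s3a://bucket/hudi/events/*/*")                               // assumed path and depth
  .where(s"_hoodie_commit_time <= '$asOfCommit'")
  .show()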

Balaji.V  

Re: Date handling in HUDI

2020-07-21 Thread Balaji Varadarajan
 
Gary/Udit,
As you are familiar with this part of it, can you please answer this question?
Thanks,
Balaji.V
On Monday, July 20, 2020, 08:18:16 AM PDT, tanu dua 
 wrote:  
 
 Hi Guys,
May I know how you all handle date and timestamp types in Hudi?
When I set the DataType as Date in the StructType, it gets ingested as an int, but
when I query using Spark SQL I run into the following:

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-17557

So I am not sure if it's only me who faces this. Do I need to change it to String?
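
A small sketch of the String workaround the question alludes to (purely illustrative, assuming an existing inputDf with a column named event_date; not a recommendation from this thread):

import org.apache.spark.sql.functions.{col, date_format}

// Sketch: carry the date as a formatted string so the value read back through
// SQL does not depend on how the engine interprets the underlying int encoding.
val writableDf = inputDf.withColumn("event_date_str", date_format(col("event_date"), "yyyy-MM-dd"))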

Re: Hard Delete

2020-07-17 Thread Balaji Varadarajan
 Hi Sivaprakash,
You can configure the cleaner to clean up the older file versions which still contain the 
records to be deleted. You can take a look at 
https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-WhatdoestheHudicleanerdo
 for more details.
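
A minimal sketch of the cleaner settings involved (the retention values and paths are arbitrary examples; see the FAQ above and the configuration docs for your version):

// Sketch: retain only the most recent file version so older file slices that
// still contain the hard-deleted records are eventually removed by the cleaner.
df.write.format("org.apache.hudi")
  .option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS")
  .option("hoodie.cleaner.fileversions.retained", "1")
  // ... record key, partition path, and table name options omitted ...
  .mode("append")
  .save("s3a://bucket/hudi/table")  // assumed base path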

Balaji.V
On Friday, July 17, 2020, 07:47:55 AM PDT, Sivaprakash 
 wrote:  
 
 Hello

Do we have any option to delete a record from every partition? What I mean is that I
want to completely wipe out a particular record from the complete data set (first
commit, all the changes, delta commits, etc.).

Currently, when I delete, it affects only the last commit, but if I do an
incremental query on the history it still shows up - I want to remove those
occurrences as well. Is that possible?

Thank you !!
  

Re: Handling delta

2020-07-16 Thread Balaji Varadarajan
 Hi Sivaprakash,
Uniqueness of records is determined by the record key you specify to hudi. Hudi 
supports filtering out existing records (by record key). By default, it would 
upsert all incoming records. 
Please look at 
https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoesHudihandleduplicaterecordkeysinaninput
 for information on how to dedupe records based on record key.
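
For illustration, a minimal sketch of an insert that drops records whose key already exists (the drop-duplicates option is a datasource write config; the key field and path are assumptions):

// Sketch: filter out incoming records that already exist in the table (by record
// key) instead of upserting them.
newRecordsDf.write.format("org.apache.hudi")
  .option("hoodie.datasource.write.operation", "insert")
  .option("hoodie.datasource.write.insert.drop.duplicates", "true")
  .option("hoodie.datasource.write.recordkey.field", "id")  // assumed record key field
  .mode("append")
  .save("s3a://bucket/hudi/table")                          // assumed base path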

Balaji.V
On Thursday, July 16, 2020, 04:23:22 AM PDT, Sivaprakash 
 wrote:  
 
  This might be a basic question - I'm experimenting with Hudi (PySpark). I
have used the Insert/Upsert options to write deltas into my data lake. However,
one thing is not clear to me.

Step 1: I write 50 records.
Step 2: I write the same 50 records again, of which only *10 records have been
changed* (I'm using upsert mode and tried with both MERGE_ON_READ and
COPY_ON_WRITE).
Step 3: I was expecting only 10 records to be written, but it writes the whole
50 records. Is this normal behaviour? Does it mean I need to determine
the delta myself and write only those records?

Am I missing something?
  

Re: [DISCUSS] Make delete marker configurable?

2020-06-29 Thread Balaji Varadarajan
+1 


Sent from Yahoo Mail for iPhone


On Monday, June 29, 2020, 5:34 PM, Vinoth Chandar  wrote:

+1 as well. (sorry , for jumping in late)

On Sun, Jun 28, 2020 at 11:36 AM Shiyan Xu 
wrote:

> Thanks for the +1. Filed https://issues.apache.org/jira/browse/HUDI-1058
>
> On Sat, Jun 27, 2020 at 11:34 PM Pratyaksh Sharma 
> wrote:
>
> > The suggestion looks good to me as well.
> >
> > On Sun, Jun 28, 2020 at 8:17 AM Sivabalan  wrote:
> >
> > > +1, I just left it as a todo for future patch when I worked on it.
> > >
> > > On Sat, Jun 27, 2020 at 8:32 PM Bhavani Sudha  >
> > > wrote:
> > >
> > > > Hi Raymond,
> > > >
> > > > I am trying to understand  the use case . Can you please provide more
> > > > context on what problem this addresses ?
> > > >
> > > >
> > > > Thanks,
> > > > Sudha
> > > >
> > > > On Fri, Jun 26, 2020 at 9:02 PM Shiyan Xu <
> xu.shiyan.raym...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > A small suggestion: as delta streamer relies on
> `_hoodie_is_deleted`
> > to
> > > > do
> > > > > hard delete, can we make it configurable? as in users can specify
> any
> > > > > boolean field for delete marker and `_hoodie_is_deleted` remains as
> > > > > default.
> > > > >
> > > > > Regards,
> > > > > Raymond
> > > > >
> > > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>
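
For context, a rough sketch of how the default `_hoodie_is_deleted` marker mentioned above drives hard deletes (shown via a Spark datasource upsert on an assumed recordsToDeleteDf; whether your write path honors the flag the same way the delta streamer does is an assumption to verify):

import org.apache.spark.sql.functions.lit

// Sketch: records carrying _hoodie_is_deleted = true are treated as deletes on
// upsert; the proposal above would let any boolean field play this role.
val deletesDf = recordsToDeleteDf.withColumn("_hoodie_is_deleted", lit(true))

deletesDf.write.format("org.apache.hudi")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode("append")
  .save("s3a://bucket/hudi/table")  // assumed base path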





Re: How to extend the timeline server schema to accommodate business metadata

2020-05-31 Thread Balaji Varadarajan
 Hi Mario,
The Timeline Server was designed to serve Hudi metadata for Hudi writers and 
readers. It may not be suitable for serving arbitrary data. But it is an 
interesting thought. Can you elaborate more on what kind of business metadata 
you are looking for? Is this something you are planning to store in commit files? 
Balaji.V

On Sunday, May 31, 2020, 04:22:27 PM PDT, Mario de Sá Vera 
 wrote:  
 
 I see a need for extending the current timeline server schema so that a 
flexible model could be achieved in order to accommodate business metadata.

let me know if that makes sense to anyone here...

Regards,

Mario.
  

Re: hudi dependency conflicts for test

2020-05-20 Thread Balaji Varadarajan
 Thanks for using Hudi. Looking at pom definitions between 0.5.1 and 0.5.2, I 
don't see any difference that could cause this issue. As it works with 0.5.2, I 
am assuming you are not blocked. Let us know otherwise.
Balaji.V
On Wednesday, May 20, 2020, 01:17:08 PM PDT, Lian Jiang 
 wrote:  
 
 Thanks Vinoth.

Below dependency has no conflict:

compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.3.0'
compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.3.0'
compile group: 'org.scala-lang', name: 'scala-library', version: '2.11.11'
compile group: 'com.github.scopt', name: 'scopt_2.11', version: '3.7.1'
compile group: 'com.amazonaws', name: 'aws-java-sdk', version: '1.11.297'
compile group: 'org.apache.hudi', name: 'hudi-spark-bundle_2.11',
version: '0.5.2-incubating'
testCompile group: 'junit', name: 'junit', version: '4.12'
testCompile group: 'org.scalatest', name: 'scalatest_2.11', version:
'3.2.0-SNAP7'
testCompile group: 'org.mockito', name: 'mockito-scala_2.11', version: '1.5.12'
compile group: 'org.apache.iceberg', name: 'iceberg-api', version:
'0.8.0-incubating'

Cheers!


On Wed, May 20, 2020 at 5:00 AM Vinoth Chandar  wrote:

> Hi Leon,
>
> Sorry for the late reply.  Seems like a version mismatch for mockito..
> I see you are already trying to exclude it though.. Could you share the
> full stack trace?
>
>
>
>
> On Mon, May 18, 2020 at 1:12 PM Lian Jiang  wrote:
>
> > Hi,
> >
> > I am using hudi in a scala gradle project:
> >
> > dependencies {
> >    compile group: 'org.apache.spark', name: 'spark-core_2.11', version:
> > '2.4.4'
> >    compile group: 'org.apache.spark', name: 'spark-sql_2.11', version:
> > '2.4.4'
> >    compile group: 'org.scala-lang', name: 'scala-library', version:
> > '2.11.11'
> >    compile group: 'com.github.scopt', name: 'scopt_2.11', version:
> '3.7.1'
> >    compile group: 'org.apache.spark', name: 'spark-avro_2.11', version:
> > '2.4.4'
> >    compile group: 'com.amazonaws', name: 'aws-java-sdk', version:
> > '1.11.297'
> >    compile group: 'com.zillow.datacontracts', name:
> > 'contract-evaluation-library', version: '0.1.0.master.98a438b'
> >    compile (group: 'org.apache.hudi', name: 'hudi-spark_2.11',
> > version: '0.5.1-incubating') {
> >        exclude group: 'org.scala-lang', module: 'scala-library'
> >        exclude group: 'org.scalatest', module: 'scalatest_2.12'
> >    }
> >
> >    testCompile group: 'junit', name: 'junit', version: '4.12'
> >    testCompile group: 'org.scalatest', name: 'scalatest_2.11',
> > version: '3.2.0-SNAP7'
> >    testCompile group: 'org.mockito', name: 'mockito-scala_2.11',
> > version: '1.5.12'
> > }
> >
> > Below code throws exception '
> > java.lang.NoSuchMethodError:
> >
> >
> org.scalatest.mockito.MockitoSugar.$init$(Lorg/scalatest/mockito/MockitoSugar;)V'
> >
> > import org.junit.runner.RunWith
> > import org.scalatest.FunSuite
> > import org.scalatest.junit.JUnitRunner
> > import org.scalatest.mockito.MockitoSugar
> >
> > @RunWith(classOf[JUnitRunner])
> > class BaseTest extends FunSuite with MockitoSugar {
> > }
> >
> > Removing org.apache.hudi from the dependency list will make the code
> > work. Does anybody know how to include hudi dependency without
> > conflicting with the test?
> >
> > Appreciate any help!
> >
> > Regards
> >
> > Leon
> >
>



  

Re: Apache Hudi Graduation vote on general@incubator

2020-05-19 Thread Balaji Varadarajan
 Terrific job :) We are marching on !!
Balaji.V
On Tuesday, May 19, 2020, 05:16:57 PM PDT, Sivabalan  
wrote:  
 
 wow ! 19 binding votes. Great :)


On Tue, May 19, 2020 at 1:55 AM lamber-ken  wrote:

>
>
>
> Great job! And good luck to the Apache Hudi project.
>
>
>
>
> Best,
> Lamber-Ken
>
> At 2020-05-19 13:35:11, "Vinoth Chandar"  wrote:
> >Folks,
> >
> >the vote has passed!
> >
> https://lists.apache.org/thread.html/r86278a1a69bbf340fa028aca784869297bd20ab50a71f4006669cdb5%40%3Cgeneral.incubator.apache.org%3E
> >
> >
> >I will follow up with the next step [1], which is to submit the resolution
> >to the board.
> >
> >[1]
> >
> https://incubator.apache.org/guides/graduation.html#submission_of_the_resolution_to_the_board
> >
> >On Sun, May 17, 2020 at 7:14 PM 岳伟  wrote:
> >
> >> +1 Graduate Apache Hudi from the Incubator
> >>
> >>
> >>
> >>
> >> Harvey Yue
> >>
> >>
> >> On 05/16/2020 22:49,hamid pirahesh wrote:
> >> [x ] +1 Graduate Apache Hudi from the Incubator.>
> >>
> >> On Fri, May 15, 2020 at 7:06 PM Vinoth Chandar 
> wrote:
> >>
> >> Hello all,
> >>
> >> Just started the VOTE on the IPMC general list [1]
> >>
> >> If you are an IPMC member, you do a *binding *vote
> >> If you are not, you can still do a *non-binding* vote
> >>
> >> Please take a moment to vote.
> >>
> >> [1]
> >>
> >>
> >>
> https://lists.apache.org/thread.html/r8039c8eece636df8c81a24c26965f5c1556a3c6404de02912d6455b4%40%3Cgeneral.incubator.apache.org%3E
> >>
> >> Thanks
> >> Vinoth
> >>
> >>
>


-- 
Regards,
-Sivabalan  

Re: [VOTE] Apache Hudi graduation to top level project

2020-05-06 Thread Balaji Varadarajan
 +1
Balaji.V
On Wednesday, May 6, 2020, 10:36:14 PM PDT, tison  
wrote:  
 
 +1 Good luck!

Best,
tison.


Prasanna Rajaperumal  wrote on Thursday, May 7, 2020 at 12:59 PM:

> +1
>
> On 2020/05/06 20:55:48, Vinoth Chandar  wrote:
> > Hello all,
> >
> > Per our discussion on the dev mailing list (
> >
> https://lists.apache.org/thread.html/rc98303d9f09665af90ab517ea0baeb7c374e9a5478d8424311e285cd%40%3Cdev.hudi.apache.org%3E
> > )
> >
> > I would like to call a VOTE for Apache Hudi graduating as a top level
> > project.
> >
> > If this vote passes, the next step would be to submit the resolution
> below
> > to the Incubator PMC, who would vote on sending it on to the Apache
> Board.
> >
> > Vote:
> > [ ] +1 - Recommend graduation of Apache Hudi as a TLP
> > [ ] -1 - Do not recommend graduation of Apache Hudi because...
> >
> > The VOTE is open for a minimum of 72 hours.
> >
> > Establish the Apache Hudi Project
> >
> > WHEREAS, the Board of Directors deems it to be in the best interests of
> the
> > Foundation and consistent with the Foundation's purpose to establish a
> > Project Management Committee charged with the creation and maintenance of
> > open-source software, for distribution at no charge to the public,
> related
> > to providing atomic upserts and incremental data streams on Big Data.
> >
> > NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee
> (PMC),
> > to be known as the "Apache Hudi Project", be and hereby is established
> > pursuant to Bylaws of the Foundation; and be it further
> >
> > RESOLVED, that the Apache Hudi Project be and hereby is responsible for
> the
> > creation and maintenance of software related to providing atomic upserts
> > and incremental data streams on Big Data; and be it further
> >
> > RESOLVED, that the office of "Vice President, Apache Hudi" be and hereby
> is
> > created, the person holding such office to serve at the direction of the
> > Board of Directors as the chair of the Apache Hudi Project, and to have
> > primary responsibility for management of the projects within the scope of
> > responsibility of the Apache Hudi Project; and be it further
> >
> > RESOLVED, that the persons listed immediately below be and hereby are
> appointed
> > to serve as the initial members of the Apache Hudi Project:
> >
> >  * Anbu Cheeralan                          
> >
> >  * Balaji Varadarajan                        
> >
> >  * Bhavani Sudha Saktheeswaran  
> >
> >  * Luciano Resende                        
> >
> >  * Nishith Agarwal                            
> >
> >  * Prasanna Rajaperumal                
> >
> >  * Shaofeng Li  
> >
> >  * Steve Blackmon                          
> >
> >  * Suneel Marthi                              
> >
> >  * Thomas Weise                
> >
> >  * Vino Yang                                  
> >
> >  * Vinoth Chandar        
> >
> > NOW, THEREFORE, BE IT FURTHER RESOLVED, that Vinoth Chandar be appointed
> to
> > the office of Vice President, Apache Hudi, to serve in accordance with
> and
> > subject to the direction of the Board of Directors and the Bylaws of the
> > Foundation until death, resignation, retirement, removal of
> > disqualification, or until a successor is appointed; and
> >
> > be it further
> >
> > RESOLVED, that the Apache Hudi Project be and hereby is tasked with the
> > migration and rationalization of the Apache Incubator Hudi podling; and
> >
> > be it further
> >
> > RESOLVED, that all responsibilities pertaining to the Apache Incubator
> Hudi
> > podling encumbered upon the Apache Incubator PMC are hereafter
> discharged.
> >
>  

Re: [DISCUSS] Next Release timeline

2020-04-26 Thread Balaji Varadarajan
+1 on Sudha being RM and targeting the next release for mid-May.

Balaji.V

On 2020/04/23 14:27:46, Vinoth Chandar  wrote: 
> Thanks all. Encourage everyone to chime in more, so we can make a decision
> here!
> 
> On Thu, Apr 23, 2020 at 6:29 AM Sivabalan  wrote:
> 
> > sounds good. We could go with a major by mid may.
> >
> > On Wed, Apr 22, 2020 at 12:58 PM Vinoth Chandar  wrote:
> >
> > > +1 on Sudha being the RM
> > >
> > > My preference would be to do a major release as well, targeting mid may
> > > (which means code freeze in 3 weeks?)
> > > This gives us enough time to land some major features as well as
> > stabilize
> > > them as much as possible.
> > >
> > > On Wed, Apr 22, 2020 at 3:21 AM Pratyaksh Sharma 
> > > wrote:
> > >
> > > > Major release looks good to me.
> > > >
> > > > On Wed, Apr 22, 2020 at 2:29 PM Bhavani Sudha  > >
> > > > wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > I wanted to kick start the discussion on timeline and logistics for
> > the
> > > > > next release. Here are couple things we need to figure out.
> > > > >
> > > > >1. Should the next release be a minor or major release?
> > > > >2. If its a minor release do we move the master back to 0.5.3 (
> > > > >currently the master is at 0.6.0-SNAPSHOT).
> > > > >3. Depending on minor or major release what is the timeline we
> > > should
> > > > >target?
> > > > >
> > > > > Here is my opinion:
> > > > > In addition to bug fixes, we have major features ( bootstrap, new
> > > indexes
> > > > > and bulk insert mode ) - either ready or almost ready. Hence, I
> > propose
> > > > we
> > > > > go with a major release. Assuming its a major release, may be mid of
> > > May
> > > > > might be a good timeline?
> > > > >
> > > > >
> > > > > I can volunteer to be the release manager if the community is okay
> > with
> > > > it.
> > > > > Please share your thoughts.
> > > > >
> > > > > Thanks,
> > > > > Sudha
> > > > >
> > > >
> > >
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
> 


Re: [DISCUSS] Bug bash?

2020-04-22 Thread Balaji Varadarajan
 +1. Would also be great if folks sign up for testing/trying out the master 
branch in their real environments.
On Wednesday, April 22, 2020, 02:48:13 PM PDT, Bhavani Sudha 
 wrote:  
 
 +1 Sounds like a good idea

On Wed, Apr 22, 2020 at 1:51 PM Vinoth Chandar  wrote:

> Just floating a very random idea here. :)
>
> Would there be interest in doing a bug bash for a week, where we
> aggressively close out some pesky bugs that have been lingering around.. If
> enough committers and contributors are around, we can move the needle. We
> could time this a week before cutting RC for next release.
>
> Thanks
> Vinoth
>
  

Re: [DISCUSS] Support popular metrics reporter

2020-04-22 Thread Balaji Varadarajan
 +1 
On Wednesday, April 22, 2020, 08:35:30 AM PDT, leesf  
wrote:  
 
 +1

Vinoth Chandar  wrote on Wednesday, April 22, 2020 at 2:24 PM:

> +1 from me as well
>
> On Mon, Apr 20, 2020 at 9:37 PM vino yang  wrote:
>
> > Hi Raymond,
> >
> > Thanks for opening this discussion.
> >
> > IMHO, as Hudi's user base grows, we need to enhance our metrics reporter.
> > From an ecological point of view, this is also very important.
> >
> > So, +1 from my side.
> >
> > Best,
> > Vino
> >
> > Shiyan Xu  wrote on Tuesday, April 21, 2020 at 10:59 AM:
> >
> > > Hi all,
> > >
> > > I'd like raise the topic of supporting multiple metrics reporters.
> > >
> > > Currently hudi supports graphite and JMX. And there are 2 proposed
> > reporter
> > > types: CSV and Prometheus
> > > https://jira.apache.org/jira/browse/HUDI-210
> > > https://jira.apache.org/jira/browse/HUDI-361
> > >
> > > I think supporting multiple metrics backends gives Hudi competitive
> > > advantage on user expansion. It reduces the friction for different
> > > organizations to adopt Hudi. And we only need to support a few popular
> > ones
> > > to achieve that.
> > >
> > > In terms of determining the list, as mentioned by @vinoyang, flink has
> a
> > > nice list of supported ones:
> > >
> > >
> >
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html#reporter
> > > which can be used as a reference.
> > >
> > > From that list, I'd like to propose supporting Datadog as well, due to
> > its
> > > popularity. May I get +1 on this?
> > >
> > > Thank you.
> > >
> > > Regards,
> > > Raymond
> > >
> >
>  

Re: [DISCUSS] Insert Overwrite with snapshot isolation

2020-04-16 Thread Balaji Varadarajan
 
>A new file slice (empty parquet) is indeed generated for every file group
in a partition.
>> we could just reuse the existing file groups right? probably is bit
hacky...
Sorry for the confusion. I meant to say the empty file slice is only for 
file-groups which do not have any incoming records assigned. This is for the 
case when we have fewer incoming records than would fit into all existing file-groups. 
Existing file groups will be reused. 
Agreed on the MAGIC part.
Balaji.V
On Thursday, April 16, 2020, 11:11:06 AM PDT, Vinoth Chandar 
 wrote:  
 
 >A new file slice (empty parquet) is indeed generated for every file group
in a partition.
we could just reuse the existing file groups right? probably is bit
hacky...

>we can encode some MAGIC in the write-token component for Hudi readers to
skip these files so that they can be safely removed.
This kind of MAGIC worries me :) ..  if it comes to that, I suggest, lets
get a version of metadata management along lines of RFC-15/timeline server
going before implementing this.

On Thu, Apr 16, 2020 at 10:55 AM vbal...@apache.org 
wrote:

>  Satish,
> Thanks for the proposal. I think a RFC would be useful here. Let me know
> your thoughts. It would be good to nail other details like whether/how to
> deal with external index management with this API.
> Thanks,Balaji.V
>    On Thursday, April 16, 2020, 10:46:19 AM PDT, Balaji Varadarajan
>  wrote:
>
>
> +1 from me. This is a really cool feature.
> Yes, A new file slice (empty parquet) is indeed generated for every file
> group in a partition.
> Regarding cleaning these "empty" file slices eventually by cleaner (to
> avoid cases where there are too many of them lying around) in a safe way,
> we can encode some MAGIC in the write-token component for Hudi readers to
> skip these files so that they can be safely removed.
> For metadata management, I think it would be useful to distinguish between
> this API and other insert APIs. At the very least, we would need a
> different operation type which can be achieved with same API (with flags).
> Balaji.V
>
>    On Thursday, April 16, 2020, 09:54:09 AM PDT, Vinoth Chandar <
> vin...@apache.org> wrote:
>
>  Hi Satish,
>
Thanks for starting this.. Your use-cases do sound very valuable to
support. So +1 from me.
>
> IIUC, you are implementing a partition level overwrite, where existing
> filegroups will be retained, but instead of merging, you will just reuse
> the file names and write the incoming records into new file slices?
> You probably already thought of this, but one thing to watch out for is :
> we should generate a new file slice for every file group in a partition..
> Otherwise, old data will be visible to queries.
>
> if so, that makes sense to me.  We can discuss more on whether we can
> extend the bulk_insert() API with additional flags instead of a new
> insertOverwrite() API..
>
> Others, thoughts?
>
> Thanks
> Vinoth
>
> On Wed, Apr 15, 2020 at 11:03 AM Satish Kotha  >
> wrote:
>
> > Hello
> >
> > I want to discuss adding a new high level API 'insertOverwrite' on
> > HoodieWriteClient. This API can be used to
> >
> >    -
> >
> >    Overwrite specific partitions with new records
> >    -
> >
> >      Example: partition has  'x' records. If insert overwrite is done
> with
> >      'y' records on that partition, the partition will have just 'y'
> > records (as
> >      opposed to  'x union y' with upsert)
> >      -
> >
> >    Overwrite entire table with new records
> >    -
> >
> >      Overwrite all partitions in the table
> >
> > Usecases:
> >
> > - Tables where the majority of records change every cycle. So it is
> likely
> > efficient to write new data instead of doing upserts.
> >
> > -  Operational tasks to fix a specific corrupted partition. We can do
> > 'insert overwrite'  on that partition with records from the source. This
> > can be much faster than restore and replay for some data sources.
> >
> > The functionality will be similar to hive definition of 'insert
> overwite'.
> > But, doing this in Hoodie will provide better isolation between writer
> and
> > readers. I can share possible implementation choices and some nuances if
> > the community thinks this is a useful feature to add.
> >
> >
> > Appreciate any feedback.
> >
> >
> > Thanks
> >
> > Satish
> >
>
  

Re: [DISCUSS] Insert Overwrite with snapshot isolation

2020-04-16 Thread Balaji Varadarajan
 
+1 from me. This is a really cool feature. 
Yes, A new file slice (empty parquet) is indeed generated for every file group 
in a partition. 
Regarding cleaning these "empty" file slices eventually by cleaner (to avoid 
cases where there are too many of them lying around) in a safe way, we can 
encode some MAGIC in the write-token component for Hudi readers to skip these 
files so that they can be safely removed. 
For metadata management, I think it would be useful to distinguish between this 
API and other insert APIs. At the very least, we would need a different 
operation type which can be achieved with same API (with flags).
Balaji.V

On Thursday, April 16, 2020, 09:54:09 AM PDT, Vinoth Chandar 
 wrote:  
 
 Hi Satish,

Thanks for starting this.. Your use-cases do sound very valuable to
support. So +1 from me.

IIUC, you are implementing a partition level overwrite, where existing
filegroups will be retained, but instead of merging, you will just reuse
the file names and write the incoming records into new file slices?
You probably already thought of this, but one thing to watch out for is :
we should generate a new file slice for every file group in a partition..
Otherwise, old data will be visible to queries.

if so, that makes sense to me.  We can discuss more on whether we can
extend the bulk_insert() API with additional flags instead of a new
insertOverwrite() API..

Others, thoughts?

Thanks
Vinoth

On Wed, Apr 15, 2020 at 11:03 AM Satish Kotha 
wrote:

> Hello
>
> I want to discuss adding a new high level API 'insertOverwrite' on
> HoodieWriteClient. This API can be used to
>
>    -
>
>    Overwrite specific partitions with new records
>    -
>
>      Example: partition has  'x' records. If insert overwrite is done with
>      'y' records on that partition, the partition will have just 'y'
> records (as
>      opposed to  'x union y' with upsert)
>      -
>
>    Overwrite entire table with new records
>    -
>
>      Overwrite all partitions in the table
>
> Usecases:
>
> - Tables where the majority of records change every cycle. So it is likely
> efficient to write new data instead of doing upserts.
>
> -  Operational tasks to fix a specific corrupted partition. We can do
> 'insert overwrite'  on that partition with records from the source. This
> can be much faster than restore and replay for some data sources.
>
> The functionality will be similar to hive definition of 'insert overwite'.
> But, doing this in Hoodie will provide better isolation between writer and
> readers. I can share possible implementation choices and some nuances if
> the community thinks this is a useful feature to add.
>
>
> Appreciate any feedback.
>
>
> Thanks
>
> Satish
>
  

Re: New PPMC Member : Bhavani Sudha

2020-04-07 Thread Balaji Varadarajan
 Congratulations Sudha :) Well deserved.  Welcome to PPMC. 
Balaji.V

On Tuesday, April 7, 2020, 03:04:37 PM PDT, Gary Li 
 wrote:  
 
 Congrats Sudha! Appreciated all the work you have done!

On Tue, Apr 7, 2020 at 2:57 PM Y Ethan Guo  wrote:

> Congrats!!!
>
> On Tue, Apr 7, 2020 at 2:55 PM Vinoth Chandar  wrote:
>
> > Hello all,
> >
> > I am very excited to share that we have new PPMC member - Sudha. She has
> > been a great champion for the project for almost couple years now,
> driving
> > a lot of presto/query engine facing changes and most of all being the
> face
> > of our community to new users on Slack, over the past few months.
> >
> > Please join me in congratulating her!
> >
> > On behalf of Hudi PPMC,
> > Vinoth
> >
>
  

Re: New Committer: lamber-ken

2020-04-07 Thread Balaji Varadarajan
 Many Congratulations Lamber-Ken.  Well deserved !!
Balaji.V
On Tuesday, April 7, 2020, 02:23:51 PM PDT, Y Ethan Guo 
 wrote:  
 
 Congrats!!!

On Tue, Apr 7, 2020 at 2:22 PM Gary Li  wrote:

> Congrats lamber! Well deserved!
>
> On Tue, Apr 7, 2020 at 2:18 PM Vinoth Chandar  wrote:
>
> > Hello Apache Hudi Community,
> >
> > The Podling Project Management Committee (PPMC) for Apache
> > Hudi (Incubating) has invited lamber-ken (Xie Lei) to become a committer
> > and we are pleased to announce that he has accepted.
> >
> > lamber-ken has had a large impact in hudi, with some sustained efforts
> > in the past several months. He has rebuilt our site ground up, automated
> > doc workflows, helped fixed a lot of bugs and also been super helpful for
> > the community at large.
> >
> > Congratulations lamber-ken !! Please join me in recognizing his efforts!
> >
> > On behalf of PPMC,
> > Vinoth
> >
>
  

Re: [DISSCUSS] Troubleshooting flow

2020-04-06 Thread Balaji Varadarajan
 Agree. The triaging process makes sense to me.
Balaji.V
On Monday, April 6, 2020, 09:54:24 AM PDT, Vinoth Chandar 
 wrote:  
 
 Hi,

I feel there are a couple of action items here..

a) JIRA to track work for slack-ML integration
b) Document the support triaging process : Slack (level 1) -> Github Issues
(level 2 , triage, root cause) -> JIRA (level 3, file bug, get resolution)
.

P.S: Mailing List is very similar to Slack as well IMO.. i.e mostly level 1
things (w.r.t to triaging issues). Do you all agree?

Thanks
Vinoth

On Sat, Apr 4, 2020 at 3:03 AM leesf  wrote:

> Sorry to chime in so late. In fact, we did discuss integrating Slack with the
> dev ML before [1], but it seems it needs some other work before it can function.
> In order to reduce repetitive workload, I am +1 to move some debugging
> questions to GH issues, which can be easily searched.
>
> [1]
>
> https://lists.apache.org/thread.html/r0575d916663f826a5078363ec913c53360afb372471061aa60fd380c%40%3Cdev.hudi.apache.org%3E
>
> lamber-ken  wrote on Saturday, April 4, 2020 at 12:47 AM:
>
> >
> >
> > Thanks you all,
> >
> >
> > Agree with Sudha, it's ok to answer simple questions and move debugging
> > type of questions to GH issues.
> > So, let's try to guide users who asking debugging questions to use GH
> > issues if possible.
> >
> >
> > Thanks,
> > Lamber-Ken
> >
> >
> >
> >
> >
> > At 2020-04-03 07:19:26, "Bhavani Sudha"  wrote:
> > >Also one thing I wanted to note. I feel it should be okay to answer
> simple
> > >`what does this mean` type of questions in slack and move debugging type
> > of
> > >questions to GH issues. What do you all think?
> > >
> > >Thanks,
> > >Sudha
> > >
> > >On Thu, Apr 2, 2020 at 11:45 AM Bhavani Sudha 
> > >wrote:
> > >
> > >> Agree on using GH issues to post code snippets or debugging issues.
> > >>
> > >> Regarding mirroring slack to commits, the last time I checked there
> was
> > no
> > >> options that was readily available ( there were one or two paid
> > products).
> > >> It looked like we can possibly develop our own IFTT/ web hook on
> slack.
> > Not
> > >> sure how much of work that is.
> > >>
> > >>
> > >> Thanks,
> > >> Sudha
> > >>
> > >>
> > >> On Thu, Apr 2, 2020 at 8:40 AM Vinoth Chandar 
> > wrote:
> > >>
> > >>> Hello all,
> > >>>
> > >>> Actually that's how we have been using GH issues.. Both slack/ml are
> > >>> inconvenient for sharing code and having long threaded conversations.
> > >>> (same
> > >>> issues raised here).
> > >>>
> > >>> That said, we could definitely formalize this and look to move slack
> > >>> threads into GH issue for triaging (then follow up with JIRA, if real
> > bug)
> > >>> before they get too long.
> > >>>
> > >>> >>slack has some answerbot to auto reply and promote users to create
> GH
> > >>> issues.
> > >>> Worth looking into.. There was also a conversation around mirroring
> > >>> #general into commits or something for indexing/searching.. ?
> > >>>
> > >>>
> > >>> On Thu, Apr 2, 2020 at 1:36 AM vino yang 
> wrote:
> > >>>
> > >>> > Hi Lamber-Ken,
> > >>> >
> > >>> > Thanks for rasing this problem.
> > >>> >
> > >>> > >> 3. threads cann't be indexed by search engines
> > >>> >
> > >>> > Yes, I always thought that it would be better to have a "users" ML,
> > but
> > >>> it
> > >>> > is not clear whether only the Top-Level Project can have this ML.
> > >>> >
> > >>> > Best,
> > >>> > Vino
> > >>> >
> > >>> >
> > >>> > > Shiyan Xu  wrote on Wednesday, April 1, 2020 at 4:54 AM:
> > >>> >
> > >>> > > Good idea to use GH issues as triage.
> > >>> > >
> > >>> > > Not sure if slack has some answerbot to auto reply and promote
> > users
> > >>> to
> > >>> > > create GH issues. If it can be configured that way, that'd be
> great
> > >>> for
> > >>> > > this purpose :)
> > >>> > >
> > >>> > > On Tue, 31 Mar 2020, 10:03 lamberken,  wrote:
> > >>> > >
> > >>> > > > Hi team,
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > > > Many users use slack ask for support when they met bugs /
> > problems
> > >>> > > > currently.
> > >>> > > >
> > >>> > > > but there are some disadvantages we need to consider:
> > >>> > > >
> > >>> > > > 1. code snippet display is not friendly.
> > >>> > > >
> > >>> > > > 2. we may miss some questions when questions come up at the
> same
> > >>> time.
> > >>> > > >
> > >>> > > > 3. threads cann't be indexed by search engines
> > >>> > > >
> > >>> > > > ...
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > > > So, I suggest we should guide users to use GitHub issues as
> much
> > as
> > >>> we
> > >>> > > can.
> > >>> > > >
> > >>> > > > step1: guide users use GitHub issues to report their questions
> > >>> > > >
> > >>> > > > step2: developers can pick up some issues which they are
> > interested
> > >>> in.
> > >>> > > >
> > >>> > > > step3: raise a related JIRA if needed
> > >>> > > >
> > >>> > > > step4: add some useful notes to troubleshooting guide
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > > > Any thoughts are welcome, thanks : )
> > >>> > > >
> > >>> > > 

Re: Query regarding restoring HUDI tables to older commits

2020-03-22 Thread Balaji Varadarajan
 Vinoth,
Yes, I agree. Reverting completed operations when writers are stopped is safe. 
Balaji.V
On Saturday, March 21, 2020, 08:04:10 PM PDT, Vinoth Chandar 
 wrote:  
 
 Hi all,

Good discussion. Let me try and tease this apart.

Rollback: should only be used for rolling back an inflight write, nothing else
IMO. This is where we guarantee that there will be no impact
to readers/query engines.

Restore: it's an invasive maintenance operation that will be disruptive
to queries that are currently running.

To Prashant's point, I think it will be cleaner to restore the timeline to
not have any actions > the restored instant time?  Note that with MOR, we
may have logged data blocks belonging to multiple instants into the same log
file and we may have to log additional rollback blocks?

@balaji , if we mandate ingest job be stopped/bounced during restore
anyway, I think it should be safe right? We have a clean log based design
where the cleaner will just work off what's in the timeline and reach the
same state again (well, not same to same, but equivalent, since input could
be larger/different)..

If you all agree, can we maybe talk about gaps in our implementation
around restores today?

Thanks
Vinoth










On Wed, Mar 18, 2020 at 12:21 PM Balajee Nagasubramaniam
 wrote:

> Hi Prashant,
>
> Regarding clean vs rollback/restoreToInstant, if you think of all the
> commits/datafiles in the active timeline as a queue of items,
> rollback/restoreToInstant would be working on the head of the queue whereas
> clean would be working on the tail of the queue. They should be treated as
> two independent operations on the queue. At datafile/file-slice level, if
> cleaner is configured to maintain 3 versions of the file, then you can
> rollback at most 2 recent versions. Hope this helps.
>
> Thanks,
> Balajee
>
> On Wed, Mar 18, 2020 at 11:54 AM Prashant Wason 
> wrote:
>
> > Thanks for the info Vinoth / Balaji.
> >
> > To me it feels a split between easier-to-understand design and
> > current-implementation. I feel it is simpler to reason (based on how file
> > systems work in general) that restoreToInstant is a complete
> point-in-time
> > shift to the past (like restoring a file system from a snapshot/backup).
> >
> > If I have restored the Table to commitTime=005, then having any instants
> > with commitTime > 005 are confusing as it implies that even though my
> table
> > is at an older time, some future operations will be applied onto it at
> some
> > point.
> >
> > I will have to read more about incremental timeline syncing and timeline
> > server to understand how it uses the clean instants. BTW, the comment on
> > the function HoodieWriteClient::restoreToInstant reads "NOTE : This
> action
> > requires all writers (ingest and compact) to a table to be stopped before
> > proceeding". So probably the embedded timeline server can recreate the
> view
> > next time it comes back up?
> >
> > Thanks
> > Prashant
> >
> >
> > On Wed, Mar 18, 2020 at 11:37 AM Balaji Varadarajan
> >  wrote:
> >
> > >  Prashanth,
> > > I think we should not be reverting clean operations here. Cleans are
> done
> > > on the oldest file slices and a restore/rollback is not completely
> > undoing
> > > the work of clean that happened before it.
> > > For incremental timeline syncing, embedded timeline server needs to
> read
> > > these clean metadata to sync its cached file-system view.
> > > Let me know your thoughts.
> > > Balaji.V
> > >    On Wednesday, March 18, 2020, 11:23:09 AM PDT, Prashant Wason
> > >  wrote:
> > >
> > >  HI Team,
> > >
> > > I noticed that when a table is restored to a previous commit (
> > > HoodieWriteClient::restoreToInstant
> > > <
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_incubator-2Dhudi_blob_master_hudi-2Dclient_src_main_java_org_apache_hudi_client_HoodieWriteClient.java-23L735=DwIFaQ=r2dcLCtU9q6n0vrtnDw9vg=c89AU9T1AVhM4r2Xi3ctZA=ASTWkm7UUMnhZ7sBzpXGPkTc1PhNTJeO7q5IXlBCprY=43rqua7SdhvO91hA0ZhOPNQw8ON1nL3bAsCue5o8aYw=
> > > >),
> > > only the COMMIT, DELTA_COMMIT and COMPACTION instants are rolled back
> and
> > > their corresponding files are deleted from the timeline. If there are
> > some
> > > CLEAN instants, they are left over.
> > >
> > > Is there a reason why CLEAN are not removed? Won't they be referring to
> > > files  which are no longer present and hence not useful?
> > >
> > > Thanks
> > > Prashant
> > >
> >
>  

Re: Could not load key generator class org.apache.hudi.ComplexKeyGenerator

2020-03-21 Thread Balaji Varadarajan
 With 0.5.1, the key-generator classes were relocated to org.apache.hudi.keygen.
You can find this information in the release notes at 
https://hudi.incubator.apache.org/releases.html#release-051-incubating-docs
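
For example, a datasource write would now reference the relocated class like this (other options and the path are assumptions; the same keygenerator property would also go into the DeltaStreamer properties used by the spark-submit command below):

// Sketch: with 0.5.1+, point the key generator config at the relocated package.
df.write.format("org.apache.hudi")
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
  .option("hoodie.datasource.write.recordkey.field", "id1,id2")  // assumed composite key
  .mode("append")
  .save("s3a://bucket/hudi/table")                               // assumed base path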
Balaji.V
On Saturday, March 21, 2020, 01:47:48 PM PDT, FO O 
 wrote:  
 
 Hi,

When trying to use a ComplexKeyGenerator with 0.5.1 I get the following
error:
"Exception in thread "main" java.io.IOException: Could not load key
generator class org.apache.hudi.ComplexKeyGenerator"

https://gist.github.com/fo3310001/5a998b73a95f734b2852fc1c8689dd62


The command I am using is passing the hudi-utilities-bundle:

 spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  --packages
org.apache.spark:spark-avro_2.11:2.4.4  --master yarn --deploy-mode client
 /usr/lib/hudi/hudi-utilities-bundle_2.11-0.5.1-incubating.jar (...)

Any pointers would be appreciated.

Thank you!
  

Re: Query regarding restoring HUDI tables to older commits

2020-03-18 Thread Balaji Varadarajan
 Prashanth,
I think we should not be reverting clean operations here. Cleans are done on 
the oldest file slices and a restore/rollback is not completely undoing the 
work of clean that happened before it. 
For incremental timeline syncing, embedded timeline server needs to read these 
clean metadata to sync its cached file-system view.
Let me know your thoughts.
Balaji.V
On Wednesday, March 18, 2020, 11:23:09 AM PDT, Prashant Wason 
 wrote:  
 
 HI Team,

I noticed that when a table is restored to a previous commit
(HoodieWriteClient::restoreToInstant),
only the COMMIT, DELTA_COMMIT and COMPACTION instants are rolled back and
their corresponding files are deleted from the timeline. If there are some
CLEAN instants, they are left over.

Is there a reason why CLEAN are not removed? Won't they be referring to
files  which are no longer present and hence not useful?

Thanks
Prashant
  

Re: [DISCUSS] Restructure hudi-utilities module

2020-03-09 Thread Balaji Varadarajan
 +1 on Vinoth's suggestion of waiting for the lower level (write-client) 
to be refactored and reorganized first. We can then look at the Data Source and 
DeltaStreamer to figure out how best to organize them. 
Balaji.V
On Sunday, March 8, 2020, 11:06:13 PM PDT, Vinoth Chandar 
 wrote:  
 
 >> make delta streamer a engine agnostic part so that Spark and Flink can
share some common logic.

If we make the change at the Write Client level to make it engine agnostic,
it should help with most of the cases.. I believe there will be spark
specific pieces in the Source abstraction since those are using spark
datasources underneath in some cases..  My opinion is that we can first
focus our efforts on making hudi-client agnostic and pluggable with
different engines.. We can tackle deltastreamer down the line once we have
it..

On Wed, Mar 4, 2020 at 6:51 PM vino yang  wrote:

> Hi guys,
>
> My original thought is to make delta streamer a engine agnostic part so
> that Spark and Flink can share some common logic.
>
> >>I am not sure the ROI is there for renaming to hudi-deltastreamer  and
> pull this out.. Everytime we change a module name
>
> Actually, here my suggestion is to move the delta streamer to another new
> module and keep the current hudi-utilities module. Although, in a way,
> moving classes are similar to rename the module name.
>
> >> I propose we leave this module to be spark specific, i.e depending on
> hudi-spark alone
>
> OK, will think to build delta streaming mode via Flink and ignore the
> current implementation of delta streamer.
>
> Best,
> Vino
>
> > Vinoth Chandar  wrote on Thursday, March 5, 2020 at 12:47 AM:
>
> > I am not sure the ROI is there for renaming to hudi-deltastreamer  and
> pull
> > this out.. Everytime we change a module name, its a breaking change and I
> > would prefer if we reserved those for really pressing issues.. or take
> > natural course of development and get there..
> >
> > Regarding how multi framework support would affect this module, I propose
> > we leave this module to be spark specific, i.e depending on hudi-spark
> > alone.. Until, we can make flink work end-end.
> > This feels kind of premature to me.
> >
> > On Wed, Mar 4, 2020 at 8:37 AM Gary Li  wrote:
> >
> > > +1. hudi-delta gives me the feeling that it has something to do with
> > other
> > > frameworks... I’d vote for another name hudi-deltastreamer or
> > hudi-streamer
> > > or hudi-stream.
> > >
> > > On Wed, Mar 4, 2020 at 2:29 AM vino yang 
> wrote:
> > >
> > > > Hi folks,
> > > >
> > > > Currently, it seems the content of hudi-utilities looks a bit mix.
> > > > Summarize all of them, there are two aspects list below:
> > > >
> > > >
> > > >    - delta streamer and its relevant packages, e.g. deltastreamer,
> > > sources,
> > > >    schema, transform, these packages are served for delta streamer.
> > > >    - Some utility tools such as
> > > >    HDFSParquetImporter、HiveIncrementalPuller、HoodieCleaner and so on
> > > >
> > > >
> > > > We are trying to refactor the computing engine relevant business
> logic.
> > > > Delta Streamer (especially, the sources package is a start point of a
> > job
> > > > of Spark/Flink) will be affected. Doing this restructure can make the
> > > work
> > > > more clear and focus.
> > > >
> > > > I would like to start a proposal to restructure the hudi-utilites
> > module.
> > > > Considering delta streamer is a great feature for hudi, the logic is
> > very
> > > > much in the hudi-utilites. Can we raise its importance via making the
> > > delta
> > > > streamer as a single module? It could be named e.g. hudi-delta or
> > > something
> > > > else. Then let the hudi-utilities be a real utilities module to host
> > > > HDFSParquetImporter、HiveIncrementalPuller、HoodieCleaner tools.
> > > >
> > > > In short, we can do these restructure works:
> > > >
> > > >
> > > >    - create a new module, named “hudi-delta” (or other name?) and
> move
> > > the
> > > >    deltastreamer, sources, schema, transform … packages into this
> > module
> > > >    - leave HDFSParquetImporter、HiveIncrementalPuller、HoodieCleaner …
> in
> > > the
> > > >    current place (utilities module)
> > > >
> > > > What do you think?
> > > >
> > > > Any comments and suggestions are welcome and appreciated.
> > > >
> > > > Best,
> > > > Vino
> > > >
> > >
> >
>  

Re: [ANNOUNCE] Code is frozen for next release(0.5.2)

2020-02-29 Thread Balaji Varadarajan

+1 on cutting the branch. 
Vino, let us know in this thread if you run into any problems in the release 
process.
Balaji. V
Sent from Yahoo Mail for iPhone


On Saturday, February 29, 2020, 9:19 AM, Vinoth Chandar  
wrote:

Great!  Can we cut the release candidate branch 0.5.2 right away so that
master PRs can go ahead and merge?

On Sat, Feb 29, 2020 at 1:55 AM vino yang  wrote:

> Hi all,
>
> Based on our previously agreed conclusions on the mailing list[1]. It's
> time to freeze the code.
>
> I hereby inform you that the code is frozen now.
>
> In this release our focus is on addressing related Apache compliance
> issues. So far, there are still some issues(license and notice) to be
> resolved, and we will focus on solving them after the code freezes.
>
> And 0.5.2-RC1 will be sent in the next few days, thank you.
>
> Best,
> Vino
>
> [1]:
>
> https://lists.apache.org/thread.html/rfe3d4c9d89e9501b3d2993955a99d923081060d53a7b9d07c0843f7d%40%3Cdev.hudi.apache.org%3E
>





Re: Need clarity on these test cases in TestHoodieDeltaStreamer

2020-02-27 Thread Balaji Varadarajan

Awesome Pratyaksh, would you mind opening a PR to document it?
Balaji.V

Sent from Yahoo Mail for iPhone


On Wednesday, February 26, 2020, 11:14 PM, Pratyaksh Sharma 
 wrote:

Hi,

I figured out the issue yesterday. Thank you for helping me out.

On Thu, Feb 27, 2020 at 12:01 AM vbal...@apache.org 
wrote:

>
> This change was done as part of adding delete API support :
> https://github.com/apache/incubator-hudi/commit/7031445eb3cae5a4557786c7eb080944320609aa
>
> I don't remember the reason behind this.
> Sivabalan, Can you explain the reason when you get a chance.
> Thanks,Balaji.V
>    On Wednesday, February 26, 2020, 06:03:53 AM PST, Pratyaksh Sharma <
> pratyaks...@gmail.com> wrote:
>
>  Anybody got a chance to look at this?
>
> On Mon, Feb 24, 2020 at 1:04 AM Pratyaksh Sharma 
> wrote:
>
> > Hi,
> >
> > While working on one of my PRs, I am stuck with the following test cases
> > in TestHoodieDeltaStreamer -
> > 1. testUpsertsCOWContinuousMode
> > 2. testUpsertsMORContinuousMode
> >
> > For both of them, at line [1] and [2], we are adding 200 to totalRecords
> > while asserting record count and distance count respectively. I am unable
> > to understand what do these 200 records correspond to. Any leads are
> > appreciated.
> >
> > I feel probably I am missing some piece of code where I need to do
> changes
> > for the above tests to pass.
> >
> > [1]
> >
> https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java#L425
> > .
> > [2]
> >
> https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java#L426
> > .
> >
> >
>





Re: [DISCUSS] RFC - 08 : Record level indexing mechanisms for Hudi datasets

2020-02-25 Thread Balaji Varadarajan
+1. Let's do it :)

Balaji.V

On Mon, Feb 24, 2020 at 6:36 PM Shiyan Xu 
wrote:

> +1 great reading and values!
>
> On Mon, 24 Feb 2020, 15:31 nishith agarwal,  wrote:
>
> > +100
> > - Reduces index lookup time hence improves job runtime
> > - Paves the way for streaming style ingestion
> > - Eliminates dependency on Hbase (alternate "global index" support at the
> > moment)
> >
> > -Nishith
> >
> > On Mon, Feb 24, 2020 at 10:56 AM Vinoth Chandar 
> wrote:
> >
> > > +1 from me as well. This will be a product defining feature, if we can
> do
> > > it/
> > >
> > > On Sun, Feb 23, 2020 at 6:27 PM vino yang 
> wrote:
> > >
> > > > Hi Sivabalan,
> > > >
> > > > Thanks for your proposal.
> > > >
> > > > Big +1 from my side, indexing for record granularity is really good
> for
> > > > performance. It is also towards the streaming processing.
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > > > Sivabalan  wrote on Sunday, February 23, 2020 at 12:52 AM:
> > > >
> > > > > As Aapche Hudi is getting widely adopted, performance has become
> the
> > > need
> > > > > of the hour. This RFC focusses on improving performance of the Hudi
> > > index
> > > > > by introducing record level index. The proposal is to implement a
> new
> > > > index
> > > > > format that is a mapping of (recordKey <-> partition, fileId) or
> > > > > ((recordKey, partitionPath) → fileId). This mapping will be stored
> > and
> > > > > maintained by Hudi as another implementation of HoodieIndex. This
> > > record
> > > > > level indexing will definitely give a boost to both read and write
> > > > > performance.
> > > > >
> > > > > Here
> > > > > <
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+08+%3A+Record+level+indexing+mechanisms+for+Hudi+datasets
> > > > > >
> > > > > is the link to RFC.
> > > > >
> > > > > Appreciate your review and thoughts.
> > > > >
> > > > > --
> > > > > Regards,
> > > > > -Sivabalan
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] Support for complex record keys with TimestampBasedKeyGenerator

2020-02-25 Thread Balaji Varadarajan
 
See if you can have a generic implementation where individual fields in the 
partition path can be configured with their own key-generator class. Currently, 
TimestampBasedKeyGenerator is the only type-specific custom generator. If we 
are anticipating more such classes for specialized types, you could use a generic 
way to support overriding the key generator for individual partition fields once 
and for all.
Balaji.V
On Monday, February 24, 2020, 03:09:02 AM PST, Pratyaksh Sharma 
 wrote:  
 
 Hi,

We have TimestampBasedKeyGenerator for defining custom partition paths and
we have ComplexKeyGenerator for supporting having combination of fields as
record key or partition key.

However, we do not have support for the case where one wants to have a
combination of fields as the record key along with being able to define custom
partition paths. This use case recently came up at my organisation.

How about having a CustomTimestampBasedKeyGenerator which supports the above
use case? This class can simply extend TimestampBasedKeyGenerator and allow
users to have a combination of fields as the record key.

Open to hearing others' opinions.
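
A purely hypothetical sketch of what such a class might look like (package locations, the constructor, and the exact KeyGenerator API differ across Hudi versions, so treat every name below as an assumption):

import org.apache.avro.generic.GenericRecord
import org.apache.hudi.common.model.HoodieKey
import org.apache.hudi.common.util.TypedProperties          // package location varies by version
import org.apache.hudi.keygen.TimestampBasedKeyGenerator

// Hypothetical sketch: reuse TimestampBasedKeyGenerator's partition-path logic
// while building the record key from a comma-separated list of fields.
class CustomTimestampBasedKeyGenerator(props: TypedProperties)
    extends TimestampBasedKeyGenerator(props) {

  private val recordKeyFields: Seq[String] =
    props.getString("hoodie.datasource.write.recordkey.field").split(",").map(_.trim).toSeq

  override def getKey(record: GenericRecord): HoodieKey = {
    val recordKey = recordKeyFields.map(f => s"$f:${record.get(f)}").mkString(",")
    new HoodieKey(recordKey, super.getKey(record).getPartitionPath)
  }
}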
  

Re: updatePartitionsToTable() is time consuming and redundant.

2020-02-16 Thread Balaji Varadarajan
 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181203, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181202, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181201, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191117, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

Regards,
Purushotham Pushpavanth



On Tue, 4 Feb 2020 at 05:50, Vinoth Chandar  wrote:

> Unfortunately, the mailing list does not support attachments, looks like :(
> Could you paste it inline?
>
> On Sat, Feb 1, 2020 at 6:20 AM Purushotham Pushpavanthar <
> pushpavant...@gmail.com> wrote:
>
> > Hi Balaji,
> >
> > The attachment contains the logs you asked for.
> > However, the only difference between storageValue and
> > fullStoragePartitionPath is *target-base-path*.
> > So if I'm not wrong, the code will be marking all partitions which got
> > UPDATE data for partition update. Hence time consuming.
> >
> > Regards,
> > Purushotham Pushpavanth
> >
> >
> >
> > On Mon, 20 Jan 2020 at 08:58, Balaji Varadarajan
> >  wrote:
> >
> >>  Hi Purushotham,
> >> I am unable to reproduce same  partitions getting hive-synced locally.
> >> Can you add the following log message in HoodieHiveClient.java and run
> the
> >> code and send us logs.
> >> diff --git
> >> a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> >> b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> >>
> >> index 4578bb2f..ba4b1147 100644
> >>
> >> --- a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> >>
> >> +++ b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> >>
> >> @@ -237,6 +237,8 @@ public class HoodieHiveClient {
> >>
> >>          if (!paths.containsKey(storageValue)) {
> >>
> >>
> >> events.add(PartitionEvent.newPartitionAddEvent(storagePartition));
> >>
> >>          } else if
> >> (!paths.get(storageValue).equals(fullStoragePartitionPath)) {
> >>
> >> +          LOG.info("Partition Location changes. StorageVal=" +
> >> storageValue
> >>
> >> +              + ", Existing Hive Path=" + paths.get(storageValue) + ",
> >> New Location=" + fullStoragePartitionPath);
> >>
> >>
> >> events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));
> >>
> >>          }
> >>
> >>        }
> >>
> >> Thanks,
> >> Balaji.V
> >>    On Friday, January 17, 2020, 03:44:08 AM PST, Purushotham
> >> Pushpavanthar  wrote:
> >>
> >>  Hi,
> >>
> >> I noticed that
> >> *org.apache.hudi.hive.HoodieHiveClient#updatePartitionsToTable()* is
> time
> >> consuming while running HUDI on set of records which contains data for
> >> large set of partitions. All it is doing is setting location for each
> >> updated partition path. However,
> >> *org.apache.hudi.hive.HoodieHiveClient#addPartitionsToTable()
> >> *is taking care of adding new partitions to the table.
> >>
> >>  1. For a given table, whose base path doesn't change (usually it
> doesn't
> >>  in production), why *updatePartitionsToTable() *is needed? Can you
> >>  please throw some light on any such case where this is needed?
> >>  2. If it is required, can we do something to optimise the time
> consumed
> >>  by this operation? Currently, the *Alter Statements* are executed one
> by
> >>  one on each (partition, path) pair for every updated partition.
> >>
> >>
> >>
> >> Regards,
> >> Purushotham Pushpavanth
> >>
> >
> >
>
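
The log lines and the equals() comparison quoted above also show why every
partition that received UPDATE data gets flagged: the location registered in
Hive has no filesystem scheme, while the storage-side path is fully qualified
with s3a://, so a plain string comparison never matches. A hedged sketch of a
scheme-insensitive comparison (plain JDK code, not the actual HoodieHiveClient
logic) looks like this:

import java.net.URI;

public class PartitionPathCompareSketch {

  // Compare only the path component, ignoring scheme and authority.
  static boolean sameLocation(String hivePath, String storagePath) {
    return URI.create(hivePath).getPath().equals(URI.create(storagePath).getPath());
  }

  public static void main(String[] args) {
    String hive = "/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203";
    String storage = "s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203";
    System.out.println(sameLocation(hive, storage)); // prints true
  }
}

Whether Hudi should normalize the two paths this way, or always sync the fully
qualified location, is part of what this thread is getting at.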
  

Re: [DISCUSS] Redraw of hudi data lake architecture diagram on landing page

2020-01-23 Thread Balaji Varadarajan
 +1 as well. Looks great.
Balaji.V
On Thursday, January 23, 2020, 08:17:47 AM PST, Vinoth Chandar 
 wrote:  
 
 Looks good . +1 !

On Wed, Jan 22, 2020 at 11:44 PM lamberken  wrote:

>
>
> Hello everyone,
>
>
> I redrew the hudi data lake architecture diagram on the landing page. If you
> have time, please take a look at the hudi website[1] and test site[2].
> Any thoughts are welcome, thanks very much. :)
>
>
> [1] https://hudi.apache.org
> [2] https://lamber-ken.github.io
>
>
> Thanks
> Lamber-Ken
  

Re: [VOTE] Release 0.5.1-incubating, release candidate #1

2020-01-22 Thread Balaji Varadarajan
+1 (binding)

Ran the following validation steps:

1. Checked out RC candidate source code and compiled successfully
2. Ran Apache Hudi quickstart steps successfully on 0.5.1-rc1
3. Ran Long running deltastreamer test for a half day without any
exceptions.
4. Compliance : Ran "./release/validate_staged_release.sh --release=0.5.1
--rc_num=1" successfully

Checking Checksum of Source Release
Checksum Check of Source Release - [OK]
Checking Signature
  Signature Check - [OK]
Checking for binary files in source release
   No Binary Files in Source Release? - [OK]
Checking for DISCLAIMER
   DISCLAIMER file exists ? [OK]
Checking for LICENSE and NOTICE
   License file exists ? [OK]
   Notice file exists ? [OK]
Performing custom Licensing Check
   Licensing Check Passed [OK]
Running RAT Check
   RAT Check Passed [OK]

Thanks,
Balaji.V


On Wed, Jan 22, 2020 at 11:20 AM leesf  wrote:

> Hi everyone,
>
> We have prepared the second apache release candidate for Apache Hudi
> (incubating). The version is : 0.5.1-incubating-rc1. Please review and vote
> on the release candidate #1 for the version 0.5.1, as follows:
>
> [ ] +1, Approve the release
>
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
>
> The complete staging area is available for your review, which includes:
>
> * JIRA release notes [1],
>
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint 623E08E06DB376684FB9599A3F5953147903948A [3],
>
> * all artifacts to be deployed to the Maven Central Repository [4],
>
> * source code tag "release-0.5.1-incubating-rc1" [5],
>
>
>
> The vote will be open for at least 72 hours.
> Please cast your votes before *Jan. 27th 2020, 16:00 UTC*.
>
> It is adopted by majority approval, with at least 3 PMC affirmative votes.
>
>
> Thanks,
> Leesf
>
>
> [1]
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12346183
> [2]
>
> https://dist.apache.org/repos/dist/dev/incubator/hudi/hudi-0.5.1-incubating-rc1/
> [3] https://dist.apache.org/repos/dist/dev/incubator/hudi//KEYS
> [4] https://repository.apache.org/content/repositories/orgapachehudi-1014
> [5]
> https://github.com/apache/incubator-hudi/tree/release-0.5.1-incubating-rc1
>


Re: Would not Stage source releases on dist.apache.org

2020-01-20 Thread Balaji Varadarajan
 Awesome. Let me know if you need anything else.
Balaji.V
On Monday, January 20, 2020, 11:32:08 PM PST, leesf  
wrote:  
 
 Works after using svn checkout https://dist.apache.org/repos/dist/dev/incubator/hudi without --depth=immediates

leesf wrote on Tuesday, January 21, 2020 at 3:07 PM:

> Hi balaji,
>
> I could not find an entry point to create a folder under dev/incubator/hudi,
> do I have no permissions? Please advise. Thanks.
>
> Balaji Varadarajan wrote on Tuesday, January 21, 2020 at 2:14 PM:
>
>>
>> Hi Leesf,
>> The staging directories are intentionally empty. The directories
>> corresponding to 0.5.0-incubating release were deleted from staging
>> directory as the last step of the release. You can create a folder
>> "0.5.1-incubating" under dev/incubator/hudi and add the source release tar
>> balls with checksum there and commit.
>> Thanks,
>> Balaji.V
>> On Monday, January 20, 2020, 09:57:27 PM PST, leesf <
>> leesf0...@gmail.com> wrote:
>>
>>  Hi all,
>>
>> I have completed the steps before step h (Stage source releases on
>> dist.apache.org) according to the release guide[1], but I could not find
>> any code except KEYS in
>> https://dist.apache.org/repos/dist/dev/incubator/hudi/, so could not use
>> svn to check out. Any suggestions?
>>
>> [1]
>>
>> https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+%28incubating%29+-+Release+Guide
>>
>> Best,
>> Leesf
>>
>
>  

Re: Would not Stage source releases on dist.apache.org

2020-01-20 Thread Balaji Varadarajan
 
Hi Leesf,
The staging directories are intentionally empty. The directories corresponding 
to 0.5.0-incubating release were deleted from staging directory as the last 
step of the release. You can create a folder "0.5.1-incubating" under 
dev/incubator/hudi and add the source release tar balls with checksum there and 
commit.
Thanks,
Balaji.V
On Monday, January 20, 2020, 09:57:27 PM PST, leesf  wrote:
 
 Hi all,

 I have completed the steps before step h (Stage source releases on
dist.apache.org) according to the release guide[1], but I could not find
any code except KEYS in
https://dist.apache.org/repos/dist/dev/incubator/hudi/, so could not use
svn to check out. Any suggestions?

[1]
https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+%28incubating%29+-+Release+Guide

Best,
Leesf
  

Re: updatePartitionsToTable() is time consuming and redundant.

2020-01-19 Thread Balaji Varadarajan
 Hi Purushotham,
I am unable to reproduce same  partitions getting hive-synced locally. Can you 
add the following log message in HoodieHiveClient.java and run the code and 
send us logs.
diff --git a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java 
b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java

index 4578bb2f..ba4b1147 100644

--- a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java

+++ b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java

@@ -237,6 +237,8 @@ public class HoodieHiveClient {

         if (!paths.containsKey(storageValue)) {

           events.add(PartitionEvent.newPartitionAddEvent(storagePartition));

         } else if (!paths.get(storageValue).equals(fullStoragePartitionPath)) {

+          LOG.info("Partition Location changes. StorageVal=" + storageValue

+              + ", Existing Hive Path=" + paths.get(storageValue) + ", New 
Location=" + fullStoragePartitionPath);

           events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));

         }

       }

Thanks,
Balaji.V
On Friday, January 17, 2020, 03:44:08 AM PST, Purushotham Pushpavanthar 
 wrote:  
 
 Hi,

I noticed that
*org.apache.hudi.hive.HoodieHiveClient#updatePartitionsToTable()* is time
consuming while running HUDI on set of records which contains data for
large set of partitions. All it is doing is setting location for each
updated partition path. However,
*org.apache.hudi.hive.HoodieHiveClient#addPartitionsToTable()
*is taking care of adding new partitions to the table.

  1. For a given table whose base path doesn't change (usually it doesn't
  in production), why is *updatePartitionsToTable()* needed? Can you
  please throw some light on any case where this is needed?
  2. If it is required, can we do something to optimise the time consumed
  by this operation? Currently, the *Alter Statements* are executed one by
  one on each (partition, path) pair for every updated partition.



Regards,
Purushotham Pushpavanth
  

Re: [DISCUSS] Delay code freeze date for next release until Jan 19th (Sunday)

2020-01-15 Thread Balaji Varadarajan
 +1. Sunday should give breathing space to fix the blockers.
Balaji.V
On Wednesday, January 15, 2020, 06:50:28 AM PST, Vinoth Chandar 
 wrote:  
 
 +1 from me. I feel sunday is good in general, because the weekend gives
enough time for taking care of last minute things

On Wed, Jan 15, 2020 at 2:11 AM leesf  wrote:

> Dear Community,
>
> As discussed in the weekly sync meeting, we marked that there are 5
> blockers[1] that should be resolved before cutting the next release, and
> reviews of these PRs[2] are kindly welcome. Jan 15th is a bit tight to
> get them all landed. I propose to delay the code freeze date until Jan
> 19th, i.e. Sunday this week. What do you think? Thanks.
>
> Best,
> Leesf
>
> [1]
> https://issues.apache.org/jira/browse/HUDI-537
> https://issues.apache.org/jira/browse/HUDI-535
> https://issues.apache.org/jira/browse/HUDI-509
> https://issues.apache.org/jira/browse/HUDI-403
> https://issues.apache.org/jira/browse/HUDI-238
>
> [2]
> https://github.com/apache/incubator-hudi/pull/1226
> https://github.com/apache/incubator-hudi/pull/1212
> https://github.com/apache/incubator-hudi/pull/1229
>
  

Re: [DISCUSS] Hudi weekly community update

2020-01-06 Thread Balaji Varadarajan
 IIUC, this would look like a digest email summarizing discussion threads, jira 
and PR activities. 
+1
Balaji.V
On Sunday, January 5, 2020, 07:49:22 AM PST, leesf  
wrote:  
 
 Hi all,

Hudi has attracted more attention recently, and the community is developing
quickly as more and more developers participate.

To further expand the visibility of hudi and help more people follow
the progress of the community, I am wondering if we could publish a weekly
update for the hudi community, as flink[1] and pulsar[2] have done for a long time.
The update could include Development, Feature, Bug Fix and Event /
News sections, and I can help to organize and report the weekly update. Any thoughts?

[1]
https://lists.apache.org/thread.html/5c89c150a833ecfb1cf308a324d3a9bc9cd24ad525a1554eb81dd350%40%3Cdev.flink.apache.org%3E
[2] https://streamnative.io/weekly/
  

Re: Permession for contribute to Apache Hudi

2020-01-02 Thread Balaji Varadarajan
 Added your id. Looking forward to your contributions :) Welcome !!
Balaji.V
On Thursday, January 2, 2020, 05:44:51 PM PST, 谢雄 
 wrote:  
 
 Hi,

I want to contribute to Apache Hudi.
Would you please give me the contributor permission?
My JIRA ID is helloteddy.
  

Re: Contribution guidelines

2019-12-29 Thread Balaji Varadarajan
+1 Thanks for doing this Vinoth. Covers all aspects of contribution in
detail. Big +1 to code/RFC review etiquettes.

Balaji.V


On Sat, Dec 28, 2019 at 7:20 PM vino yang  wrote:

> Hi Vinoth,
>
> big +1 from my side.
>
> Thanks for spending time improving the contribution guidelines.
>
> It looks more detailed than before.
>
> With the growth of the Hudi community, we need to normalize the
> contribution guide to make the community more standardized.
>
> Best,
> Vino
>
> Vinoth Chandar wrote on Saturday, December 28, 2019 at 12:18 AM:
>
> > Hi all,
> >
> > In an effort to scale ourselves better as a community, I have been
> spending
> > a lot of time cleaning up JIRAs and also writing up explicitly, the
> > processes around contributions that we have been adopting..
> >
> > Please give this is a quick read
> > http://hudi.apache.org/contributing.html#life-of-a-contributor
> >
> > Thanks
> > Vinoth
> >
>


Re: Re:Re: [DISCUSS] RFC-12 : Efficient migration of large parquet tables to Apache Hudi

2019-12-15 Thread Balaji Varadarajan
 Hi Nicholas,
Once I get high level comments on the RFC,  we can have concrete subtasks 
around this. 
Balaji.V
On Saturday, December 14, 2019, 07:04:52 PM PST, 蒋晓峰 
 wrote:  
 
 Hi Balaji,
About the plan for "Efficient migration of large parquet tables to Apache Hudi", 
have you split the plan into multiple subtasks?
Thanks,
Nicholas


At 2019-12-14 00:18:12, "Vinoth Chandar"  wrote:
>+1 (per asf policy)
>
>+100 per my own excitement :) .. Happy to review this!
>
>On Fri, Dec 13, 2019 at 3:07 AM Balaji Varadarajan 
>wrote:
>
>> With Apache Hudi growing in popularity, one of the fundamental challenges
>> for users has been about efficiently migrating their historical datasets to
>> Apache Hudi. Apache Hudi maintains per record metadata to perform core
>> operations such as upserts and incremental pull. To take advantage of
>> Hudi’s upsert and incremental processing support, users would need to
>> rewrite their whole dataset to make it a Hudi table. This RFC provides a
>> mechanism to efficiently migrate their datasets without the need to rewrite
>> the entire dataset.
>>
>>  Please find the link for the RFC below.
>>
>>
>> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi
>>
>> Please review and let me know your thoughts.
>>
>> Thanks,
>> Balaji.V
>>
  

Re: [DISCUSS] Default partition path in TimestampBasedKeyGenerator

2019-12-13 Thread Balaji Varadarajan
Thanks Shahidha for the quick response.

Pratyaksh, I am ok with making the behavior consistent with other Key
generators. Please go ahead and submit a PR.

Thanks,
Balaji.V

On Thu, Dec 12, 2019 at 10:34 PM Pratyaksh Sharma 
wrote:

> Hi Shahida,
>
> Thank you for the clarification. Actually I was thinking about a corner
> case where we define the partition field and in some incoming record, the
> value for the corresponding defined partition field is not present. Such
> cases would result in an exception and the job will get killed.
>
> On Fri, Dec 13, 2019 at 11:02 AM Shahida Khan 
> wrote:
>
> > Hi Pratyaksh,
> >
> > As far as I understand, the basic requirement of TimestampBasedKeyGenerator
> > is converting the partition values into a time-based date format.
> > *e.g.* your column is a Unix timestamp which needs to be converted to a
> > date format like '2019/12/10'
> >
> > There will never be a scenario where you won't give partitions and use
> > TimestampBasedKeyGenerator.
> > Also, to use TimestampBasedKeyGenerator, mandatory configs need to be
> > defined which actually convert your field into partitions.
> > e.g.
> > hoodie.datasource.write.partitionpath.field= col_dtmDateTime
> >
> >
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator
> > hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
> > hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
> >
> > I hope this help!
> >
> > *Regards,*
> > *Shahida R. Khan*
> > *+91 9167538366*
> >
> >
> > On Thu, 12 Dec 2019 at 12:53, Pratyaksh Sharma 
> > wrote:
> >
> > > Hi,
> > >
> > > If value for configured partitionPathField is not present, we are
> > > defaulting to default partition path in all the key generator classes
> > > except TimestampBasedKeyGenerator. In TimestampBasedKeyGenerator, we
> > > directly throw exception if the value is null.
> > >
> > > I wanted to know if this behaviour is intentional. Ideally we should
> > handle
> > > such cases gracefully everywhere.
> > >
> >
>
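
A minimal sketch of the consistent behavior being proposed, i.e. falling back
to a default partition instead of throwing when the configured partition field
is null. The names below are illustrative and this is not the actual
TimestampBasedKeyGenerator code; "default" mirrors the default partition path
the other key generators fall back to:

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class TimestampPartitionFallbackSketch {

  private static final String DEFAULT_PARTITION_PATH = "default";
  private static final DateTimeFormatter FMT =
      DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneOffset.UTC);

  static String partitionPath(Long epochMillis) {
    if (epochMillis == null) {
      // Instead of throwing and killing the job, route the record to the
      // default partition, like the other key generators do.
      return DEFAULT_PARTITION_PATH;
    }
    return FMT.format(Instant.ofEpochMilli(epochMillis));
  }
}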


[DISCUSS] RFC-12 : Efficient migration of large parquet tables to Apache Hudi

2019-12-13 Thread Balaji Varadarajan
With Apache Hudi growing in popularity, one of the fundamental challenges
for users has been about efficiently migrating their historical datasets to
Apache Hudi. Apache Hudi maintains per record metadata to perform core
operations such as upserts and incremental pull. To take advantage of
Hudi’s upsert and incremental processing support, users would need to
rewrite their whole dataset to make it a Hudi table. This RFC provides a
mechanism to efficiently migrate their datasets without the need to rewrite
the entire dataset.

 Please find the link for the RFC below.

https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi

Please review and let me know your thoughts.

Thanks,
Balaji.V


[DISCUSS] Next Apache Release

2019-12-11 Thread Balaji Varadarajan
Hello all,

In the spirit of making Apache Hudi (incubating) releases at regular
cadence, we are starting this thread to kickstart the planning and
preparatory work for next release (0.5.1).

As discussed in yesterday's meeting, the current plan is to have a release
by end of Jan 2020.

As described in the release guide (see References), the first step would be
to identify the release manager for 0.5.1. This is a consensus-based decision
of the entire community. The only requirement is that the release manager
be an Apache Hudi Committer, as they have permissions to perform some of the
release manager's work. The committer would still need to work with the PPMC to
write to Apache release repositories.

There’s no formal process, no vote requirements, and no timing requirements
when identifying release manager. Any objections should be resolved by
consensus before starting the release.

In general, the community prefers to have a rotating set of 3-5 Release
Managers. Keeping a small core set of managers allows enough people to
build expertise in this area and improve processes over time, without
Release Managers needing to re-learn the processes for each release. That
said, if you are a committer interested in serving the community in this
way, please reach out to the community on the dev@ mailing list.

If any Hudi committer is interested in being the next release manager,
please reply to this email.

References:
Planned Tickets:   Jira Tickets

Release Guide:  Release Guide


Thanks,
Balaji.V
(On behalf of Apache Hudi PPMC)


Today's meeting cancelled

2019-12-03 Thread Balaji Varadarajan
I have cancelled the weekly (9 pm PST) meeting just now. I guess many of us
are traveling or on vacation. We will meet next week at the same time.

Balaji.V


Re: Issue while querying Hive table after updates

2019-11-20 Thread Balaji Varadarajan
Hi Gurudatt,

From the stack-trace, it looks like you are using Hive's CombineHiveInputFormat as
your default input format for the hive session. If your intention is to
use a combined input format, can you instead try setting the default (set
hive.input.format=) to
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat ?

https://github.com/apache/incubator-hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java

Thanks,
Balaji.V


On Mon, Nov 18, 2019 at 11:15 PM Gurudatt Kulkarni 
wrote:

> Hi Bhavani Sudha,
>
> >> Are you using spark sql or Hive query?
> This happens on all hive, hive on spark, spark sql.
>
> >> the table type ,
>This happens for both copy on write and merge on read.
>
> >> configs,
>
> hoodie.upsert.shuffle.parallelism=2
> hoodie.insert.shuffle.parallelism=2
> hoodie.bulkinsert.shuffle.parallelism=2
>
> # Key fields, for kafka example
> hoodie.datasource.write.storage.type=MERGE_ON_READ
> hoodie.datasource.write.recordkey.field=record_key
> hoodie.datasource.write.partitionpath.field=timestamp
> hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
>
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
>
> # schema provider configs
> hoodie.deltastreamer.schemaprovider.registry.url=
> http://schema-registry:8082/subjects/tbl_test-value/versions/latest
>
> hoodie.datasource.hive_sync.database=default
> hoodie.datasource.hive_sync.table=tbl_test
> hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hive-server:1
> hoodie.datasource.hive_sync.partition_fields=datestr
>
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor
>
> #Kafka props
> hoodie.deltastreamer.source.kafka.topic=tbl_test
> metadata.broker.list=kafka-1:9092,kafka-2:9092
> auto.offset.reset=smallest
> schema.registry.url=http://schema-registry:8082
>
> Spark Submit Command
>
> spark-submit --master yarn --deploy-mode cluster --name "Test Job Hoodie"
> --executor-memory 8g --driver-memory 2g --jars
>
> hdfs:///tmp/hudi/hudi-hive-bundle-0.5.1-SNAPSHOT.jar,hdfs:///tmp/hudi/hudi-spark-bundle-0.5.1-SNAPSHOT.jar,hdfs:///tmp/hudi/hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar
> --files hdfs:///tmp/hudi/hive-site.xml --class
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
> hdfs:///tmp/hudi/hudi-utilities-bundle-0.5.1-SNAPSHOT.jar
> --schemaprovider-class
> org.apache.hudi.utilities.schema.SchemaRegistryProvider --source-class
> org.apache.hudi.utilities.sources.AvroKafkaSource --source-ordering-field
> timestamp --target-base-path hdfs:///tmp/hoodie/tables/tbl_test
> --filter-dupes --target-table tbl_test --storage-type MERGE_ON_READ --props
> hdfs:///tmp/hudi/config.properties --enable-hive-sync
>
> Regards,
> Gurudatt
>
>
>
> On Tue, Nov 19, 2019 at 1:11 AM Bhavani Sudha 
> wrote:
>
> > Hi Gurudatt,
> >
> > Can you share more context on the table and the query. Are you using
> spark
> > sql or Hive query? the table type , etc? Also, if you can provide a small
> > snippet to reproduce with the configs that you used, it would be useful
> to
> > debug.
> >
> > Thanks,
> > Sudha
> >
> > On Sun, Nov 17, 2019 at 11:09 PM Gurudatt Kulkarni 
> > wrote:
> >
> > > Hi All,
> > >
> > > I am facing an issue where the aggregate query fails on partitions that
> > > have more than one parquet file. But if I run a select * query, it
> > > displays
> > > all results properly. Here's the stack trace of the error that I am
> > > getting. I checked the hdfs directory for the particular file and it
> > exists
> > > in the directory but some how hive is not able to find it after the
> > update.
> > >
> > > java.io.IOException: cannot find dir =
> > >
> > >
> >
> hdfs://hadoop-host:8020/tmp/hoodie/tables/tbl_test/2019/11/09/6f864d6d-40a6-4eb7-9ee8-6133a16aa9e5-0_59-22-256_20191115185447.parquet
> > > in pathToPartitionInfo: [hdfs:/tmp/hoodie/tables/tbl_test/2019/11/09]
> > >   at
> > > org.apache.hadoop.hive.ql.io
> > >
> >
> .HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:368)
> > >   at
> > > org.apache.hadoop.hive.ql.io
> > >
> >
> .HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:330)
> > >   at
> > > org.apache.hadoop.hive.ql.io
> > >
> >
> .CombineHiveInputFormat$CombineHiveInputSplit.(CombineHiveInputFormat.java:166)
> > >   at
> > > org.apache.hadoop.hive.ql.io
> > >
> .CombineHiveInputFormat.getCombineSplits(CombineHiveInputFormat.java:460)
> > >   at
> > > org.apache.hadoop.hive.ql.io
> > > .CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:547)
> > >   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> > >   at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> > >   at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> > >   at 

Re: Small clarification in Hoodie Cleaner flow

2019-11-19 Thread Balaji Varadarajan
I updated the FAQ section to set defaults correctly and add more
information related to this :
https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-WhatdoestheHudicleanerdo

The cleaner retention configuration is based on counts (number of commits
to be retained) with the assumption that users need to provide a
conservative number. The historical reason was that ingestion used to run
at a specific cadence (e.g. every 30 mins) with the norm being an ingestion
run taking less than 30 mins. With this model, it was simpler to represent
the configuration as a count of commits to approximate the retention time.

With delta-streamer continuous mode, ingestion is allowed to be scheduled
immediately after the previous run is scheduled. I think it would make
sense to introduce a time based retention. I have created a newbie ticket
for this : https://jira.apache.org/jira/browse/HUDI-349

Pratyaksh, in sum, if the defaults are too low, use a conservative number
based on the number of ingestion runs you see in your setup. The defaults
referenced in the code comments need to change (from 24 to 10)
(https://jira.apache.org/jira/browse/HUDI-350).

Thanks,
Balaji.V

On Tue, Nov 19, 2019 at 1:40 AM Pratyaksh Sharma 
wrote:

> Hi,
>
> We are assuming the following in getDeletePaths() method in cleaner flow in
> case of KEEP_LATEST_COMMITS policy -
>
> /**
> * Selects the versions for file for cleaning, such that it
> * 
> * - Leaves the latest version of the file untouched - For older versions, -
> It leaves all the commits untouched which
> * has occured in last config.getCleanerCommitsRetained()
> commits - It leaves ONE commit before this
> * window. We assume that the max(query execution time) == commit_batch_time
> * config.getCleanerCommitsRetained().
> * This is 12 hours by default. This is essential to leave the file used by
> the query thats running for the max time.
> * 
> * This provides the effect of having lookback into all changes that
> happened in the last X commits. (eg: if you
> * retain 24 commits, and commit batch time is 30 mins, then you have 12 hrs
> of lookback)
> * 
> * This policy is the default.
> */
>
> I want to understand the term commit_batch_time in this assumption and the
> assumption as a whole. As per my understanding, this term refers to the
> time taken in one iteration of DeltaSync end to end (which is hardly 7-8
> minutes in my case). If my understanding is correct, then this time will
> vary depending on the size of incoming RDD. So in that case, the time
> needed for the longest query is effectively a variable. So in that case
> what is a safe option to keep for the config
> config.getCleanerCommitsRetained().
>
> Basically I want to set the config
> config.getCleanerCommitsRetained() properly for my Hudi
> instance and hence I am trying to understand the assumption. Its default
> value is 10, I want to understand if this can be reduced further without
> any query failing.
>
> Please help me with this.
>
> Regards
> Pratyaksh
>
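
A rough worked example of the lookback math discussed in this thread; the
numbers are illustrative only, not recommendations:

public class CleanerRetentionSketch {
  public static void main(String[] args) {
    int commitsRetained = 24;         // hoodie.cleaner.commits.retained
    int minutesPerIngestionRun = 30;  // observed time between commits
    int lookbackHours = commitsRetained * minutesPerIngestionRun / 60;
    System.out.println("Queries up to ~" + lookbackHours + " hours stay safe"); // 12

    // If ingestion runs every ~8 minutes instead (e.g. DeltaStreamer continuous
    // mode), the same 24 commits only cover ~3 hours, so a longer-running query
    // could lose files to the cleaner; retain more commits in that case.
  }
}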


Re: [DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-12 Thread Balaji Varadarajan
+1 on the exporter tool idea.

On Mon, Nov 11, 2019 at 10:36 PM vino yang  wrote:

> Hi Shiyan,
>
> +1 for this proposal, Also, it looks like an exporter tool.
>
> @Vinoth Chandar   Any thoughts about where to place it?
>
> Best,
> Vino
>
> Vinoth Chandar wrote on Tuesday, November 12, 2019 at 8:58 AM:
>
> > We can wait for others to chime in as well. :)
> >
> > On Mon, Nov 11, 2019 at 4:37 PM Shiyan Xu 
> > wrote:
> >
> > > Yes, Vinoth, you're right that it is more of an exporter, which
> exports a
> > > snapshot from Hudi dataset.
> > >
> > > It should support MOR too; it shall just leverage on existing
> > > SnapshotCopier logic to find the latest file slices.
> > >
> > > So is it good to create a RFC for further discussion?
> > >
> > >
> > > On Mon, Nov 11, 2019 at 4:31 PM Vinoth Chandar 
> > wrote:
> > >
> > > > What you suggest sounds more like an `Exporter` tool?  I imagine you
> > will
> > > > support MOR as well?  +1 on the idea itself. It could be useful if
> > plain
> > > > parquet snapshot was generated as a backup.
> > > >
> > > > On Mon, Nov 11, 2019 at 4:21 PM Shiyan Xu <
> xu.shiyan.raym...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > The existing SnapshotCopier under Hudi Utilities is a Hudi-to-Hudi
> > copy
> > > > and
> > > > > primarily for backup purpose.
> > > > >
> > > > > I would like to start a RFC for a more generic Hudi snapshotter,
> > which
> > > > >
> > > > >- Supports existing SnapshotCopier features
> > > > >- Add option to export a Hudi dataset to plain parquet files
> > > > >   - output latest records via Spark dataframe writer
> > > > >   - remove Hudi metadata fields
> > > > >   - support custom repartition requirements
> > > > >
> > > > > Is this a good idea to start an RFC?
> > > > >
> > > > > Thank you.
> > > > >
> > > > > Regards,
> > > > > Raymond Xu
> > > > >
> > > >
> > >
> >
>
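
A hedged sketch of what such an exporter could do with the Spark dataframe
writer. The argument handling and class name are hypothetical; the dropped
columns are Hudi's standard meta fields:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HudiPlainParquetExportSketch {
  public static void main(String[] args) {
    String sourceGlob = args[0];   // Hudi base path plus a partition glob, if needed
    String targetPath = args[1];   // where the plain parquet copy should land
    int numOutputFiles = Integer.parseInt(args[2]);

    SparkSession spark = SparkSession.builder()
        .appName("hudi-plain-parquet-export").getOrCreate();

    // Read the latest snapshot of the Hudi dataset through the Spark datasource.
    Dataset<Row> snapshot = spark.read().format("org.apache.hudi").load(sourceGlob);

    // Drop Hudi's meta columns so the output is plain parquet.
    Dataset<Row> plain = snapshot.drop(
        "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
        "_hoodie_partition_path", "_hoodie_file_name");

    // Custom repartitioning before writing out.
    plain.repartition(numOutputFiles)
        .write().mode(SaveMode.Overwrite).parquet(targetPath);
  }
}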


Re: Migrate Existing DataFrame to Hudi DataSet

2019-11-12 Thread Balaji Varadarajan
Regarding (1), as the exception is happening inside the parquet reader
(outside hudi), can you use Spark 2.3 (instead of Spark 2.4, which brings
in a particular version of avro/parquet) to create and ingest a brand new
dataset and try it out? This would hopefully help isolate the issue.

Regarding (2), +1 on vinoth's suggestion. But if you are very sure, can you
see if there is any pattern around missing records ? Are the missing
records all in the same partition ?

Balaji.V


On Mon, Nov 11, 2019 at 1:30 PM Zhengxiang Pan  wrote:

> Hi
>
> The snippet for issue is here
> https://gist.github.com/zxpan/c5e989958d7688026f1679e53d2fca44
> 1) the write script simulates migrating an existing data frame (saved in
> /tmp/hudi-testing/inserts parquet)
> 2) the update script simulates an incremental update (saved in
> /tmp/hudi-testing/updates parquet) of the existing dataset; this is where the issue is
>
> See attached inserts parquet file and updates parquet file.
>
> Your help is appreciated.
> Thanks
>
>
> On Mon, Nov 11, 2019 at 11:23 AM Zhengxiang Pan  wrote:
>
>> Thanks for the quick response. Will try to create a snippet to reproduce the issue.
>>
>> For number 2), I am aware of the de-dup behavior.  pretty sure the
>> precombine key is unique.
>>
>> Thanks
>>
>> On Mon, Nov 11, 2019 at 8:46 AM Vinoth Chandar  wrote:
>>
>>> Hi,
>>>
>>> On 1, I am wondering if it's related to
>>> https://issues.apache.org/jira/browse/HUDI-83 , i.e. support for
>>> timestamps.
>>> If you can give us a small snippet to reproduce the problem that would be
>>> great.
>>>
>>> On 2, not sure what's going on; there are no size limitations. Please
>>> check
>>> if your precombine field and keys are correct. For e.g., if you pick a
>>> field/value that is the same in all records, then precombine will crunch them down to
>>> just 1 record, because that's what we ask it to do.
>>>
>>> On Sun, Nov 10, 2019 at 6:46 PM Zhengxiang Pan 
>>> wrote:
>>>
>>> > Hi,
>>> > I am new to Hudi; my first attempt is to convert my existing dataframe
>>> > to a Hudi managed dataset. I followed the Quickstart guide and Option (2) or (3)
>>> > in the Migration Guide. Got two issues:
>>> >
>>> > 1) Got the following error when Append mode afterward to upsert the
>>> data
>>> > org.apache.spark.SparkException: Job aborted due to stage failure:
>>> Task 4
>>> > in stage 23.0 failed 4 times, most recent failure: Lost task 4.3 in
>>> stage
>>> > 23.0 (TID 74, tkcnode49.alphonso.tv, executor 7):
>>> > org.apache.hudi.exception.HoodieUpsertException: Error upserting
>>> bucketType
>>> > UPDATE for partition :4
>>> > at
>>> >
>>> >
>>> org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:261)
>>> > at
>>> >
>>> >
>>> org.apache.hudi.HoodieWriteClient.lambda$upsertRecordsInternal$507693af$1(HoodieWriteClient.java:428)
>>> > at
>>> >
>>> >
>>> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>>> > at
>>> >
>>> >
>>> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>>> > at
>>> >
>>> >
>>> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>>> > at
>>> >
>>> >
>>> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>>> > at
>>> >
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>>> > at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>>> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>>> > at
>>> >
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>>> > at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>>> > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
>>> > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
>>> > at
>>> >
>>> >
>>> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
>>> > at
>>> >
>>> >
>>> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>>> > at
>>> > org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>>> > at
>>> >
>>> >
>>> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
>>> > at
>>> >
>>> >
>>> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
>>> > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
>>> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
>>> > at
>>> >
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>>> > at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>>> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>>> > at
>>> > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>>> > 

Re: [DISCUSS] Simplification of terminologies

2019-11-11 Thread Balaji Varadarajan
Agree with all 3 changes. The naming now looks more consistent than
before. +1 on them.

Depending on whether we are renaming the input formats for (1) and (2), this
could require some migration steps for existing tables.

Balaji.V


On Mon, Nov 11, 2019 at 7:38 PM vino yang  wrote:

> Hi Vinoth,
>
> Thanks for bringing these proposals.
>
> +1 on all three. Especially, big +1 on the third renaming proposal.
>
> When I was a newbie, the "COPY_ON_WRITE" term confused me a lot. It easily
> misleads users with the "copy" term, and makes users compare it with the
> `CopyOnWriteArrayList` data structure provided by the JDK and with copy-on-write
> file systems.
>
> Best,
> Vino
>
>
> Bhavani Sudha wrote on Tuesday, November 12, 2019 at 9:05 AM:
>
> > +1 on all three rename proposals. I think this would make the concepts
> > super easy to follow for new users.
> >
> > If changing [3] seems to be a stretch, we should definitely do [1] & [2]
> at
> > the least IMO. I will be glad to help out on the renames to whatever
> extent
> > possible should the Hudi community incline to pursue this.
> >
> > Thanks,
> > Sudha
> >
> >
> >
> > On Mon, Nov 11, 2019 at 3:46 PM Vinoth Chandar 
> wrote:
> >
> > > Hello all,
> > >
> > > I wanted to raise an important topic with the community around whether
> we
> > > should rename some of our terminologies in code/docs to be more
> > > user-friendly and understandable..
> > >
> > > Let me also provide some context for each, since I am probably guilty
> of
> > > introducing most of them in the first place :).
> > >
> > > *1. Rename "views" to "query" : *Instead of saying incremental view or
> > > read-optimized view, talk about them as "incremental query" and
> > > "read-optimized query". The term "view" is very technical, and what I
> was
> > > trying to convey was that we ingest/store the data once and expose
> views
> > on
> > > top. But new users (atleast half dozen of them to me) tend to confuse
> > this
> > > with views/materialized views found in databases. Almost always we talk
> > > about views mostly in terms of expected behavior for a query on the
> > view. I
> > > am proposing to just call these different query types since its a more
> > > universally accepted terminology and IMO clearer.
> > >
> > > *2. Rename "Read-Optimized/Realtime" views to Snapshot views + Have
> > > Read-Optimized view only for MOR storage :* This one is probably the
> > > trickiest. Hudi was always designed with MOR in mind, even as we were
> > > working on COW storage and consequently we named the pure parquet
> backed
> > > view as Read-Optimized, hoping to name parquet + avro based view as
> > > Write-Optimized. However, we opted to name it Realtime to emphasize the
> > > data freshness aspect. In retrospect, the views should have not been
> > named
> > > after their performance characteristics but rather the classes of
> queries
> > > done on them and guarantees for those (point above #1). Moreover, once
> we
> > > have parquet embedded into the log format, then the tradeoffs may not
> be
> > > the same anyways.
> > >
> > > So combining with the renaming proposed in #1, we would end up with the
> > > following..
> > >
> > > Copy-On-Write :
> > > [Old]  Read-Optimized View =>  [New] Snapshot Query
> > > [Old]  Incremental View => [New] Incremental Query
> > >
> > > Merge-On-Read:
> > > [Old] Realtime View => [New] Snapshot Query
> > > [Old] Incremental View => [New] Incremental Query
> > > [Old] ReadOptimzied View => [New] Read-Optimized Query (since it is
> read
> > > optimized compared to Snapshot query always, at the cost of staler
> data)
> > >
> > > Both changes #1 & #2 could be simpler changes to just code references,
> > docs
> > > and configs.. we can support both string for sometime and deprecate
> > > eventually since queries are stateless.
> > >
> > > *3. Rename COPY_ON_WRITE to MERGE_ON_WRITE :* Name originated since the
> > > design was very similar to https://en.wikipedia.org/wiki/Copy-on-write
> > > filesystems
> > > & snapshotting and we once hoped to push some of this logic into the
> > > storage itself, all in vain. but the name stuck, even though once we
> had
> > > MERGE_ON_READ the focus was often on merge costs etc, which the name
> > > COPY_ON_WRITE does not convey directly. I don't feel very strong about
> > this
> > > and there is also cost to changing this since its persisted inside
> > > hoodie.properties and we will support both strings internally in code
> for
> > > backwards compatibility anyway
> > >
> > > Naming something is very hard (yes, try :)).I believe these changes
> will
> > > make the project simpler to understand for everyone out there. We also
> > have
> > > tons of new people here, so I am also happy to let go, if its already
> > clear
> > > :)
> > >
> > > Please use the bullet number when you share your feedback so we know
> what
> > > the discussion is about.
> > >
> > > Thanks
> > > Vinoth
> > >
> >
>


Re: DISCUSS RFC 7 - Point in time queries on Hudi table (Time-Travel)

2019-11-11 Thread Balaji Varadarajan
+1.  This would be a powerful feature which would open up use-cases
requiring repeatable query results.

Balaji.V


On Mon, Nov 11, 2019 at 8:12 AM nishith agarwal  wrote:

> Folks,
>
> Starting a discussion thread for enabling time-travel for Hudi datasets.
> Please provide feedback on the RFC here
> <
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=135859842
> >
> .
>
> Thanks,
> Nishith
>


Re: [Discuss] Feedback on Hudi improvements

2019-11-08 Thread Balaji Varadarajan
Brandon,

Great initiative and thoughts. Thanks for writing a detailed description of
what you are looking to achieve.

Here are some of my  comments/thoughts:

   1. HUDI-326 : There is some work that is happening in this direction.
   But, we should be able to collaborate on this. Siva has opened a PR (
   https://github.com/apache/incubator-hudi/pull/1004) to support delete
   using only HoodieKey (partitionPath, recordKey). Technically, we can
   support an interface for delete with only recordKeys if the index is of
   type global (the current implementation supports HoodieGlobalBloomIndex).
   Within Uber, we use HBase as the global Hudi index to support partition
   agnostic record-key lookups. In other words, we can have 2 flavors of
   delete APIs - one with input being an RDD of HoodieKeys (works for all index
   types) and another with input being an RDD of record keys that works only with a
   global index (see the sketch below). Our vision is to support an external
   clustered index (global) as the de-facto index that resides in DFS along with
   the dataset.
   2. HUDI-327 :  IIUC, Just like ComplexKeyGenerator, the new key
   generator would need composite keys (in this case primary and secondary for
   breaking the "null" tie ). Are you concerned about the record-key footprint
   for each key when using the key generated by ComplexKeyGenerator? In that
   case, makes sense to me. Otherwise, ComplexKeyGenerator should be able to
   handle cases when some component of it is null. right ?
   3. As for HUDI-83, at least on the write side, we have tied this to the
   spark-2.4 upgrade. There is ongoing work happening in this regard. I will
   request the folks who are working on this to provide status. Last I know, we
   were running into some test failures when doing this upgrade. But yes, as
   this is a massive upgrade, we would need your help in reviewing, debugging
   and testing this change  :)

Others, Thoughts ?

Thanks,
Balaji.V
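
A hedged sketch of the two delete flavors mentioned in point 1 above. The
interface and method names are illustrative only, not the actual API being
built in the referenced PR:

import org.apache.hudi.common.model.HoodieKey;
import org.apache.spark.api.java.JavaRDD;

public interface DeleteApiSketch {

  // Flavor 1: caller supplies (recordKey, partitionPath) pairs.
  // Works with any index type, since the partition is already known.
  void delete(JavaRDD<HoodieKey> keys, String instantTime);

  // Flavor 2: caller supplies record keys only. Only valid when the configured
  // index is global (e.g. a global bloom or HBase index), which can locate a
  // key without knowing its partition path.
  void deleteByRecordKeys(JavaRDD<String> recordKeys, String instantTime);
}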

On Fri, Nov 8, 2019 at 2:49 PM Scheller, Brandon
 wrote:

> Hi Hudi community,
>
> We at AWS EMR are interested in starting work on a few different usability
> improvements for Hudi and we’re interested to hear your feedback.
>
> Here are some of our ideas:
> https://issues.apache.org/jira/browse/HUDI-326
> https://issues.apache.org/jira/browse/HUDI-327
>
> Additionally, we were hoping to help drive:
> https://issues.apache.org/jira/browse/HUDI-83 and its associated Hive
> Jira: https://issues.apache.org/jira/browse/HIVE-4
>
> I am looking forward to improving Hudi with you all. And feel free to let
> us know if there is anything specific, you’d like us to look at.
>
> Thanks,
> Brandon
>


New Committer : bhavanisudha

2019-11-07 Thread Balaji Varadarajan
Hello Apache Hudi Community,

The Podling Project Management Committee (PPMC) for Apache Hudi
(Incubating) has invited Bhavani Sudha Saktheeswaran to become a committer
and we are
pleased to announce that she has accepted.

Bhavani Sudha has made a great impact by fixing critical issues in hudi,
taking ownership in debugging presto integration issues and
answering user queries.

She has also improved the overall usability of Hudi by fixing the docker demo
and simplifying the Quickstart pages. Bhavani Sudha is also active on the dev email
list and is helping our community by answering questions and reviewing
code.

Congratulations Bhavani Sudha !!

On behalf of PPMC,
Balaji.V


Re: [Discuss] Convenient time for weekly sync meeting

2019-11-06 Thread Balaji Varadarajan
Thanks Sudha. The following times work for me :

Mon, Tue, Thursday - 9 p.m to 12 a.m PST
Wed - 5:00 to 6:00 am and 9:30 p.m to 12 a.m PST




On Wed, Nov 6, 2019 at 12:31 PM Vinoth Chandar  wrote:

> Interested.
>
> Mon-Thu  5AM-6:30AM PST
> Mon-Thu  9PM-10:30PM PST
>
>
> On Wed, Nov 6, 2019 at 12:28 PM Bhavani Sudha 
> wrote:
>
> > Hello all,
> >
> > Currently the weekly sync meeting is scheduled to run on Tuesdays from
> 9pm
> > PST to 10 pm PST. Given our users are from multiple time zones, we can
> try
> > to see if there is any overlapping time that works best. Please chime in
> on
> > what would be a suitable time for you if you are interested in attending
> > the weekly meetings.
> >
> > Thanks,
> > Sudha
> >
>


Re: [Discuss] Creation of database in Hive

2019-11-06 Thread Balaji Varadarajan
I have a different opinion on this. Usually, in production deployments
(at least the ones I am aware of), databases are generally managed at the
org/group level. Privacy policies like ACLs are usually done at the database
level and need first-level management by admins. With such a setup,
it feels safer to let database creation be done through a separate process and
let hudi hive sync only alter/create tables (the current setup).

Open to hearing others' thoughts.

Regards,
Balaji.V

On Wed, Nov 6, 2019 at 12:01 PM Bhavani Sudha 
wrote:

> +1 I think we should create db if it does not exist.
>
> On Tue, Nov 5, 2019 at 11:08 PM Pratyaksh Sharma 
> wrote:
>
> > Hi,
> >
> > While doing hive sync using HiveSyncTool, we first check if the target
> > table exists in hive. If not, we try to create it. However in this flow,
> if
> > the database itself does not exist, we do not create the database before
> > creating the hive table, which results in an exception like the one below -
> >
> > org.apache.hive.service.cli.HiveSQLException: Error while compiling
> > statement: FAILED: SemanticException [Error 10072]: Database does not
> > exist: test_db
> > at
> >
> >
> org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
> > at
> >
> >
> org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199)
> > at
> >
> >
> org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:262)
> > at
> org.apache.hive.service.cli.operation.Operation.run(Operation.java:247)
> > at
> >
> >
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:575)
> > at
> >
> >
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:561)
> > at sun.reflect.GeneratedMethodAccessor108.invoke(Unknown Source)
> > at
> >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:498)
> > at
> >
> >
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
> > at
> >
> >
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
> > at
> >
> >
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:422)
> > at
> >
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> > at
> >
> >
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
> > at com.sun.proxy.$Proxy68.executeStatementAsync(Unknown Source)
> > at
> >
> >
> org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:315)
> > at
> >
> >
> org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:566)
> > at
> >
> >
> org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1557)
> > at
> >
> >
> org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1542)
> > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> > at
> >
> >
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
> > at
> >
> >
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
> > ... 3 more
> > Caused by: org.apache.hadoop.hive.ql.parse.SemanticException: Database
> does
> > not exist: test_db
> > at
> >
> >
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.getDatabase(BaseSemanticAnalyzer.java:2154)
> >
> >
> > So just wanted to discuss if we should try creating the database first in
> > the above case, using a query like -
> >
> > CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database_name>
> >
>
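
A minimal sketch of the "create the database first" behavior being discussed,
using plain Hive JDBC rather than the actual HiveSyncTool code; the JDBC URL
and database name below are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateDatabaseIfMissingSketch {
  public static void main(String[] args) throws Exception {
    // Assumes the Hive JDBC driver is on the classpath.
    try (Connection conn =
             DriverManager.getConnection("jdbc:hive2://hive-server:10000", "", "");
         Statement stmt = conn.createStatement()) {
      // Issue this before attempting to create or alter the target table.
      stmt.execute("CREATE DATABASE IF NOT EXISTS test_db");
    }
  }
}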


Reg: Apache Hudi Community Weekly Sync

2019-11-05 Thread Balaji Varadarajan
Hi Hudi Community,

We will be conducting a weekly conference call to better coordinate and
tackle open issues, projects and RFCs, and to run scrum for Apache Hudi. This will
help us manage all open projects more efficiently.

The community meeting runs weekly on Tuesdays 9pm - 10pm PST. The meeting
is open to anyone in the community. If you own any tasks or are interested
in any features that are actively being worked on, you are highly
encouraged to attend this weekly sync.

The details including agenda are available in
https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+Community+Weekly+Sync

Regards,
Balaji.V


Re: DISCUSS RFC 6 - Add indexing support to the log file

2019-10-30 Thread Balaji Varadarajan
 Thanks Vinoth for proposing a clean and extendable design. The overall design
looks great. Another rollout option is to only use the consolidated log index for
index lookup if the latest "valid" log block has been written in the new format. If
that is not the case, we can fall back to scanning previous log blocks for index
lookup.
Balaji.VOn Tuesday, October 29, 2019, 07:52:00 PM PDT, Bhavani Sudha 
 wrote:  
 
  I vote for the second option. Also, it can give time to analyze how to
deal with backwards compatibility. I'll take a look at the RFC later
tonight and get back.


On Sun, Oct 27, 2019 at 10:24 AM Vinoth Chandar  wrote:

> One issue on which I have some open questions myself:
>
> Is it ok to assume the log will have old data block versions, followed by new
> data block versions? For e.g., if we roll out new code and then revert back, then
> there could be an arbitrary mix of new and old data blocks. Handling this
> might make design/code fairly complex. Alternatively we can keep it simple
> for now, disable by default and only advise to enable for new tables or
> when hudi version is stable
>
>
> On Sun, Oct 27, 2019 at 12:13 AM Vinoth Chandar  wrote:
>
> >
> >
> https://cwiki.apache.org/confluence/display/HUDI/RFC-6+Add+indexing+support+to+the+log+file
> >
> >
> > Feedback welcome, on this RFC tackling HUDI-86
> >
>
  
