[GitHub] [hudi] hudi-bot commented on pull request #8294: [HUDI-5985] Fix orc version for spark 3.3

2023-03-25 Thread via GitHub


hudi-bot commented on PR #8294:
URL: https://github.com/apache/hudi/pull/8294#issuecomment-1484008529

   
   ## CI report:
   
   * 334fd0b071d206ef1069009fafbc2aafcfc76294 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15923)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8293: [HUDI-5984] Enable FT for spark3.x versions in CI

2023-03-25 Thread via GitHub


hudi-bot commented on PR #8293:
URL: https://github.com/apache/hudi/pull/8293#issuecomment-1484008510

   
   ## CI report:
   
   * d084fb8ec1b25361625aaa08f654a4ff9a6a4079 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15921)
   * c650d31fcb10ccd38f6108c841a1df2bb22ee940 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15922)
 
   
   



[GitHub] [hudi] hudi-bot commented on pull request #8294: [HUDI-5985] Fix orc version for spark 3.3

2023-03-25 Thread via GitHub


hudi-bot commented on PR #8294:
URL: https://github.com/apache/hudi/pull/8294#issuecomment-1484007540

   
   ## CI report:
   
   * 334fd0b071d206ef1069009fafbc2aafcfc76294 UNKNOWN
   
   



[GitHub] [hudi] hudi-bot commented on pull request #8293: [HUDI-5984] Enable FT for spark3.x versions in CI

2023-03-25 Thread via GitHub


hudi-bot commented on PR #8293:
URL: https://github.com/apache/hudi/pull/8293#issuecomment-1484007532

   
   ## CI report:
   
   * d084fb8ec1b25361625aaa08f654a4ff9a6a4079 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15921)
 
   * c650d31fcb10ccd38f6108c841a1df2bb22ee940 UNKNOWN
   
   



[jira] [Updated] (HUDI-5985) Fix incorrect orc version with spark 3.3

2023-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5985:
-
Labels: pull-request-available  (was: )

> Fix incorrect orc version with spark 3.3
> 
>
> Key: HUDI-5985
> URL: https://issues.apache.org/jira/browse/HUDI-5985
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.12.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] xushiyan opened a new pull request, #8294: [HUDI-5985] Fix orc version for spark 3.3

2023-03-25 Thread via GitHub


xushiyan opened a new pull request, #8294:
URL: https://github.com/apache/hudi/pull/8294

   ### Change Logs
   
   Fix orc version based on spark 3.x latest orc version.
   
   Note:
   - spark 3.3.0 uses orc 1.7.4
   - spark 3.3.1 uses orc 1.7.6
   - spark 3.3.2 uses orc 1.7.8
   
   ### Impact
   
   orc patch version upgrade may affect orc compatibility.
   
   ### Risk level
   
   Low
   
   ### Documentation Update
   
   NA
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hudi-bot commented on pull request #8293: [HUDI-5984] Enable FT for spark3.x versions in CI

2023-03-25 Thread via GitHub


hudi-bot commented on PR #8293:
URL: https://github.com/apache/hudi/pull/8293#issuecomment-1483999684

   
   ## CI report:
   
   * d084fb8ec1b25361625aaa08f654a4ff9a6a4079 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15921)
 
   
   



[jira] [Updated] (HUDI-5985) Fix incorrect orc version with spark 3.3

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5985:
-
Status: In Progress  (was: Open)

> Fix incorrect orc version with spark 3.3
> 
>
> Key: HUDI-5985
> URL: https://issues.apache.org/jira/browse/HUDI-5985
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.13.1, 0.12.3
>
>






[jira] [Created] (HUDI-5985) Fix incorrect orc version with spark 3.3

2023-03-25 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-5985:


 Summary: Fix incorrect orc version with spark 3.3
 Key: HUDI-5985
 URL: https://issues.apache.org/jira/browse/HUDI-5985
 Project: Apache Hudi
  Issue Type: Improvement
  Components: dependencies
Reporter: Raymond Xu
Assignee: Raymond Xu
 Fix For: 0.13.1, 0.12.3








[jira] [Updated] (HUDI-5985) Fix incorrect orc version with spark 3.3

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5985:
-
Sprint: Sprint 2023-03-14

> Fix incorrect orc version with spark 3.3
> 
>
> Key: HUDI-5985
> URL: https://issues.apache.org/jira/browse/HUDI-5985
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.13.1, 0.12.3
>
>






[GitHub] [hudi] hudi-bot commented on pull request #8293: [HUDI-5984] Enable FT for spark3.x versions in CI

2023-03-25 Thread via GitHub


hudi-bot commented on PR #8293:
URL: https://github.com/apache/hudi/pull/8293#issuecomment-1483998794

   
   ## CI report:
   
   * d084fb8ec1b25361625aaa08f654a4ff9a6a4079 UNKNOWN
   
   



[jira] [Updated] (HUDI-5984) Enable FT coverage for all Spark 3 versions in GH actions

2023-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5984:
-
Labels: pull-request-available  (was: )

> Enable FT coverage for all Spark 3 versions in GH actions
> -
>
> Key: HUDI-5984
> URL: https://issues.apache.org/jira/browse/HUDI-5984
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.12.3
>
>






[GitHub] [hudi] xushiyan opened a new pull request, #8293: [HUDI-5984] Enable FT for spark3.x versions in CI

2023-03-25 Thread via GitHub


xushiyan opened a new pull request, #8293:
URL: https://github.com/apache/hudi/pull/8293

   ### Change Logs
   
   Enable FT coverage in GH actions CI.
   
   ### Impact
   
   More test coverage.
   
   ### Risk level
   
   NA
   
   ### Documentation Update
   
   NA
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Assigned] (HUDI-5984) Enable FT coverage for all Spark 3 versions in GH actions

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-5984:


Assignee: Raymond Xu

> Enable FT coverage for all Spark 3 versions in GH actions
> -
>
> Key: HUDI-5984
> URL: https://issues.apache.org/jira/browse/HUDI-5984
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.13.1, 0.12.3
>
>






[jira] [Updated] (HUDI-5984) Enable FT coverage for all Spark 3 versions in GH actions

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5984:
-
Status: In Progress  (was: Open)

> Enable FT coverage for all Spark 3 versions in GH actions
> -
>
> Key: HUDI-5984
> URL: https://issues.apache.org/jira/browse/HUDI-5984
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.13.1, 0.12.3
>
>






[jira] [Updated] (HUDI-5569) Files written by first commit/delta commit if it failed is detected as valid data files

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5569:
-
Sprint: 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28, Sprint 2023-03-14  (was: 0.13.0 Final 
Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28)

> Files written by first commit/delta commit if it failed is detected as valid 
> data files
> ---
>
> Key: HUDI-5569
> URL: https://issues.apache.org/jira/browse/HUDI-5569
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> We have a method in HoodieFileGroup that detects whether a file group is 
> committed or not. If the timeline is as follows:
> c1.inflight
> c2.complete
> c3.complete
>  
> then checking c1 returns true, even though c1 is still inflight.
> HoodieFileGroup.java
> {code:java}
> /**
>  * A FileSlice is considered committed, if one of the following is true - There is a committed data file - There are
>  * some log files, that are based off a commit or delta commit.
>  */
> private boolean isFileSliceCommitted(FileSlice slice) {
>   if (!compareTimestamps(slice.getBaseInstantTime(), LESSER_THAN_OR_EQUALS, lastInstant.get().getTimestamp())) {
>     return false;
>   }
>   return timeline.containsOrBeforeTimelineStarts(slice.getBaseInstantTime());
> }
> {code}
> HoodieDefaultTimeline:
> {code:java}
> @Override
> public boolean containsOrBeforeTimelineStarts(String instant) {
>   return getInstantsAsStream().anyMatch(s -> s.getTimestamp().equals(instant))
>       || isBeforeTimelineStarts(instant);
> }
> {code}
>  
> This needs to be fixed. 
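
The behaviour described in the issue can be reproduced with a small stand-alone model. This is only a sketch under stated assumptions: the class `TimelineCheckSketch` and its list-of-strings timeline are hypothetical simplifications, not Hudi code, and it assumes the timeline handed to the check holds only completed instants, so the inflight `c1` would pass through the "before timeline starts" clause rather than the timestamp match. The issue text does not say which clause is responsible.

```java
import java.util.List;

public class TimelineCheckSketch {
    // Hypothetical simplification: the timeline is a sorted list of completed
    // instant timestamps; an instant "before the timeline starts" is one that
    // sorts before the first completed instant.
    static boolean isBeforeTimelineStarts(List<String> completed, String instant) {
        return !completed.isEmpty() && instant.compareTo(completed.get(0)) < 0;
    }

    // Same shape as the quoted containsOrBeforeTimelineStarts: a timestamp
    // match OR the instant being older than the start of the timeline.
    static boolean containsOrBeforeTimelineStarts(List<String> completed, String instant) {
        return completed.stream().anyMatch(s -> s.equals(instant))
            || isBeforeTimelineStarts(completed, instant);
    }

    public static void main(String[] args) {
        // c1 is inflight (assumed absent from the completed timeline);
        // c2 and c3 are complete.
        List<String> completed = List.of("c2", "c3");
        System.out.println(containsOrBeforeTimelineStarts(completed, "c1"));
    }
}
```

In this model the check cannot tell a failed first commit apart from an instant that was archived off the front of the timeline, which is one plausible reading of why files from `c1` are treated as valid.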





[jira] [Updated] (HUDI-3601) Support multi-arch builds in docker setup

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3601:
-
Sprint: 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 2022/11/01, 
2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 
0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28, 
Sprint 2023-03-14  (was: 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, 
Sprint 2023-02-28)

> Support multi-arch builds in docker setup
> -
>
> Key: HUDI-3601
> URL: https://issues.apache.org/jira/browse/HUDI-3601
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: dependencies
>Reporter: Sagar Sumit
>Priority: Critical
>  Labels: pull-request-available
>
> Refer to [https://github.com/apache/hudi/issues/4985]
> Essentially, our current docker demo runs on the linux/amd64 platform but not 
> on arm64. We should support multi-arch builds in a fully automated manner. 
> Ideally, the setup script would simply accept a parameter:
> {code:bash}
> docker/setup_demo.sh --platform linux/arm64
> {code}





[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5238:
-
Sprint: 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, 
Sprint 2023-02-28, Sprint 2023-03-14  (was: 2022/11/15, 2022/11/29, 2022/12/12, 
0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 
2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28)

> Hudi throwing "PipeBroken" exception during Merging on GCS
> --
>
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Originally reported at [https://github.com/apache/hudi/issues/7234]
> ---
>  
> Root cause:
> Basically, the reason it’s failing is the following:
>  # GCS uses PipeInputStream/PipeOutputStream, which form the reading/writing 
> ends of the “pipe” it uses for unidirectional communication between threads
>  # PipeInputStream (for whatever reason) remembers the thread that actually 
> wrote into the pipe
>  # In BoundedInMemoryQueue we’re bootstrapping new executors (read: threads) 
> for reading and _writing_ (it’s only used in HoodieMergeHandle, and in 
> bulk-insert)
>  # When we’re done writing in HoodieMergeHelper, we shut down BIMQ *first*, 
> then the HoodieMergeHandle, and that’s exactly why it fails
>  
> The issue was introduced in [https://github.com/apache/hudi/pull/4264/files]
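
The thread-tracking behaviour in the second point can be seen with the JDK's own pipe classes. The sketch below assumes that the streams the issue names behave like `java.io.PipedInputStream` and `java.io.PipedOutputStream`; the class name `PipeWriterThreadSketch` is hypothetical and nothing here is Hudi or GCS connector code. A writer thread puts one byte into the pipe and exits without closing it, after which the reader can drain the buffered byte but fails on the next read.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;

public class PipeWriterThreadSketch {
    // Returns true if a read fails once the writer thread has exited
    // without closing its end of the pipe.
    static boolean readFailsAfterWriterExits() {
        try {
            PipedOutputStream out = new PipedOutputStream();
            PipedInputStream in = new PipedInputStream(out);

            Thread writer = new Thread(() -> {
                try {
                    out.write(42); // the pipe records this thread as its write side
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                // The thread ends here without calling out.close().
            });
            writer.start();
            writer.join();

            in.read(); // the buffered byte is still readable
            try {
                in.read(); // buffer empty, writer thread dead, pipe never closed
                return false;
            } catch (IOException expected) {
                return true;
            }
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readFailsAfterWriterExits());
    }
}
```

If the pipe in question works the same way, shutting down the executor that owns the writing thread before the handle has finished and closed its stream would leave the pipe in exactly this state, which matches the ordering problem described in the last point.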





[jira] [Updated] (HUDI-5471) Make dep tree change part of CI

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5471:
-
Sprint: Sprint 2023-02-14, Sprint 2023-02-28, Sprint 2023-03-14  (was: 
Sprint 2023-02-14, Sprint 2023-02-28)

> Make dep tree change part of CI
> ---
>
> Key: HUDI-5471
> URL: https://issues.apache.org/jira/browse/HUDI-5471
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Raymond Xu
>Priority: Major
>






[jira] [Updated] (HUDI-5849) Sync hudi configs to catalog table

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5849:
-
Sprint: Sprint 2023-02-14, Sprint 2023-02-28, Sprint 2023-03-14  (was: 
Sprint 2023-02-14, Sprint 2023-02-28)

> Sync hudi configs to catalog table
> --
>
> Key: HUDI-5849
> URL: https://issues.apache.org/jira/browse/HUDI-5849
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Update hudi configs to meta sync catalogs like Glue catalog, HMS and datahub





[jira] [Updated] (HUDI-5755) Add detailed description of OCC early conflict detection to concurrency control docs

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5755:
-
Sprint: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28, Sprint 
2023-03-14  (was: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28)

> Add detailed description of OCC early conflict detection to concurrency 
> control docs
> 
>
> Key: HUDI-5755
> URL: https://issues.apache.org/jira/browse/HUDI-5755
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>






[jira] [Updated] (HUDI-5649) Unify all the loggers to slf4j

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5649:
-
Sprint: 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28, Sprint 2023-03-14  (was: 0.13.0 Final Sprint 3, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28)

> Unify all the loggers to slf4j
> --
>
> Key: HUDI-5649
> URL: https://issues.apache.org/jira/browse/HUDI-5649
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (HUDI-5685) Fix performance gap in Bulk Insert row-writing path with enabled de-duplication

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5685:
-
Sprint: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28, Sprint 
2023-03-14  (was: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28)

> Fix performance gap in Bulk Insert row-writing path with enabled 
> de-duplication
> ---
>
> Key: HUDI-5685
> URL: https://issues.apache.org/jira/browse/HUDI-5685
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently, if {{hoodie.combine.before.insert}} is set to true and 
> {{hoodie.bulkinsert.sort.mode}} is set to {{NONE}}, Bulk Insert row-writing 
> performance degrades considerably for the following reasons:
>  * During de-duplication (within {{dedupRows}}), records in the incoming RDD 
> are reshuffled (by Spark's default {{HashPartitioner}}) based on 
> {{(partition-path, record-key)}} into N partitions
>  * When {{BulkInsertSortMode.NONE}} is used as the partitioner, no 
> re-partitioning is performed, and therefore each Spark task might be 
> writing into M table partitions
>  * This in turn causes an explosion in the number of (small) files created, 
> hurting performance and the table's layout
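
The file-count effect described above can be illustrated with a toy simulation. This is a sketch, not Hudi or Spark code: the class `SmallFilesSketch`, the synthetic partition paths and record keys, and the use of `Objects.hash` as a stand-in for hashing the `(partition-path, record-key)` pair are all assumptions made for illustration. It counts distinct (task, table partition) pairs, which is a lower bound on the number of files written when every task writes at least one file per table partition it touches.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class SmallFilesSketch {
    // Counts distinct (task, table partition) pairs for a synthetic data set.
    // hashOnRecordKey = true models shuffling on (partition-path, record-key);
    // hashOnRecordKey = false models shuffling on the partition path alone.
    static int countFiles(int numTasks, int numTablePartitions,
                          int recordsPerPartition, boolean hashOnRecordKey) {
        Set<String> taskPartitionPairs = new HashSet<>();
        for (int p = 0; p < numTablePartitions; p++) {
            String partitionPath = "2023/03/" + p;
            for (int r = 0; r < recordsPerPartition; r++) {
                String recordKey = "key-" + p + "-" + r;
                int hash = hashOnRecordKey
                    ? Objects.hash(partitionPath, recordKey)
                    : partitionPath.hashCode();
                // Non-negative modulo, as a hash partitioner would do.
                int task = Math.floorMod(hash, numTasks);
                taskPartitionPairs.add(task + "|" + partitionPath);
            }
        }
        return taskPartitionPairs.size();
    }

    public static void main(String[] args) {
        System.out.println("shuffle on (partition-path, record-key): "
            + countFiles(8, 4, 100, true));
        System.out.println("shuffle on partition-path only: "
            + countFiles(8, 4, 100, false));
    }
}
```

When the shuffle key includes the record key, records of one table partition spread across many tasks, so the pair count can approach N × M; keying on the partition path alone keeps it at M.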





[jira] [Updated] (HUDI-3088) Make Spark 3 the default profile for build and test

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3088:
-
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, 
Hudi-Sprint-Feb-14, Hudi-Sprint-Feb-22, Hudi-Sprint-Mar-01, Hudi-Sprint-Mar-07, 
Hudi-Sprint-Mar-14, 2022/11/29, 2022/12/12, Sprint 2023-02-14, Sprint 
2023-02-28, Sprint 2023-03-14  (was: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14, Hudi-Sprint-Feb-22, Hudi-Sprint-Mar-01, 
Hudi-Sprint-Mar-07, Hudi-Sprint-Mar-14, 2022/11/29, 2022/12/12, Sprint 
2023-02-14, Sprint 2023-02-28)

> Make Spark 3 the default profile for build and test
> ---
>
> Key: HUDI-3088
> URL: https://issues.apache.org/jira/browse/HUDI-3088
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Raymond Xu
>Assignee: Rahil Chertara
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> By default, when people check out the code, Spark 3 should be the active 
> profile for the repo. All tests should also run against the latest supported 
> Spark version. Correspondingly, the default Scala version becomes 2.12 and the 
> default Parquet version becomes 1.12.





[jira] [Updated] (HUDI-5757) Add Log Compaction to Write Operation docs

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5757:
-
Sprint: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28, Sprint 
2023-03-14  (was: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28)

> Add Log Compaction to Write Operation docs
> --
>
> Key: HUDI-5757
> URL: https://issues.apache.org/jira/browse/HUDI-5757
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.14.0
>
>






[jira] [Updated] (HUDI-5767) Add known regression of Hive Sync performance to release notes

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5767:
-
Sprint: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28, Sprint 
2023-03-14  (was: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28)

> Add known regression of Hive Sync performance to release notes
> --
>
> Key: HUDI-5767
> URL: https://issues.apache.org/jira/browse/HUDI-5767
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>
> This PR fixes the Hive Sync performance: 
> https://github.com/apache/hudi/pull/7561
> We should mention this as a known regression in the 0.12 release notes.





[jira] [Updated] (HUDI-5520) Fail MDT when list of log files grows unboundedly

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5520:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28, Sprint 2023-03-14  (was: 0.13.0 Final 
Sprint, 0.13.0 Final Sprint 2, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28)

> Fail MDT when list of log files grows unboundedly
> -
>
> Key: HUDI-5520
> URL: https://issues.apache.org/jira/browse/HUDI-5520
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.12.3
>
>






[jira] [Updated] (HUDI-5642) Enable schema reconciliation by default

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5642:
-
Sprint: 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28, Sprint 2023-03-14  (was: 0.13.0 Final Sprint 3, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28)

> Enable schema reconciliation by default
> ---
>
> Key: HUDI-5642
> URL: https://issues.apache.org/jira/browse/HUDI-5642
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Turn on schema reconciliation to allow wider/superset schema to be selected 
> as write schema.





[jira] [Updated] (HUDI-5677) [DOCS] Update AWS libs version

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5677:
-
Sprint: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28, Sprint 
2023-03-14  (was: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28)

> [DOCS] Update AWS libs version
> --
>
> Key: HUDI-5677
> URL: https://issues.apache.org/jira/browse/HUDI-5677
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: Sagar Sumit
>Priority: Major
>
> Update AWS libs version in https://hudi.apache.org/docs/s3_hoodie/





[jira] [Updated] (HUDI-5475) not able to generate utilities-slim bundle dependency tree

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5475:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28, Sprint 2023-03-14  (was: 0.13.0 Final 
Sprint, 0.13.0 Final Sprint 2, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28)

> not able to generate utilities-slim bundle dependency tree
> --
>
> Key: HUDI-5475
> URL: https://issues.apache.org/jira/browse/HUDI-5475
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies
>Reporter: Raymond Xu
>Assignee: Lokesh Jain
>Priority: Critical
>  Labels: pull-request-available
>
> Run the command:
> {code:bash}
> mvn com.github.ferstl:depgraph-maven-plugin:4.0.2:for-artifact \
>   -DgraphFormat=text -DshowGroupIds=true -DshowVersions=true -DrepeatTransitiveDependenciesInTextGraph \
>   -DgroupId=org.apache.hudi -DartifactId=hudi-utilities-slim-bundle_2.12 -Dversion=0.12.1
> {code}
> No tree is printed.





[jira] [Updated] (HUDI-3636) Clustering fails due to marker creation failure

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3636:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 
0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28, 
Sprint 2023-03-14  (was: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 
2022/10/18, 2022/11/01, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, 
Sprint 2023-02-28)

> Clustering fails due to marker creation failure
> ---
>
> Key: HUDI-3636
> URL: https://issues.apache.org/jira/browse/HUDI-3636
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: multi-writer
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Scenario: multi-writer test. One writer ingests with Deltastreamer in 
> continuous mode (COW, inserts, async clustering and cleaning; partitions 
> under 2022/1, 2022/2); another writer uses the Spark datasource to backfill 
> different partitions (2021/12).  
> 0.10.0 no MT, clustering instant is inflight (failing it in the middle before 
> upgrade) ➝ 0.11 MT, with multi-writer configuration the same as before.
> The clustering/replace instant cannot make progress due to a marker creation 
> failure, which fails the DS ingestion as well.  Need to investigate whether this 
> is related to timeline-server-based markers or to MT.
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
> stage 46.0 failed 1 times, most recent failure: Lost task 2.0 in stage 46.0 
> (TID 277) (192.168.70.231 executor driver): java.lang.RuntimeException: 
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE
> Connect to localhost:26754 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] 
> failed: Connection refused (Connection refused)
>     at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>     at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>     at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>     at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>     at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
>     at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
>     at scala.collection.AbstractIterator.to(Iterator.scala:1431)
>     at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
>     at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
>     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
>     at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
>     at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
>     at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>     at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org

[jira] [Updated] (HUDI-5752) Add feature docs for Change Data Capture

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5752:
-
Sprint: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28, Sprint 
2023-03-14  (was: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28)

> Add feature docs for Change Data Capture
> 
>
> Key: HUDI-5752
> URL: https://issues.apache.org/jira/browse/HUDI-5752
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-83) Map Timestamp type in spark to corresponding Timestamp type in Hive during Hive sync

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-83?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-83:
---
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28, Sprint 2023-03-14  (was: 0.13.0 Final 
Sprint, 0.13.0 Final Sprint 2, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28)

> Map Timestamp type in spark to corresponding Timestamp type in Hive during 
> Hive sync
> 
>
> Key: HUDI-83
> URL: https://issues.apache.org/jira/browse/HUDI-83
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive, meta-sync, Usability
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: cdmikechen
>Priority: Critical
>  Labels: pull-request-available, query-eng, sev:critical, 
> user-support-issues
> Fix For: 0.13.1
>
>
> [https://github.com/apache/incubator-hudi/issues/543] & related issues 





[jira] [Updated] (HUDI-3967) Automatic savepoint in Hudi

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3967:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, 
Sprint 2023-02-28, Sprint 2023-03-14  (was: 2022/08/22, 2022/09/05, 2022/09/19, 
2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 
Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28)

> Automatic savepoint in Hudi
> ---
>
> Key: HUDI-3967
> URL: https://issues.apache.org/jira/browse/HUDI-3967
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: table-service
>Reporter: Raymond Xu
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-3529) Improve dependency management and bundling

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3529:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, 
Sprint 2023-02-28, Sprint 2023-03-14  (was: 2022/08/22, 2022/09/05, 2022/09/19, 
2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 
Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28)

> Improve dependency management and bundling
> --
>
> Key: HUDI-3529
> URL: https://issues.apache.org/jira/browse/HUDI-3529
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: dependencies
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Critical
>  Labels: pull-request-available
>






[jira] [Updated] (HUDI-1574) Trim existing unit tests to finish in much shorter amount of time

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1574:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, 
Sprint 2023-02-28, Sprint 2023-03-14  (was: 2022/08/22, 2022/09/05, 2022/09/19, 
2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 
Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28)

> Trim existing unit tests to finish in much shorter amount of time
> -
>
> Key: HUDI-1574
> URL: https://issues.apache.org/jira/browse/HUDI-1574
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: Testing, tests-ci
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Critical
>
> spark-client-tests
> 278.165 s - in org.apache.hudi.table.TestHoodieMergeOnReadTable
> 201.628 s - in org.apache.hudi.metadata.TestHoodieBackedMetadata
> 185.716 s - in org.apache.hudi.client.TestHoodieClientOnCopyOnWriteStorage
> 158.361 s - in org.apache.hudi.index.TestHoodieIndex
> 156.196 s - in org.apache.hudi.table.TestCleaner
> 132.369 s - in 
> org.apache.hudi.table.action.commit.TestCopyOnWriteActionExecutor
> 93.307 s - in org.apache.hudi.table.action.compact.TestAsyncCompaction
> 67.301 s - in org.apache.hudi.table.upgrade.TestUpgradeDowngrade
> 45.794 s - in org.apache.hudi.client.TestHoodieReadClient
> 38.615 s - in org.apache.hudi.index.bloom.TestHoodieBloomIndex
> 31.181 s - in org.apache.hudi.client.TestTableSchemaEvolution
> 20.072 s - in org.apache.hudi.table.action.compact.TestInlineCompaction
> grep " Time elapsed" hudi-client/hudi-spark-client/target/surefire-reports/* 
> | awk -F',' ' { print $5 } ' | awk -F':' ' { print $2 } ' | sort -nr | less
> hudi-utilities
> 209.936 s - in org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer
> 204.653 s - in 
> org.apache.hudi.utilities.functional.TestHoodieMultiTableDeltaStreamer
> 34.116 s - in org.apache.hudi.utilities.sources.TestKafkaSource
> 29.865 s - in org.apache.hudi.utilities.sources.TestParquetDFSSource
> 26.189 s - in 
> org.apache.hudi.utilities.sources.helpers.TestDatePartitionPathSelector
> Other Tests
> 42.595 s - in org.apache.hudi.common.functional.TestHoodieLogFormat
> 38.918 s - in org.apache.hudi.common.bootstrap.TestBootstrapIndex
> 22.046 s - in 
> org.apache.hudi.common.functional.TestHoodieLogFormatAppendFailure
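The grep/awk pipeline quoted above ranks test classes by their "Time elapsed" value. The same ranking can be sketched in Java as below. This is only an illustration: `SlowTestRanker` and `Timing` are hypothetical names, the "Tests run" counts in the sample lines are made up, and the line layout is assumed to match the "... Time elapsed: N s - in ClassName" form shown in the list above (other Surefire versions may print it differently).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: ranks test classes by the "Time elapsed" value in Surefire summary lines. */
final class SlowTestRanker {

    /** One parsed summary line: test class name plus elapsed seconds. */
    static final class Timing {
        final String testClass;
        final double seconds;

        Timing(String testClass, double seconds) {
            this.testClass = testClass;
            this.seconds = seconds;
        }
    }

    // Matches e.g. "Time elapsed: 278.165 s - in org.apache.hudi.table.TestHoodieMergeOnReadTable"
    private static final Pattern ELAPSED =
        Pattern.compile("Time elapsed:\\s*([0-9]+(?:\\.[0-9]+)?)\\s*s\\s*-\\s*in\\s+(\\S+)");

    /** Returns the timings found in the given lines, slowest first; non-matching lines are skipped. */
    static List<Timing> rank(List<String> lines) {
        List<Timing> timings = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ELAPSED.matcher(line);
            if (m.find()) {
                timings.add(new Timing(m.group(2), Double.parseDouble(m.group(1))));
            }
        }
        timings.sort(Comparator.comparingDouble((Timing t) -> t.seconds).reversed());
        return timings;
    }

    public static void main(String[] args) {
        // Sample lines; the elapsed times come from the list above, the test counts are invented.
        List<String> sample = List.of(
            "Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 45.794 s - in org.apache.hudi.client.TestHoodieReadClient",
            "Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 278.165 s - in org.apache.hudi.table.TestHoodieMergeOnReadTable");
        for (Timing t : rank(sample)) {
            System.out.println(t.seconds + " s - in " + t.testClass);
        }
    }
}
```

Keeping the class name next to the timing (rather than printing only the number) makes it easier to pick which tests to trim first.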





[jira] [Updated] (HUDI-3249) Performance Improvements

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3249:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, 
Sprint 2023-02-28, Sprint 2023-03-14  (was: 2022/08/22, 2022/09/05, 2022/09/19, 
2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 
Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28)

> Performance Improvements
> 
>
> Key: HUDI-3249
> URL: https://issues.apache.org/jira/browse/HUDI-3249
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: writer-core
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>






[jira] [Updated] (HUDI-5464) Fix instantiation of a new partition in MDT re-using the same instant time as a regular commit

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5464:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, 
Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28, Sprint 2023-03-14  
(was: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 
2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28)

> Fix instantiation of a new partition in MDT re-using the same instant time as 
> a regular commit
> --
>
> Key: HUDI-5464
> URL: https://issues.apache.org/jira/browse/HUDI-5464
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> We re-use the same instant time as the commit being applied to MDT while 
> instantiating a new partition in MDT. This needs to be fixed. 
>  
> For example, let's say we have 10 commits with FILES already enabled. 
> For C11, we are enabling col-stats. 
> After the data table work is done, when we enter metadata writer 
> instantiation, we deduce that col-stats has to be instantiated and then 
> instantiate it using DC11. In the MDT timeline, we see dc11.req, 
> dc11.inflight and dc11.complete. We then go ahead and apply the actual C11 
> from DT to MDT (dc11.inflight and dc11.complete are updated). Here, we 
> overwrite the same DC11 with records pertaining to C11, which is buggy. We 
> definitely need to fix this. 
> We can add a suffix to C11 (say C11_003 or C11_001), as we do for compaction 
> and clean in MDT, so that any additional operation in MDT has a different 
> commit time format. For everything else, it should match DT one to one. 
>  
>  
> Impact:
> We are overriding the same DC for two purposes, which is bad. If there is a 
> crash after initializing col-stats and before applying the actual C11 (in the 
> above context), we might mistakenly roll back the col-stats initialization, 
> while the table config could still say that col-stats is fully ready to be 
> served. But while reading MDT, we may not read DC11 since it's a failed commit. 
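
The suffix idea proposed above can be sketched as follows. This is a minimal illustration, not Hudi code: `MdtInstantTimes`, `MdtOperation` and the `_003` value are hypothetical, and the MDT timeline is modeled as a plain map keyed by instant time.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: derive distinct MDT instant times from one data-table instant time. */
final class MdtInstantTimes {

    /** Hypothetical operation kinds; the suffix values are illustrative, not Hudi's actual constants. */
    enum MdtOperation {
        APPLY_DATA_TABLE_COMMIT(""),   // the regular delta commit keeps the data-table instant time
        INITIALIZE_PARTITION("_003");  // extra MDT-only work gets its own suffixed instant time

        final String suffix;

        MdtOperation(String suffix) {
            this.suffix = suffix;
        }
    }

    /** Builds the MDT instant time for the given data-table instant and operation kind. */
    static String instantTimeFor(String dataTableInstant, MdtOperation op) {
        return dataTableInstant + op.suffix;
    }

    public static void main(String[] args) {
        // Simulated MDT timeline keyed by instant time.
        Map<String, String> timeline = new LinkedHashMap<>();
        String c11 = "C11";
        timeline.put(instantTimeFor(c11, MdtOperation.INITIALIZE_PARTITION), "col-stats initialization");
        timeline.put(instantTimeFor(c11, MdtOperation.APPLY_DATA_TABLE_COMMIT), "records from C11");
        // Both entries survive because the keys differ; with a shared key the second put
        // would overwrite the first, which is the problem described in the ticket.
        timeline.forEach((instant, what) -> System.out.println(instant + " -> " + what));
    }
}
```

With distinct keys, a rollback of the failed C11 delta commit would no longer touch the partition initialization entry.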





[jira] [Updated] (HUDI-5442) Fix HiveHoodieTableFileIndex to use lazy listing

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5442:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, 
Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28, Sprint 2023-03-14  
(was: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 
2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28)

> Fix HiveHoodieTableFileIndex to use lazy listing
> 
>
> Key: HUDI-5442
> URL: https://issues.apache.org/jira/browse/HUDI-5442
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, trino-presto
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.13.1
>
>
> Currently, HiveHoodieTableFileIndex hard-codes the shouldListLazily to false, 
> using eager listing only.  This leads to scanning all table partitions in the 
> file index, regardless of the queryPaths provided (for Trino Hive connector, 
> only one partition is passed in).
> {code:java}
> public HiveHoodieTableFileIndex(HoodieEngineContext engineContext,
> HoodieTableMetaClient metaClient,
> TypedProperties configProperties,
> HoodieTableQueryType queryType,
> List queryPaths,
> Option specifiedQueryInstant,
> boolean shouldIncludePendingCommits
> ) {
>   super(engineContext,
>   metaClient,
>   configProperties,
>   queryType,
>   queryPaths,
>   specifiedQueryInstant,
>   shouldIncludePendingCommits,
>   true,
>   new NoopCache(),
>   false); // shouldListLazily is hard-coded to false, so listing is always eager
> } {code}
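
The cost difference between eager and lazy listing described above can be sketched with a small in-memory model. This is an assumption-laden illustration rather than Hudi's file index: `ListingModes` and its methods are hypothetical, and a map stands in for storage so that listing calls can be counted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of eager versus lazy partition listing; not Hudi's file index code. */
final class ListingModes {

    private final Map<String, List<String>> filesByPartition = new TreeMap<>();
    private int partitionsListed = 0;

    ListingModes(Map<String, List<String>> filesByPartition) {
        this.filesByPartition.putAll(filesByPartition);
    }

    /** Simulates one storage listing call for a single partition. */
    private List<String> listPartition(String partition) {
        partitionsListed++;
        return filesByPartition.getOrDefault(partition, List.of());
    }

    /** Eager: list every partition of the table, then keep only the requested ones. */
    List<String> listEagerly(List<String> queryPartitions) {
        List<String> result = new ArrayList<>();
        for (String partition : filesByPartition.keySet()) {
            List<String> files = listPartition(partition);
            if (queryPartitions.contains(partition)) {
                result.addAll(files);
            }
        }
        return result;
    }

    /** Lazy: list only the partitions the query actually asked for. */
    List<String> listLazily(List<String> queryPartitions) {
        List<String> result = new ArrayList<>();
        for (String partition : queryPartitions) {
            result.addAll(listPartition(partition));
        }
        return result;
    }

    int partitionsListed() {
        return partitionsListed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> table = Map.of(
            "2022/01", List.of("f1.parquet"),
            "2022/02", List.of("f2.parquet"),
            "2022/03", List.of("f3.parquet"));
        ListingModes eager = new ListingModes(table);
        eager.listEagerly(List.of("2022/02"));
        ListingModes lazy = new ListingModes(table);
        lazy.listLazily(List.of("2022/02"));
        System.out.println("eager listed " + eager.partitionsListed() + " partitions");
        System.out.println("lazy listed " + lazy.partitionsListed() + " partitions");
    }
}
```

Both modes return the same files for the requested partition; only the number of listing calls differs, which is what matters when a single partition is passed in.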
> After flipping it to true for testing, the following exception is thrown.
> {code:java}
> io.trino.spi.TrinoException: Failed to parse partition column values from the 
> partition-path: likely non-encoded slashes being used in partition column's 
> values. You can try to work this around by switching listing mode to eager
>     at 
> io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:284)
>     at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
>     at io.trino.$gen.Trino_39220221217_092723_2.run(Unknown Source)
>     at 
> io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>     at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: org.apache.hudi.exception.HoodieException: Failed to parse 
> partition column values from the partition-path: likely non-encoded slashes 
> being used in partition column's values. You can try to work this around by 
> switching listing mode to eager
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.parsePartitionColumnValues(BaseHoodieTableFileIndex.java:317)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.lambda$listPartitionPaths$6(BaseHoodieTableFileIndex.java:288)
>     at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
>     at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
>     at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
>     at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
>     at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
>     at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPaths(BaseHoodieTableFileIndex.java:291)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:205)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllInputFileSlices(BaseHoodieTableFileIndex.java:216)
>     at 
> org.apache.hudi.hadoop.HiveHoodieTableFileIndex.listFileSlices(HiveHoodieTableFileIndex.java:71)
>     at 
> org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:263)
>     at 
> org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:158)
>     at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
>     at 
> org.apache.hudi.hadoop.HoodieParquetInputFormatBase.getSplits(HoodieParquetInputFormatBase.java:68)
>     at 
> io.trino.plugin.hive.BackgroundHiveSplit

[jira] [Updated] (HUDI-5463) Apply rollback commits from data table as rollbacks in MDT instead of Delta commit

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5463:
-
Sprint: 0.13.0 Final Sprint, Sprint 2023-02-14, Sprint 2023-02-28, Sprint 
2023-03-14  (was: 0.13.0 Final Sprint, Sprint 2023-02-14, Sprint 2023-02-28)

> Apply rollback commits from data table as rollbacks in MDT instead of Delta 
> commit
> --
>
> Key: HUDI-5463
> URL: https://issues.apache.org/jira/browse/HUDI-5463
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.13.1
>
>
> As of now, any rollback in DT is another DC in MDT. This may not scale for 
> record level index in MDT since we have to add 1000s of delete records and 
> finally have to resolve all valid and invalid records. So, it's better to 
> roll back the commit in MDT as well instead of doing a DC. 
>  
> Impact: 
> Record level index is unusable without this change. While fixing other 
> rollback related tickets, do consider this as a possible option if it 
> simplifies other fixes. 





[jira] [Updated] (HUDI-4937) Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT readers

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4937:
-
Sprint: 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 
2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, 
Sprint 2023-02-14, Sprint 2023-02-28, Sprint 2023-03-14  (was: 2022/10/04, 
2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final 
Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-02-14, Sprint 
2023-02-28)

> Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT 
> readers
> -
>
> Key: HUDI-4937
> URL: https://issues.apache.org/jira/browse/HUDI-4937
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Currently, `HoodieTable` is holding a `HoodieBackedTableMetadata` that is set 
> up not to reuse the actual LogScanner and HFileReader used to read MT itself.
> This is proving to be wasteful on a number of occasions already, including 
> (not an exhaustive list):
> https://github.com/apache/hudi/issues/6373





[jira] [Updated] (HUDI-5498) Update docs for reading Hudi tables on Databricks runtime

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5498:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28, Sprint 2023-03-14  (was: 0.13.0 Final 
Sprint, 0.13.0 Final Sprint 2, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28)

> Update docs for reading Hudi tables on Databricks runtime
> -
>
> Key: HUDI-5498
> URL: https://issues.apache.org/jira/browse/HUDI-5498
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.12.3
>
>
> We need to document how users can read Hudi tables on Databricks Spark 
> runtime. 
> Relevant fix: [https://github.com/apache/hudi/pull/7088]
>  





[jira] [Updated] (HUDI-5665) Re-use table configs for subsequent writes

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5665:
-
Sprint: Sprint 2023-02-14, Sprint 2023-02-28, Sprint 2023-03-14  (was: 
Sprint 2023-02-14, Sprint 2023-02-28)

> Re-use table configs for subsequent writes
> --
>
> Key: HUDI-5665
> URL: https://issues.apache.org/jira/browse/HUDI-5665
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
>
> We expect users to set every table config along with every write operation. 
> For write configs this makes sense, but for table configs we should be able 
> to re-use the properties from the existing hoodie.properties. 
>  
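
The re-use described above can be sketched as an overlay of the user's options on top of the persisted table properties. This is a minimal sketch under stated assumptions: `TableConfigReuse` and `effectiveOptions` are hypothetical names, the property values are made up, and letting user-supplied values win on conflict is just one possible policy.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical sketch of filling in table configs from a persisted hoodie.properties on later writes. */
final class TableConfigReuse {

    /**
     * Returns the effective options for a write: everything the user passed, plus any
     * persisted table property the user did not repeat. User-supplied values win here;
     * whether a conflicting value should instead be rejected is left open.
     */
    static Map<String, String> effectiveOptions(Properties persistedTableProps,
                                                Map<String, String> userOptions) {
        Map<String, String> merged = new TreeMap<>();
        for (String key : persistedTableProps.stringPropertyNames()) {
            merged.put(key, persistedTableProps.getProperty(key));
        }
        merged.putAll(userOptions);
        return merged;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the contents of an existing hoodie.properties; the values are made up.
        Properties persisted = new Properties();
        persisted.load(new StringReader(
            "hoodie.table.name=trips\n"
                + "hoodie.table.type=COPY_ON_WRITE\n"));

        // A later write: the user sets only a write config and no table configs.
        Map<String, String> userOptions = Map.of("hoodie.upsert.shuffle.parallelism", "4");

        effectiveOptions(persisted, userOptions)
            .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```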





[jira] [Updated] (HUDI-5575) Support any record key generation along w/ any partition path generation for row writer

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5575:
-
Sprint: 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28, Sprint 2023-03-14  (was: 0.13.0 Final Sprint 3, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28)

> Support any record key generation along w/ any partition path generation for 
> row writer
> ---
>
> Key: HUDI-5575
> URL: https://issues.apache.org/jira/browse/HUDI-5575
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Lokesh Jain
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> HUDI-5535 adds support for record key generation along w/ any partition path 
> generation. It also separates the record key generation and partition path 
> generation into separate interfaces.
> This jira aims to add similar support for the row writer path in spark.
> cc [~shivnarayan] 





[jira] [Updated] (HUDI-2506) Hudi dependency governance

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2506:
-
Sprint: Sprint 2023-02-14, Sprint 2023-02-28, Sprint 2023-03-14  (was: 
Sprint 2023-02-14, Sprint 2023-02-28)

> Hudi dependency governance
> --
>
> Key: HUDI-2506
> URL: https://issues.apache.org/jira/browse/HUDI-2506
> Project: Apache Hudi
>  Issue Type: Test
>  Components: dependencies, Usability
>Reporter: vinoyang
>Assignee: Lokesh Jain
>Priority: Critical
> Fix For: 0.13.1, 0.14.0
>
>






[jira] [Updated] (HUDI-5552) Too slow while using trino-hudi connector while querying partitioned tables.

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5552:
-
Sprint: 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28, Sprint 2023-03-14  (was: 0.13.0 Final 
Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28)

> Too slow while using trino-hudi connector while querying partitioned tables.
> 
>
> Key: HUDI-5552
> URL: https://issues.apache.org/jira/browse/HUDI-5552
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: trino-presto
>Reporter: Danny Chen
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.14.0
>
>
> See the issue for details: [[SUPPORT] Too slow while using trino-hudi 
> connector while querying partitioned tables. · Issue #7643 · apache/hudi 
> (github.com)|https://github.com/apache/hudi/issues/7643]





[jira] [Updated] (HUDI-5756) Add Consistent Hashing Index to Indexing docs

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5756:
-
Sprint: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28, Sprint 
2023-03-14  (was: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28)

> Add Consistent Hashing Index to Indexing docs
> -
>
> Key: HUDI-5756
> URL: https://issues.apache.org/jira/browse/HUDI-5756
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Blocker
>






[jira] [Updated] (HUDI-5769) Partitions created by Async indexer could be deleted by regular writers

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5769:
-
Sprint: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28, Sprint 
2023-03-14  (was: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28)

> Partitions created by Async indexer could be deleted by regular writers
> ---
>
> Key: HUDI-5769
> URL: https://issues.apache.org/jira/browse/HUDI-5769
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> The regular writer has a flow where, if some MDT partition is not enabled 
> but is found in storage and listed among the table config's fully built out 
> partitions, Hudi deletes the metadata partition on the assumption that the 
> user wishes to disable it. 
> But this does not sit well with the async indexer. 
>  
> process1 -> Deltastreamer runs continuously. 
> no metadata configs set. 
> which means, default value for metadata enable = true and hence "files" 
> partition will be instantiated inline on first commit. 
> no value set for col stats enable. So, no action will be taken. 
>  
> process2: user starts HoodieIndexer for col stats partition. 
> Once indexer completes, tableConfig will add "col stats" as part of fully 
> built out metadata partition. 
>  
> While in process1, when deltastreamer goes to next write, it will detect that 
> col stats wasn't enabled (default value as per code), but tableConfig shows 
> that col stats is fully built out, and hence decides to delete the col stats 
> partition and updates the tableConfig. 
>  
>  
>  
>  
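
The interaction between the two processes described above can be sketched as a set difference. This is a simplified model, not Hudi's implementation: `MetadataPartitionCleanup` and `partitionsToDelete` are hypothetical names, and the partition labels "files" and "col stats" simply follow the wording of the description.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical model of the writer-side cleanup described above; names are illustrative only. */
final class MetadataPartitionCleanup {

    /**
     * Partitions the regular writer would delete: recorded as fully built in the
     * table config, yet not enabled in this writer's own configs.
     */
    static Set<String> partitionsToDelete(Set<String> builtInTableConfig,
                                          Set<String> enabledInWriterConfig) {
        Set<String> toDelete = new LinkedHashSet<>(builtInTableConfig);
        toDelete.removeAll(enabledInWriterConfig);
        return toDelete;
    }

    public static void main(String[] args) {
        // process1 (Deltastreamer): no metadata configs set, so only "files" is enabled by default.
        Set<String> writerEnabled = Set.of("files");

        // Table config before the async indexer runs.
        Set<String> tableConfig = new LinkedHashSet<>(Set.of("files"));
        System.out.println("before indexer: delete " + partitionsToDelete(tableConfig, writerEnabled));

        // process2 (HoodieIndexer) finishes and records col stats as fully built.
        tableConfig.add("col stats");
        System.out.println("after indexer:  delete " + partitionsToDelete(tableConfig, writerEnabled));
    }
}
```

In this model the writer cannot tell "the user disabled col stats" apart from "another process built col stats and this writer never set the config", which is the ambiguity the ticket points at.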





[jira] [Updated] (HUDI-5616) Docs update for specifying org.apache.spark.HoodieSparkKryoRegistrar

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5616:
-
Sprint: 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28, Sprint 2023-03-14  (was: 0.13.0 Final Sprint 3, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28)

> Docs update for specifying org.apache.spark.HoodieSparkKryoRegistrar
> 
>
> Key: HUDI-5616
> URL: https://issues.apache.org/jira/browse/HUDI-5616
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>
> There is a usability change in [this 
> PR|https://github.com/apache/hudi/pull/7702] that requires a new conf for 
> spark users
> --conf  spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar
> There will be a hit on performance (it was actually always there) if this is 
> not specified.
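
For reference, the setting above can be rendered as spark-submit arguments with a small helper like the one below. `KryoRegistrarConf` is a hypothetical name, and including `spark.serializer=org.apache.spark.serializer.KryoSerializer` alongside the registrator is an assumption on the grounds that a Kryo registrator only matters when Kryo serialization is in use; the ticket itself names only the `spark.kryo.registrator` setting.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that renders Spark settings as spark-submit "--conf key=value" arguments. */
final class KryoRegistrarConf {

    /** Settings for registering Hudi's Kryo registrar; the serializer line is an assumption. */
    static Map<String, String> hudiKryoSettings() {
        Map<String, String> conf = new LinkedHashMap<>();
        // Assumption: Kryo serialization is enabled, which the registrator setting relies on.
        conf.put("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        // The setting named in the ticket.
        conf.put("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar");
        return conf;
    }

    /** Turns each entry into a "--conf" flag followed by "key=value", preserving insertion order. */
    static List<String> toSubmitArgs(Map<String, String> conf) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            args.add("--conf");
            args.add(e.getKey() + "=" + e.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", toSubmitArgs(hudiKryoSettings())));
    }
}
```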





[jira] [Updated] (HUDI-5321) Fix Bulk Insert ColumnSortPartitioners

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5321:
-
Sprint: 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 
Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28, Sprint 
2023-03-14  (was: 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 
0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28)

> Fix Bulk Insert ColumnSortPartitioners
> --
>
> Key: HUDI-5321
> URL: https://issues.apache.org/jira/browse/HUDI-5321
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently, all of the Custom Bulk Insert ColumnSortPartitioner impls 
> incorrectly return "true" from the "arePartitionRecordsSorted" method, even 
> though records might not necessarily be sorted by the partition-path columns 
> as is required by this method.
> When such a Partitioner is used and the data is NOT sorted by a list of 
> columns that starts with the partition ones, this could lead to Parquet 
> writers being closed prematurely when writing files, creating a LOT of small files.
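
The small-file effect described above can be sketched with a simplified writer model. This is an illustration under assumptions, not Hudi's bulk insert code: `SortedPartitionWriterModel` is a hypothetical name, and the writer is assumed to keep one file open at a time and to close it whenever the partition path changes, as it may do when `arePartitionRecordsSorted` returns "true".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical model of a writer that trusts the "records are sorted by partition" promise. */
final class SortedPartitionWriterModel {

    /**
     * Counts the files produced when the writer keeps one file open at a time and
     * closes it as soon as the partition path of the incoming record changes.
     */
    static int filesWritten(List<String> partitionPathPerRecord) {
        int files = 0;
        String openPartition = null;
        for (String partition : partitionPathPerRecord) {
            if (!partition.equals(openPartition)) {
                files++;                  // close the current file (if any) and open a new one
                openPartition = partition;
            }
        }
        return files;
    }

    public static void main(String[] args) {
        // Records interleaved across two partitions, as when sorting by a non-partition column.
        List<String> unsorted = List.of("2022/01", "2022/02", "2022/01", "2022/02", "2022/01", "2022/02");
        List<String> sorted = new ArrayList<>(unsorted);
        sorted.sort(Comparator.naturalOrder());

        System.out.println("sorted by partition:   " + filesWritten(sorted) + " files");
        System.out.println("unsorted by partition: " + filesWritten(unsorted) + " files");
    }
}
```

In this model the file count for unsorted input grows with the number of partition switches rather than with the number of partitions, which is why a false "sorted" promise is costly.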





[jira] [Updated] (HUDI-5510) The latest written commit is not used when getInstantsToArchive

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5510:
-
Sprint: 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28, Sprint 2023-03-14  (was: 0.13.0 Final Sprint 3, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28)

> The latest written commit is not used when getInstantsToArchive
> ---
>
> Key: HUDI-5510
> URL: https://issues.apache.org/jira/browse/HUDI-5510
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: zouxxyy
>Assignee: Danny Chen
>Priority: Minor
>  Labels: pull-request-available
>






[jira] [Updated] (HUDI-5820) Improve Azure and GH CI's maven build with cache (3.9+)

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5820:
-
Sprint: Sprint 2023-02-14, Sprint 2023-02-28, Sprint 2023-03-14  (was: 
Sprint 2023-02-14, Sprint 2023-02-28)

> Improve Azure and GH CI's maven build with cache (3.9+)
> ---
>
> Key: HUDI-5820
> URL: https://issues.apache.org/jira/browse/HUDI-5820
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Raymond Xu
>Assignee: Jonathan Vexler
>Priority: Major
>
> Refer to PR https://github.com/apache/hudi/pull/7935
> For Azure, we can try downloading and installing maven 3.9 and use the custom 
> maven in the maven@4 task.
> For GH Actions CI, more investigation is needed.





[jira] [Updated] (HUDI-3775) Allow for offline compaction of MOR tables via spark streaming

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3775:
-
Sprint: 2022/09/05, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 
Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28, Sprint 
2023-03-14  (was: 2022/09/05, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 
0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28)

> Allow for offline compaction of MOR tables via spark streaming
> --
>
> Key: HUDI-3775
> URL: https://issues.apache.org/jira/browse/HUDI-3775
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, spark
>Reporter: Rajesh
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: easyfix, pull-request-available
> Fix For: 0.14.0
>
> Attachments: impressions.avro, run_stuff.txt, scala_commands.txt
>
>
> Currently there is no way to avoid compaction taking up a lot of resources 
> when run inline or async for MOR tables via Spark Streaming. Delta Streamer 
> has ways to assign resources between ingestion and async compaction but Spark 
> Streaming does not have that option. 
> Introducing a flag to turn off automatic compaction and allowing users to run 
> compaction in a separate process will decouple both concerns.
> This will also allow users to size the cluster just for ingestion and 
> deal with compaction separately, without blocking.  We will need to look into 
> documenting best practices for running offline compaction.





[jira] [Updated] (HUDI-4613) Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4613:
-
Sprint: 2022/09/05, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 
0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28, 
Sprint 2023-03-14  (was: 2022/09/05, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, 
Sprint 2023-02-28)

> Avoid the use of regex expressions when call hoodieFileGroup#addLogFile 
> function
> 
>
> Key: HUDI-4613
> URL: https://issues.apache.org/jira/browse/HUDI-4613
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: lei w
>Assignee: lei w
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> When the number of log files exceeds a certain threshold, the 
> construction of the fsview becomes very time-consuming. The reason is that 
> the LogFileComparator#compare method is called frequently when constructing a 
> file group, and this method uses regular expressions.
> {panel:title=build FileSystemView Log }
>  INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=60801, 
> NumFileGroups=200, FileGroupsCreationTime=34036, StoreTimeTaken=2
> {panel}
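
The idea in the description above can be illustrated with a minimal sketch: parse each log file name once with plain index lookups, cache the parsed fields, and let the comparator work only on those cached fields. The class `ParsedLogFile` and the name layout `.<fileId>_<baseCommitTime>.log.<version>` are assumptions made for illustration; this is not Hudi's `HoodieLogFile` or `LogFileComparator`, and the actual fix in the PR may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a log file; not Hudi's HoodieLogFile. */
final class ParsedLogFile {
  final String name;
  final String baseCommitTime;
  final int version;

  // Assumed layout, for illustration only: .<fileId>_<baseCommitTime>.log.<version>
  ParsedLogFile(String name) {
    int underscore = name.indexOf('_');
    int logMarker = name.indexOf(".log.", underscore);
    if (underscore < 0 || logMarker < 0) {
      throw new IllegalArgumentException("Unexpected log file name: " + name);
    }
    this.name = name;
    // Fields are parsed once here, so compare() never touches a regex.
    this.baseCommitTime = name.substring(underscore + 1, logMarker);
    this.version = Integer.parseInt(name.substring(logMarker + ".log.".length()));
  }

  /** Orders by base commit time, then by numeric log version. */
  static final Comparator<ParsedLogFile> BY_COMMIT_THEN_VERSION =
      Comparator.comparing((ParsedLogFile f) -> f.baseCommitTime)
          .thenComparingInt(f -> f.version);

  public static void main(String[] args) {
    List<ParsedLogFile> files = new ArrayList<>();
    files.add(new ParsedLogFile(".f1_001.log.10"));
    files.add(new ParsedLogFile(".f1_001.log.2"));
    files.add(new ParsedLogFile(".f1_000.log.5"));
    files.sort(BY_COMMIT_THEN_VERSION);
    for (ParsedLogFile f : files) {
      System.out.println(f.name);
    }
  }
}
```

With tens of thousands of log files, sorting triggers many comparisons per file, so moving the parsing cost out of `compare` and into a one-time constructor is where the saving would come from in a design like this.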





[jira] [Updated] (HUDI-2681) Make hoodie record_key and preCombine_key optional

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2681:
-
Sprint: 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28, Sprint 2023-03-14  (was: 0.13.0 Final 
Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28)

> Make hoodie record_key and preCombine_key optional
> --
>
> Key: HUDI-2681
> URL: https://issues.apache.org/jira/browse/HUDI-2681
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core, spark-sql, writer-core
>Reporter: Vinoth Govindarajan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> At present, Hudi needs a record key and a preCombine key to create a Hudi 
> dataset, which puts a restriction on the kinds of datasets we can create 
> using Hudi.
>  
> In order to increase the adoption of Hudi file format across all kinds of 
> derived datasets, similar to Parquet/ORC, we need to offer flexibility to 
> users. I understand that record key is used for upsert primitive and we need 
> preCombine key to break the tie and deduplicate, but there are event data and 
> other datasets without any primary key (append only datasets), which can 
> benefit from Hudi since Hudi ecosystem offers other features such as snapshot 
> isolation, indexes, clustering, delta streamer etc., which could be applied 
> to any datasets without record key.
>  
> The idea of this proposal is to make both the record key and preCombine key 
> optional to allow a variety of new use cases on top of Hudi.





[jira] [Updated] (HUDI-5423) Flaky test: ColumnStatsTestCase(MERGE_ON_READ,true,true)

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5423:
-
Sprint: 0.13.0 Final Sprint, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28, Sprint 2023-03-14  (was: 0.13.0 Final Sprint, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28)

> Flaky test: ColumnStatsTestCase(MERGE_ON_READ,true,true)
> 
>
> Key: HUDI-5423
> URL: https://issues.apache.org/jira/browse/HUDI-5423
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Raymond Xu
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.14.0
>
>
> {code}
> [ERROR] Tests run: 94, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 
> 1,729.267 s <<< FAILURE! - in JUnit Vintage
> [ERROR] [8] 
> ColumnStatsTestCase(MERGE_ON_READ,true,true)(testMetadataColumnStatsIndex(ColumnStatsTestCase))
>   Time elapsed: 23.246 s  <<< FAILURE!
> org.opentest4j.AssertionFailedError: 
> expected: 
> <{"c1_maxValue":101,"c1_minValue":101,"c1_nullCount":0,"c2_maxValue":" 
> 999sdc","c2_minValue":" 
> 999sdc","c2_nullCount":0,"c3_maxValue":10.329,"c3_minValue":10.329,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.179Z","c4_minValue":"2021-11-19T07:34:44.179Z","c4_nullCount":0,"c5_maxValue":99,"c5_minValue":99,"c5_nullCount":0,"c6_maxValue":"2020-03-28","c6_minValue":"2020-03-28","c6_nullCount":0,"c7_maxValue":"SA==","c7_minValue":"SA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":1}
> {"c1_maxValue":562,"c1_minValue":323,"c1_nullCount":0,"c2_maxValue":" 
> 984sdc","c2_minValue":" 
> 980sdc","c2_nullCount":0,"c3_maxValue":977.328,"c3_minValue":64.768,"c3_nullCount":1,"c4_maxValue":"2021-11-19T07:34:44.201Z","c4_minValue":"2021-11-19T07:34:44.181Z","c4_nullCount":0,"c5_maxValue":78,"c5_minValue":34,"c5_nullCount":0,"c6_maxValue":"2020-10-21","c6_minValue":"2020-01-15","c6_nullCount":0,"c7_maxValue":"SA==","c7_minValue":"qw==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":4}
> {"c1_maxValue":568,"c1_minValue":8,"c1_nullCount":0,"c2_maxValue":" 
> 8sdc","c2_minValue":" 
> 111sdc","c2_nullCount":0,"c3_maxValue":979.272,"c3_minValue":82.111,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.193Z","c4_minValue":"2021-11-19T07:34:44.159Z","c4_nullCount":0,"c5_maxValue":58,"c5_minValue":2,"c5_nullCount":0,"c6_maxValue":"2020-11-08","c6_minValue":"2020-01-01","c6_nullCount":0,"c7_maxValue":"9g==","c7_minValue":"Ag==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":15}
> {"c1_maxValue":619,"c1_minValue":619,"c1_nullCount":0,"c2_maxValue":" 
> 985sdc","c2_minValue":" 
> 985sdc","c2_nullCount":0,"c3_maxValue":230.320,"c3_minValue":230.320,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.180Z","c4_minValue":"2021-11-19T07:34:44.180Z","c4_nullCount":0,"c5_maxValue":33,"c5_minValue":33,"c5_nullCount":0,"c6_maxValue":"2020-02-13","c6_minValue":"2020-02-13","c6_nullCount":0,"c7_maxValue":"QA==","c7_minValue":"QA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":1}
> {"c1_maxValue":633,"c1_minValue":624,"c1_nullCount":0,"c2_maxValue":" 
> 987sdc","c2_minValue":" 
> 986sdc","c2_nullCount":0,"c3_maxValue":580.317,"c3_minValue":375.308,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.180Z","c4_minValue":"2021-11-19T07:34:44.180Z","c4_nullCount":0,"c5_maxValue":33,"c5_minValue":32,"c5_nullCount":0,"c6_maxValue":"2020-10-10","c6_minValue":"2020-01-01","c6_nullCount":0,"c7_maxValue":"PQ==","c7_minValue":"NA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":2}
> {"c1_maxValue":639,"c1_minValue":555,"c1_nullCount":0,"c2_maxValue":" 
> 989sdc","c2_minValue":" 
> 982sdc","c2_nullCount":0,"c3_maxValue":904.304,"c3_minValue":153.431,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.186Z","c4_minValue":"2021-11-19T07:34:44.179Z","c4_nullCount":0,"c5_maxValue":44,"c5_minValue":31,"c5_nullCount":0,"c6_maxValue":"2020-08-25","c6_minValue":"2020-03-12","c6_nullCount":0,"c7_maxValue":"MA==","c7_minValue":"rw==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":3}
> {"c1_maxValue":715,"c1_minValue":76,"c1_nullCount":0,"c2_maxValue":" 
> 76sdc","c2_minValue":" 
> 224sdc","c2_nullCount":0,"c3_maxValue":958.579,"c3_minValue":246.427,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.199Z","c4_minValue":"2021-11-19T07:34:44.166Z","c4_nullCount":0,"c5_maxValue":73,"c5_minValue":9,"c5_nullCount":0,"c6_maxValue":"2020-11-21","c6_minValue":"2020-01-16","c6_nullCount":0,"c7_maxValue":"+g==","c7_minValue":"LA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":12}
> {"c1_maxValue":768,"c1_minValue":59,"c1_nullCount":0,"c2_maxValue":" 
> 768sdc","c2_minValue":" 

[jira] [Updated] (HUDI-5352) Jackson fails to serialize LocalDate when updating Delta Commit metadata

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5352:
-
Sprint: 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 
Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28, Sprint 
2023-03-14  (was: 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 
0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28)

> Jackson fails to serialize LocalDate when updating Delta Commit metadata
> 
>
> Key: HUDI-5352
> URL: https://issues.apache.org/jira/browse/HUDI-5352
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Alexey Kudinkin
>Assignee: Raymond Xu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Currently, running TestColumnStatsIndex on Spark 3.3 fails the MOR tests because 
> Jackson cannot serialize LocalDate as is and requires an 
> additional JSR310 dependency.
>  
>  





[jira] [Updated] (HUDI-5323) Decouple virtual key with writing bloom filters to parquet files

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5323:
-
Sprint: 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 
Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28, Sprint 
2023-03-14  (was: 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 
0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28)

> Decouple virtual key with writing bloom filters to parquet files
> 
>
> Key: HUDI-5323
> URL: https://issues.apache.org/jira/browse/HUDI-5323
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: index, writer-core
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
>
> When the virtual key feature is enabled by setting 
> hoodie.populate.meta.fields to false, the bloom filters are not written to 
> parquet base files in the write transactions.  Relevant logic in 
> HoodieFileWriterFactory class:
> {code:java}
> private static HoodieFileWriter newParquetFileWriter(
>     String instantTime, Path path, HoodieWriteConfig config, Schema schema, HoodieTable hoodieTable,
>     TaskContextSupplier taskContextSupplier, boolean populateMetaFields) throws IOException {
>   return newParquetFileWriter(instantTime, path, config, schema, hoodieTable.getHadoopConf(),
>       taskContextSupplier, populateMetaFields, populateMetaFields);
> }
>
> private static HoodieFileWriter newParquetFileWriter(
>     String instantTime, Path path, HoodieWriteConfig config, Schema schema, Configuration conf,
>     TaskContextSupplier taskContextSupplier, boolean populateMetaFields, boolean enableBloomFilter) throws IOException {
>   Option filter = enableBloomFilter ? Option.of(createBloomFilter(config)) : Option.empty();
>   HoodieAvroWriteSupport writeSupport = new HoodieAvroWriteSupport(new AvroSchemaConverter(conf).convert(schema), schema, filter);
>   HoodieParquetConfig parquetConfig = new HoodieParquetConfig<>(writeSupport, config.getParquetCompressionCodec(),
>       config.getParquetBlockSize(), config.getParquetPageSize(), config.getParquetMaxFileSize(),
>       conf, config.getParquetCompressionRatio(), config.parquetDictionaryEnabled());
>   return new HoodieAvroParquetWriter<>(path, parquetConfig, instantTime, taskContextSupplier, populateMetaFields);
> }
> {code}
> Given that bloom filters are absent, when using Bloom Index on the same 
> table, the writer encounters an NPE (HUDI-5319).
> We should decouple the virtual key feature from the bloom filter and always write 
> the bloom filters to the parquet files. 
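
The coupling described above can be reduced to a small, self-contained sketch. The class `BloomFilterDecision` and the string placeholder for the filter are hypothetical and stand in for Hudi's real writer factory and `BloomFilter` type; the sketch only contrasts the current decision (filter tied to `populateMetaFields`) with the proposed one (filter always created).

```java
import java.util.Optional;

/** Hypothetical, simplified model of the decision; not Hudi's HoodieFileWriterFactory. */
final class BloomFilterDecision {

  /** Behaviour described in the issue: the filter exists only when meta fields are populated. */
  static Optional<String> coupled(boolean populateMetaFields) {
    boolean enableBloomFilter = populateMetaFields; // same flag passed twice, as in the quoted code
    return enableBloomFilter ? Optional.of("bloom-filter") : Optional.empty();
  }

  /** Proposed behaviour: always create the filter, regardless of the virtual-key setting. */
  static Optional<String> decoupled(boolean populateMetaFields) {
    return Optional.of("bloom-filter");
  }

  public static void main(String[] args) {
    // Virtual keys enabled means populateMetaFields == false.
    System.out.println("coupled:   " + coupled(false).isPresent());
    System.out.println("decoupled: " + decoupled(false).isPresent());
  }
}
```

In the coupled variant a reader that expects a filter would find none when virtual keys are enabled, which matches the NPE scenario the issue refers to.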





[jira] [Closed] (HUDI-5853) Add infer function for BQ sync configs

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-5853.

Fix Version/s: 0.13.1
   0.12.3
   Resolution: Fixed

> Add infer function for BQ sync configs
> --
>
> Key: HUDI-5853
> URL: https://issues.apache.org/jira/browse/HUDI-5853
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.12.3
>
>






[jira] [Closed] (HUDI-5753) Add feature docs for Record Payload

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-5753.

Resolution: Fixed

> Add feature docs for Record Payload
> ---
>
> Key: HUDI-5753
> URL: https://issues.apache.org/jira/browse/HUDI-5753
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>






[jira] [Closed] (HUDI-5754) Add detailed description of GCS Incr, Proto Kafka, and Pulsar Sources in Deltastreamer page

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-5754.

Resolution: Fixed

> Add detailed description of GCS Incr, Proto Kafka, and Pulsar Sources in 
> Deltastreamer page
> ---
>
> Key: HUDI-5754
> URL: https://issues.apache.org/jira/browse/HUDI-5754
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
>






[jira] [Closed] (HUDI-5751) Add feature docs for Metaserver

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-5751.

Resolution: Fixed

> Add feature docs for Metaserver
> ---
>
> Key: HUDI-5751
> URL: https://issues.apache.org/jira/browse/HUDI-5751
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.14.0
>
>






[jira] [Updated] (HUDI-2506) Hudi dependency governance

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2506:
-
Sprint: Sprint 2023-02-14, Sprint 2023-02-28  (was: Sprint 2023-02-14)

> Hudi dependency governance
> --
>
> Key: HUDI-2506
> URL: https://issues.apache.org/jira/browse/HUDI-2506
> Project: Apache Hudi
>  Issue Type: Test
>  Components: dependencies, Usability
>Reporter: vinoyang
>Assignee: Lokesh Jain
>Priority: Critical
> Fix For: 0.13.1, 0.14.0
>
>






[jira] [Updated] (HUDI-4613) Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4613:
-
Sprint: 2022/09/05, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 
0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28  
(was: 2022/09/05, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 
0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14)

> Avoid the use of regex expressions when call hoodieFileGroup#addLogFile 
> function
> 
>
> Key: HUDI-4613
> URL: https://issues.apache.org/jira/browse/HUDI-4613
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: lei w
>Assignee: lei w
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> When the number of log files exceeds a certain threshold, the 
> construction of the fsview becomes very time-consuming. The reason is that 
> the LogFileComparator#compare method is called frequently when constructing a 
> file group, and this method uses regular expressions.
> {panel:title=build FileSystemView Log }
>  INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=60801, 
> NumFileGroups=200, FileGroupsCreationTime=34036, StoreTimeTaken=2
> {panel}





[jira] [Updated] (HUDI-5769) Partitions created by Async indexer could be deleted by regular writers

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5769:
-
Sprint: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28  (was: 
Sprint 2023-01-31, Sprint 2023-02-14)

> Partitions created by Async indexer could be deleted by regular writers
> ---
>
> Key: HUDI-5769
> URL: https://issues.apache.org/jira/browse/HUDI-5769
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> In the regular writer there is a flow where, if an MDT partition is not 
> enabled in the writer's configs but is found in storage and listed among the 
> table config's fully built-out partitions, Hudi deletes the metadata partition 
> on the assumption that the user wishes to disable it. 
> But this does not sit well with the async indexer. 
>  
> process1: Deltastreamer runs continuously. 
> No metadata configs are set, 
> which means the default value for metadata enable is true, and hence the "files" 
> partition will be instantiated inline on the first commit. 
> No value is set for col stats enable, so no action will be taken. 
>  
> process2: the user starts HoodieIndexer for the col stats partition. 
> Once the indexer completes, tableConfig will list "col stats" as a fully 
> built-out metadata partition. 
>  
> Back in process1, when Deltastreamer goes to the next write, it will detect that 
> col stats is not enabled (the default value as per code) but that tableConfig shows 
> col stats as fully built out, and hence it decides to delete the col stats 
> partition and updates the tableConfig. 
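
The conflict between the two processes can be sketched as a set difference. The class and method names below are hypothetical and the partition names are placeholders; the sketch only models the rule described above (delete whatever is built out per the table config but not enabled in this writer's config), not Hudi's actual metadata writer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical model of the regular writer's cleanup rule; not Hudi's implementation. */
final class MetadataPartitionCleanupSketch {

  /** Partitions the writer would delete: built out per table config, but not enabled in its own config. */
  static Set<String> partitionsToDelete(Set<String> enabledInWriterConfig,
                                        Set<String> fullyBuiltInTableConfig) {
    Set<String> toDelete = new LinkedHashSet<>(fullyBuiltInTableConfig);
    toDelete.removeAll(enabledInWriterConfig);
    return toDelete;
  }

  public static void main(String[] args) {
    // process1 (regular writer): only "files" is enabled by default.
    Set<String> writerEnabled = new HashSet<>(Arrays.asList("files"));
    // After process2 (async indexer) completes, the table config also lists col stats.
    Set<String> tableConfigBuilt = new HashSet<>(Arrays.asList("files", "column_stats"));
    System.out.println(partitionsToDelete(writerEnabled, tableConfigBuilt));
  }
}
```

Under this rule the writer cannot tell "the user disabled col stats" apart from "another process built col stats and this writer never set the config", which is the ambiguity the issue describes.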
>  
>  
>  
>  





[jira] [Updated] (HUDI-5849) Sync hudi configs to catalog table

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5849:
-
Sprint: Sprint 2023-02-14, Sprint 2023-02-28  (was: Sprint 2023-02-14)

> Sync hudi configs to catalog table
> --
>
> Key: HUDI-5849
> URL: https://issues.apache.org/jira/browse/HUDI-5849
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Update hudi configs to meta sync catalogs like Glue catalog, HMS and datahub





[jira] [Updated] (HUDI-3249) Performance Improvements

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3249:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, 
Sprint 2023-02-28  (was: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 
2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final 
Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 
2023-02-14)

> Performance Improvements
> 
>
> Key: HUDI-3249
> URL: https://issues.apache.org/jira/browse/HUDI-3249
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: writer-core
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>






[jira] [Updated] (HUDI-5464) Fix instantiation of a new partition in MDT re-using the same instant time as a regular commit

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5464:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, 
Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28  (was: 0.13.0 Final 
Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 
2023-02-14)

> Fix instantiation of a new partition in MDT re-using the same instant time as 
> a regular commit
> --
>
> Key: HUDI-5464
> URL: https://issues.apache.org/jira/browse/HUDI-5464
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> We re-use the same instant time as the commit being applied to MDT while 
> instantiating a new partition in MDT. This needs to be fixed. 
>  
> For example, let's say we have 10 commits with FILES already enabled. 
> For C11, we are enabling col-stats. 
> After the data table work, when we enter metadata writer instantiation, we 
> deduce that col-stats has to be instantiated and then instantiate it using DC11. 
> In the MDT timeline, we see dc11.req, dc11.inflight and dc11.complete. Then 
> we go ahead and apply the actual C11 from DT to MDT (dc11.inflight and 
> dc11.complete are updated). Here, we overwrite the same DC11 with records 
> pertaining to C11, 
> which is buggy. We definitely need to fix this. 
> We can add a suffix to C11 (say C11_003 or C11_001), as we do for compaction 
> and clean in MDT, so that any additional operation in MDT has a different commit 
> time format. For everything else, it should match DT one to one. 
>  
>  
> Impact:
> We are overriding the same DC for two purposes, which is bad. If there is a 
> crash after initializing col-stats and before applying the actual C11 (in the above 
> context), we might mistakenly roll back the col-stats initialization, while the 
> table config could still say that col stats is fully ready to be served. But while 
> reading MDT, we may not read DC11 since it's a failed commit. 
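
The suffix proposal above could look roughly like the following sketch. The suffix value, the class name and the method names are all hypothetical, chosen only to show that the partition-initialization instant and the regular delta commit end up with distinct timestamps in the MDT timeline; the format Hudi eventually adopted may differ.

```java
/** Hypothetical sketch of deriving a distinct MDT instant for partition initialization. */
final class MdtInstantTimeSketch {

  // Assumed suffix, for illustration only; the issue mentions values such as 001 or 003.
  static final String PARTITION_INIT_SUFFIX = "001";

  /** Instant time used in MDT when applying the regular data table commit: matches DT one to one. */
  static String regularCommitInstant(String dataTableInstant) {
    return dataTableInstant;
  }

  /** Instant time used in MDT when initializing a new partition as part of that commit. */
  static String partitionInitInstant(String dataTableInstant) {
    return dataTableInstant + PARTITION_INIT_SUFFIX;
  }

  public static void main(String[] args) {
    String c11 = "20230325101500";
    System.out.println("regular commit in MDT: " + regularCommitInstant(c11));
    System.out.println("partition init in MDT: " + partitionInitInstant(c11));
  }
}
```

Because the two instants no longer collide, rolling back a failed regular commit would not touch the files written by the partition initialization in a scheme like this.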





[jira] [Updated] (HUDI-1574) Trim existing unit tests to finish in much shorter amount of time

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1574:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, 
Sprint 2023-02-28  (was: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 
2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final 
Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 
2023-02-14)

> Trim existing unit tests to finish in much shorter amount of time
> -
>
> Key: HUDI-1574
> URL: https://issues.apache.org/jira/browse/HUDI-1574
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: Testing, tests-ci
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Critical
>
> spark-client-tests
> 278.165 s - in org.apache.hudi.table.TestHoodieMergeOnReadTable
> 201.628 s - in org.apache.hudi.metadata.TestHoodieBackedMetadata
> 185.716 s - in org.apache.hudi.client.TestHoodieClientOnCopyOnWriteStorage
> 158.361 s - in org.apache.hudi.index.TestHoodieIndex
> 156.196 s - in org.apache.hudi.table.TestCleaner
> 132.369 s - in 
> org.apache.hudi.table.action.commit.TestCopyOnWriteActionExecutor
> 93.307 s - in org.apache.hudi.table.action.compact.TestAsyncCompaction
> 67.301 s - in org.apache.hudi.table.upgrade.TestUpgradeDowngrade
> 45.794 s - in org.apache.hudi.client.TestHoodieReadClient
> 38.615 s - in org.apache.hudi.index.bloom.TestHoodieBloomIndex
> 31.181 s - in org.apache.hudi.client.TestTableSchemaEvolution
> 20.072 s - in org.apache.hudi.table.action.compact.TestInlineCompaction
> grep " Time elapsed" hudi-client/hudi-spark-client/target/surefire-reports/* 
> | awk -F',' ' { print $5 } ' | awk -F':' ' { print $2 } ' | sort -nr | less
> hudi-utilities
> 209.936 s - in org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer
> 204.653 s - in 
> org.apache.hudi.utilities.functional.TestHoodieMultiTableDeltaStreamer
> 34.116 s - in org.apache.hudi.utilities.sources.TestKafkaSource
> 29.865 s - in org.apache.hudi.utilities.sources.TestParquetDFSSource
> 26.189 s - in 
> org.apache.hudi.utilities.sources.helpers.TestDatePartitionPathSelector
> Other Tests
> 42.595 s - in org.apache.hudi.common.functional.TestHoodieLogFormat
> 38.918 s - in org.apache.hudi.common.bootstrap.TestBootstrapIndex
> 22.046 s - in 
> org.apache.hudi.common.functional.TestHoodieLogFormatAppendFailure





[jira] [Updated] (HUDI-3636) Clustering fails due to marker creation failure

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3636:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 
0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28  
(was: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 2022/11/01, 
2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 
Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14)

> Clustering fails due to marker creation failure
> ---
>
> Key: HUDI-3636
> URL: https://issues.apache.org/jira/browse/HUDI-3636
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: multi-writer
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Scenario: multi-writer test. One writer ingests with Deltastreamer in 
> continuous mode (COW, inserts, async clustering and cleaning; partitions 
> under 2022/1, 2022/2); another writer uses the Spark datasource to do backfills 
> to different partitions (2021/12).  
> 0.10.0 with no MT, clustering instant is inflight (failing it in the middle before 
> upgrade) ➝ 0.11 with MT, with multi-writer configuration the same as before.
> The clustering/replace instant cannot make progress due to marker creation 
> failure, failing the DS ingestion as well.  Need to investigate whether this is 
> related to timeline-server-based markers or to MT.
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
> stage 46.0 failed 1 times, most recent failure: Lost task 2.0 in stage 46.0 
> (TID 277) (192.168.70.231 executor driver): java.lang.RuntimeException: 
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE
> Connect to localhost:26754 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] 
> failed: Connection refused (Connection refused)
>     at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>     at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>     at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>     at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>     at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
>     at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
>     at scala.collection.AbstractIterator.to(Iterator.scala:1431)
>     at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
>     at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
>     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
>     at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
>     at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
>     at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>     at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieRemoteExce

[jira] [Updated] (HUDI-5649) Unify all the loggers to slf4j

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5649:
-
Sprint: 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28  (was: 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14)

> Unify all the loggers to slf4j
> --
>
> Key: HUDI-5649
> URL: https://issues.apache.org/jira/browse/HUDI-5649
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5442) Fix HiveHoodieTableFileIndex to use lazy listing

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5442:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, 
Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28  (was: 0.13.0 Final 
Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 
2023-02-14)

> Fix HiveHoodieTableFileIndex to use lazy listing
> 
>
> Key: HUDI-5442
> URL: https://issues.apache.org/jira/browse/HUDI-5442
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, trino-presto
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.13.1
>
>
> Currently, HiveHoodieTableFileIndex hard-codes shouldListLazily to false, so 
> only eager listing is used.  As a result, the file index scans all table 
> partitions regardless of the queryPaths provided (the Trino Hive connector 
> passes in only one partition).
> {code:java}
> public HiveHoodieTableFileIndex(HoodieEngineContext engineContext,
>                                 HoodieTableMetaClient metaClient,
>                                 TypedProperties configProperties,
>                                 HoodieTableQueryType queryType,
>                                 List queryPaths,
>                                 Option specifiedQueryInstant,
>                                 boolean shouldIncludePendingCommits
> ) {
>   super(engineContext,
>       metaClient,
>       configProperties,
>       queryType,
>       queryPaths,
>       specifiedQueryInstant,
>       shouldIncludePendingCommits,
>       true,
>       new NoopCache(),
>       false); // shouldListLazily, hard-coded to false (eager listing)
> } {code}
> After flipping it to true for testing, the following exception is thrown.
> {code:java}
> io.trino.spi.TrinoException: Failed to parse partition column values from the 
> partition-path: likely non-encoded slashes being used in partition column's 
> values. You can try to work this around by switching listing mode to eager
>     at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:284)
>     at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
>     at io.trino.$gen.Trino_39220221217_092723_2.run(Unknown Source)
>     at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>     at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: org.apache.hudi.exception.HoodieException: Failed to parse 
> partition column values from the partition-path: likely non-encoded slashes 
> being used in partition column's values. You can try to work this around by 
> switching listing mode to eager
>     at org.apache.hudi.BaseHoodieTableFileIndex.parsePartitionColumnValues(BaseHoodieTableFileIndex.java:317)
>     at org.apache.hudi.BaseHoodieTableFileIndex.lambda$listPartitionPaths$6(BaseHoodieTableFileIndex.java:288)
>     at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
>     at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
>     at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
>     at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
>     at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
>     at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
>     at org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPaths(BaseHoodieTableFileIndex.java:291)
>     at org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:205)
>     at org.apache.hudi.BaseHoodieTableFileIndex.getAllInputFileSlices(BaseHoodieTableFileIndex.java:216)
>     at org.apache.hudi.hadoop.HiveHoodieTableFileIndex.listFileSlices(HiveHoodieTableFileIndex.java:71)
>     at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:263)
>     at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:158)
>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
>     at org.apache.hudi.hadoop.HoodieParquetInputFormatBase.getSplits(HoodieParquetInputFormatBase.java:68)
>     at io.trino.plugin.hive.BackgroundHiveSplitLoader.lambda$loadPartition$2(Backgrou

[jira] [Updated] (HUDI-5569) Files written by first commit/delta commit if it failed is detected as valid data files

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5569:
-
Sprint: 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28  (was: 0.13.0 Final Sprint 2, 0.13.0 Final 
Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14)

> Files written by first commit/delta commit if it failed is detected as valid 
> data files
> ---
>
> Key: HUDI-5569
> URL: https://issues.apache.org/jira/browse/HUDI-5569
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> We have a method in HoodieFileGroup which detects whether a file group is 
> committed or not. If the timeline is as follows: 
> c1.inflight
> c2.complete
> c3.complete
>  
> then the check for c1 returns true, even though c1 is still inflight. 
> HoodieFileGroup.java
> {code:java}
> /**
>  * A FileSlice is considered committed, if one of the following is true - 
>  * There is a committed data file - There are some log files, that are based 
>  * off a commit or delta commit.
>  */
> private boolean isFileSliceCommitted(FileSlice slice) {
>   if (!compareTimestamps(slice.getBaseInstantTime(), LESSER_THAN_OR_EQUALS, lastInstant.get().getTimestamp())) {
>     return false;
>   }
>   return timeline.containsOrBeforeTimelineStarts(slice.getBaseInstantTime());
> } {code}
> HoodieDefaultTimeline: 
> {code:java}
> @Override
> public boolean containsOrBeforeTimelineStarts(String instant) {
>   return getInstantsAsStream().anyMatch(s -> s.getTimestamp().equals(instant)) || isBeforeTimelineStarts(instant);
> } {code}
>  
> This needs to be fixed. 
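
To make the reported behaviour concrete, here is a minimal, self-contained sketch that does not use any Hudi classes. `Instant`, `State` and the simplified `containsOrBeforeTimelineStarts` below are hypothetical stand-ins, and the string comparison used for "before the timeline starts" is a simplification. The ticket does not say which of the two clauses lets c1 through, so the sketch tries both a timeline holding all instants and one holding only completed instants (both are assumptions).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-ins for timeline instants; these are NOT Hudi classes.
final class FileSliceCommitCheckSketch {

  enum State { INFLIGHT, COMPLETED }

  static final class Instant {
    final String timestamp;
    final State state;

    Instant(String timestamp, State state) {
      this.timestamp = timestamp;
      this.state = state;
    }
  }

  // Simplified version of the quoted check: true if some instant carries the
  // timestamp (its state is ignored), or if the timestamp sorts before the
  // first instant of the timeline.
  static boolean containsOrBeforeTimelineStarts(List<Instant> timeline, String ts) {
    boolean contains = timeline.stream().anyMatch(i -> i.timestamp.equals(ts));
    boolean beforeStart = !timeline.isEmpty()
        && ts.compareTo(timeline.get(0).timestamp) < 0;
    return contains || beforeStart;
  }

  // The timeline from the description: c1 inflight, c2 and c3 complete.
  static List<Instant> fullTimeline() {
    return Arrays.asList(
        new Instant("c1", State.INFLIGHT),
        new Instant("c2", State.COMPLETED),
        new Instant("c3", State.COMPLETED));
  }

  // Keeps only the completed instants, preserving their order.
  static List<Instant> completedOnly(List<Instant> timeline) {
    return timeline.stream()
        .filter(i -> i.state == State.COMPLETED)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Instant> full = fullTimeline();
    // All instants present: c1 matches by timestamp although it is inflight.
    System.out.println(containsOrBeforeTimelineStarts(full, "c1"));
    // Completed instants only: c1 is absent but sorts before c2.
    System.out.println(containsOrBeforeTimelineStarts(completedOnly(full), "c1"));
  }
}
```

Under either assumption the check accepts c1, which matches the behaviour described in the ticket.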





[jira] [Updated] (HUDI-4937) Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT readers

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4937:
-
Sprint: 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 
2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, 
Sprint 2023-02-14, Sprint 2023-02-28  (was: 2022/10/04, 2022/10/18, 2022/11/01, 
2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 
0.13.0 Final Sprint 3, Sprint 2023-02-14)

> Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT 
> readers
> -
>
> Key: HUDI-4937
> URL: https://issues.apache.org/jira/browse/HUDI-4937
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Currently, `HoodieTable` is holding a `HoodieBackedTableMetadata` that is set 
> up not to reuse the actual LogScanner and HFileReader used to read the MT 
> itself.
> This has already proven wasteful on a number of occasions, including (not an 
> exhaustive list):
> https://github.com/apache/hudi/issues/6373





[jira] [Updated] (HUDI-3088) Make Spark 3 the default profile for build and test

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3088:
-
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, 
Hudi-Sprint-Feb-14, Hudi-Sprint-Feb-22, Hudi-Sprint-Mar-01, Hudi-Sprint-Mar-07, 
Hudi-Sprint-Mar-14, 2022/11/29, 2022/12/12, Sprint 2023-02-14, Sprint 
2023-02-28  (was: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, 
Hudi-Sprint-Feb-14, Hudi-Sprint-Feb-22, Hudi-Sprint-Mar-01, Hudi-Sprint-Mar-07, 
Hudi-Sprint-Mar-14, 2022/11/29, 2022/12/12, Sprint 2023-02-14)

> Make Spark 3 the default profile for build and test
> ---
>
> Key: HUDI-3088
> URL: https://issues.apache.org/jira/browse/HUDI-3088
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Raymond Xu
>Assignee: Rahil Chertara
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> By default, when people check out the code, Spark 3 should be the active 
> profile for the repo. All tests should also run against the latest supported 
> Spark version. Correspondingly, the default Scala version becomes 2.12 and 
> the default Parquet version becomes 1.12.





[jira] [Updated] (HUDI-5754) Add detailed description of GCS Incr, Proto Kafka, and Pulsar Sources in Deltastreamer page

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5754:
-
Sprint: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28  (was: 
Sprint 2023-01-31, Sprint 2023-02-14)

> Add detailed description of GCS Incr, Proto Kafka, and Pulsar Sources in 
> Deltastreamer page
> ---
>
> Key: HUDI-5754
> URL: https://issues.apache.org/jira/browse/HUDI-5754
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
>






[jira] [Updated] (HUDI-5752) Add feature docs for Change Data Capture

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5752:
-
Sprint: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28  (was: 
Sprint 2023-01-31, Sprint 2023-02-14)

> Add feature docs for Change Data Capture
> 
>
> Key: HUDI-5752
> URL: https://issues.apache.org/jira/browse/HUDI-5752
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.14.0
>
>






[jira] [Updated] (HUDI-5463) Apply rollback commits from data table as rollbacks in MDT instead of Delta commit

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5463:
-
Sprint: 0.13.0 Final Sprint, Sprint 2023-02-14, Sprint 2023-02-28  (was: 
0.13.0 Final Sprint, Sprint 2023-02-14)

> Apply rollback commits from data table as rollbacks in MDT instead of Delta 
> commit
> --
>
> Key: HUDI-5463
> URL: https://issues.apache.org/jira/browse/HUDI-5463
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.13.1
>
>
> As of now, any rollback in the data table (DT) is applied as another delta 
> commit (DC) in the metadata table (MDT). This may not scale for the 
> record-level index in the MDT, since we have to add 1000s of delete records 
> and finally resolve all valid and invalid records. So, it's better to roll 
> back the commit in the MDT as well instead of doing a DC. 
>  
> Impact: 
> The record-level index is unusable without this change. While fixing other 
> rollback-related tickets, do consider this as a possible option if it 
> simplifies other fixes. 





[jira] [Updated] (HUDI-83) Map Timestamp type in spark to corresponding Timestamp type in Hive during Hive sync

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-83?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-83:
---
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28  (was: 0.13.0 Final Sprint, 0.13.0 Final 
Sprint 2, Sprint 2023-01-31, Sprint 2023-02-14)

> Map Timestamp type in spark to corresponding Timestamp type in Hive during 
> Hive sync
> 
>
> Key: HUDI-83
> URL: https://issues.apache.org/jira/browse/HUDI-83
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive, meta-sync, Usability
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: cdmikechen
>Priority: Critical
>  Labels: pull-request-available, query-eng, sev:critical, 
> user-support-issues
> Fix For: 0.13.1
>
>
> [https://github.com/apache/incubator-hudi/issues/543] & related issues 





[jira] [Updated] (HUDI-5756) Add Consistent Hashing Index to Indexing docs

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5756:
-
Sprint: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28  (was: 
Sprint 2023-01-31, Sprint 2023-02-14)

> Add Consistent Hashing Index to Indexing docs
> -
>
> Key: HUDI-5756
> URL: https://issues.apache.org/jira/browse/HUDI-5756
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Blocker
>






[jira] [Updated] (HUDI-5520) Fail MDT when list of log files grows unboundedly

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5520:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28  (was: 0.13.0 Final Sprint, 0.13.0 Final 
Sprint 2, Sprint 2023-01-31, Sprint 2023-02-14)

> Fail MDT when list of log files grows unboundedly
> -
>
> Key: HUDI-5520
> URL: https://issues.apache.org/jira/browse/HUDI-5520
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.12.3
>
>






[jira] [Updated] (HUDI-5352) Jackson fails to serialize LocalDate when updating Delta Commit metadata

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5352:
-
Sprint: 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 
Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28  (was: 
2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, 
Sprint 2023-01-31, Sprint 2023-02-14)

> Jackson fails to serialize LocalDate when updating Delta Commit metadata
> 
>
> Key: HUDI-5352
> URL: https://issues.apache.org/jira/browse/HUDI-5352
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Alexey Kudinkin
>Assignee: Raymond Xu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Currently, running TestColumnStatsIndex on Spark 3.3 fails the MOR tests 
> because Jackson is not able to serialize LocalDate as is and requires an 
> additional JSR310 dependency.
>  
>  





[jira] [Updated] (HUDI-5665) Re-use table configs for subsequent writes

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5665:
-
Sprint: Sprint 2023-02-14, Sprint 2023-02-28  (was: Sprint 2023-02-14)

> Re-use table configs for subsequent writes
> --
>
> Key: HUDI-5665
> URL: https://issues.apache.org/jira/browse/HUDI-5665
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
>
> Currently, we expect users to set every table config along with every write 
> operation. For write configs this makes sense, but for table configs we 
> should be able to re-use the properties from the existing hoodie.properties. 
>  
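
As a rough illustration of the idea (not the implementation proposed in the ticket), the sketch below starts from table configs that were already persisted and layers the options of a single write on top. The class and method names are hypothetical, the table name `trips` is made up, and letting write options win on conflict is an assumption; a real change would likely need to validate conflicting table configs instead.

```java
import java.util.Properties;

// Hypothetical helper; not part of Hudi.
final class TableConfigReuseSketch {

  // Start from the persisted table configs (e.g. loaded from
  // hoodie.properties), then add the options supplied for this write.
  // Assumption: on a key conflict the write option wins.
  static Properties effectiveConfig(Properties persistedTableProps,
                                    Properties writeOptions) {
    Properties merged = new Properties();
    merged.putAll(persistedTableProps);
    merged.putAll(writeOptions);
    return merged;
  }

  public static void main(String[] args) {
    Properties persisted = new Properties();
    persisted.setProperty("hoodie.table.name", "trips");
    persisted.setProperty("hoodie.table.type", "MERGE_ON_READ");

    // The user passes only write configs for the subsequent write.
    Properties write = new Properties();
    write.setProperty("hoodie.datasource.write.operation", "upsert");

    Properties merged = effectiveConfig(persisted, write);
    System.out.println(merged.getProperty("hoodie.table.name"));
    System.out.println(merged.getProperty("hoodie.table.type"));
    System.out.println(merged.getProperty("hoodie.datasource.write.operation"));
  }
}
```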





[jira] [Updated] (HUDI-5751) Add feature docs for Metaserver

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5751:
-
Sprint: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28  (was: 
Sprint 2023-01-31, Sprint 2023-02-14)

> Add feature docs for Metaserver
> ---
>
> Key: HUDI-5751
> URL: https://issues.apache.org/jira/browse/HUDI-5751
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.14.0
>
>






[jira] [Updated] (HUDI-5753) Add feature docs for Record Payload

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5753:
-
Sprint: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28  (was: 
Sprint 2023-01-31, Sprint 2023-02-14)

> Add feature docs for Record Payload
> ---
>
> Key: HUDI-5753
> URL: https://issues.apache.org/jira/browse/HUDI-5753
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>






[jira] [Updated] (HUDI-5321) Fix Bulk Insert ColumnSortPartitioners

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5321:
-
Sprint: 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 
Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28  (was: 
2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, 
Sprint 2023-01-31, Sprint 2023-02-14)

> Fix Bulk Insert ColumnSortPartitioners
> --
>
> Key: HUDI-5321
> URL: https://issues.apache.org/jira/browse/HUDI-5321
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently, all of the custom Bulk Insert ColumnSortPartitioner impls 
> incorrectly return "true" from the "arePartitionRecordsSorted" method, even 
> though records might not necessarily be sorted by the partition-path columns, 
> as is required by this method.
> When such a Partitioner is used and the data is NOT sorted by a list of 
> columns that starts with the partition ones, this could lead to Parquet 
> writers being closed prematurely while writing files, creating a LOT of small 
> files.





[jira] [Updated] (HUDI-5853) Add infer function for BQ sync configs

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5853:
-
Sprint: Sprint 2023-02-14, Sprint 2023-02-28  (was: Sprint 2023-02-14)

> Add infer function for BQ sync configs
> --
>
> Key: HUDI-5853
> URL: https://issues.apache.org/jira/browse/HUDI-5853
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (HUDI-5677) [DOCS] Update AWS libs version

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5677:
-
Sprint: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28  (was: 
Sprint 2023-01-31, Sprint 2023-02-14)

> [DOCS] Update AWS libs version
> --
>
> Key: HUDI-5677
> URL: https://issues.apache.org/jira/browse/HUDI-5677
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: Sagar Sumit
>Priority: Major
>
> Update AWS libs version in https://hudi.apache.org/docs/s3_hoodie/





[jira] [Updated] (HUDI-5820) Improve Azure and GH CI's maven build with cache (3.9+)

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5820:
-
Sprint: Sprint 2023-02-14, Sprint 2023-02-28  (was: Sprint 2023-02-14)

> Improve Azure and GH CI's maven build with cache (3.9+)
> ---
>
> Key: HUDI-5820
> URL: https://issues.apache.org/jira/browse/HUDI-5820
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Raymond Xu
>Assignee: Jonathan Vexler
>Priority: Major
>
> Refer to PR https://github.com/apache/hudi/pull/7935
> For Azure, we can try downloading and installing maven 3.9 and using the 
> custom maven in the maven@4 task.
> For GH actions CI, more investigation is needed.





[jira] [Updated] (HUDI-5575) Support any record key generation along w/ any partition path generation for row writer

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5575:
-
Sprint: 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28  (was: 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14)

> Support any record key generation along w/ any partition path generation for 
> row writer
> ---
>
> Key: HUDI-5575
> URL: https://issues.apache.org/jira/browse/HUDI-5575
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Lokesh Jain
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> HUDI-5535 adds support for record key generation along w/ any partition path 
> generation. It also separates the record key generation and partition path 
> generation into separate interfaces.
> This jira aims to add similar support for the row writer path in spark.
> cc [~shivnarayan] 





[jira] [Updated] (HUDI-5498) Update docs for reading Hudi tables on Databricks runtime

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5498:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28  (was: 0.13.0 Final Sprint, 0.13.0 Final 
Sprint 2, Sprint 2023-01-31, Sprint 2023-02-14)

> Update docs for reading Hudi tables on Databricks runtime
> -
>
> Key: HUDI-5498
> URL: https://issues.apache.org/jira/browse/HUDI-5498
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.12.3
>
>
> We need to document how users can read Hudi tables on Databricks Spark 
> runtime. 
> Relevant fix: [https://github.com/apache/hudi/pull/7088]
>  





[jira] [Updated] (HUDI-3601) Support multi-arch builds in docker setup

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3601:
-
Sprint: 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 2022/11/01, 
2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 
0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28  
(was: 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 
2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 
Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14)

> Support multi-arch builds in docker setup
> -
>
> Key: HUDI-3601
> URL: https://issues.apache.org/jira/browse/HUDI-3601
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: dependencies
>Reporter: Sagar Sumit
>Priority: Critical
>  Labels: pull-request-available
>
> Refer to [https://github.com/apache/hudi/issues/4985]
> Essentially, our current docker demo runs on the linux/amd64 platform but not 
> on arm64. We should support multi-arch builds in a fully automated manner. 
> Ideally, the setup script would simply accept a parameter:
> {code:java}
> docker/setup_demo.sh --platform linux/arm64
> {code}





[jira] [Updated] (HUDI-3775) Allow for offline compaction of MOR tables via spark streaming

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3775:
-
Sprint: 2022/09/05, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 
Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28  (was: 
2022/09/05, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, 
Sprint 2023-01-31, Sprint 2023-02-14)

> Allow for offline compaction of MOR tables via spark streaming
> --
>
> Key: HUDI-3775
> URL: https://issues.apache.org/jira/browse/HUDI-3775
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, spark
>Reporter: Rajesh
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: easyfix, pull-request-available
> Fix For: 0.14.0
>
> Attachments: impressions.avro, run_stuff.txt, scala_commands.txt
>
>
> Currently, there is no way to keep compaction from taking up a lot of 
> resources when it runs inline or async for MOR tables via Spark Streaming. 
> Delta Streamer has ways to assign resources between ingestion and async 
> compaction, but Spark Streaming does not have that option. 
> Introducing a flag to turn off automatic compaction and allowing users to run 
> compaction in a separate process will decouple both concerns.
> This will also allow users to size the cluster just for ingestion and deal 
> with compaction separately, without blocking.  We will need to look into 
> documenting best practices for running offline compaction.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5510) The latest written commit is not used when getInstantsToArchive

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5510:
-
Sprint: 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28  (was: 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14)

> The latest written commit is not used when getInstantsToArchive
> ---
>
> Key: HUDI-5510
> URL: https://issues.apache.org/jira/browse/HUDI-5510
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: zouxxyy
>Assignee: Danny Chen
>Priority: Minor
>  Labels: pull-request-available
>






[jira] [Updated] (HUDI-3967) Automatic savepoint in Hudi

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3967:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, 
Sprint 2023-02-28  (was: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 
2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final 
Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 
2023-02-14)

> Automatic savepoint in Hudi
> ---
>
> Key: HUDI-3967
> URL: https://issues.apache.org/jira/browse/HUDI-3967
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: table-service
>Reporter: Raymond Xu
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-5757) Add Log Compaction to Write Operation docs

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5757:
-
Sprint: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28  (was: 
Sprint 2023-01-31, Sprint 2023-02-14)

> Add Log Compaction to Write Operation docs
> --
>
> Key: HUDI-5757
> URL: https://issues.apache.org/jira/browse/HUDI-5757
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.14.0
>
>






[jira] [Updated] (HUDI-5616) Docs update for specifying org.apache.spark.HoodieSparkKryoRegistrar

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5616:
-
Sprint: 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28  (was: 0.13.0 Final Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14)

> Docs update for specifying org.apache.spark.HoodieSparkKryoRegistrar
> 
>
> Key: HUDI-5616
> URL: https://issues.apache.org/jira/browse/HUDI-5616
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>
> A usability change in [this 
> PR|https://github.com/apache/hudi/pull/7702] requires Spark users to set a 
> new conf:
> --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar
> If this conf is not specified, there will be a performance hit (which was 
> actually always there).
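
For illustration, a minimal Python sketch of how the flag above could be assembled for a spark-submit or spark-shell command line. The helper name to_conf_args and the KRYO_CONFS dict are hypothetical. Pairing the registrar with spark.serializer=org.apache.spark.serializer.KryoSerializer is an assumption added here (the issue text only names spark.kryo.registrator); it reflects that Spark consults the registrator when Kryo serialization is in use.

```python
# Hypothetical helper: render Spark conf key/value pairs as --conf flags.
# Only spark.kryo.registrator comes from the issue text; spark.serializer
# is an assumption added so that Kryo serialization is actually enabled.
KRYO_CONFS = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.spark.HoodieSparkKryoRegistrar",
}


def to_conf_args(confs):
    """Return a flat argument list: ['--conf', 'key=value', ...]."""
    args = []
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the flags so they can be pasted after `spark-submit`.
    print(" ".join(to_conf_args(KRYO_CONFS)))
```

The resulting list can be appended to whatever launcher command is already in use; nothing here depends on a Spark installation.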





[jira] [Updated] (HUDI-5685) Fix performance gap in Bulk Insert row-writing path with enabled de-duplication

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5685:
-
Sprint: Sprint 2023-01-31, Sprint 2023-02-14, Sprint 2023-02-28  (was: 
Sprint 2023-01-31, Sprint 2023-02-14)

> Fix performance gap in Bulk Insert row-writing path with enabled 
> de-duplication
> ---
>
> Key: HUDI-5685
> URL: https://issues.apache.org/jira/browse/HUDI-5685
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently, when {{hoodie.combine.before.insert}} is set to true and 
> {{hoodie.bulkinsert.sort.mode}} is set to {{NONE}}, Bulk Insert Row 
> Writing performance degrades considerably for the following reasons:
>  * During de-duplication (within {{dedupRows}}), records in the incoming RDD 
> are reshuffled (by Spark's default {{HashPartitioner}}) based on 
> {{(partition-path, record-key)}} into N partitions
>  * When {{BulkInsertSortMode.NONE}} is used as the partitioner, no 
> re-partitioning is performed, so each Spark task might be 
> writing into M table partitions
>  * This in turn causes an explosion in the number of (small) files created, 
> hurting performance and the table's layout
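
The effect described in the bullets can be sketched with a small, self-contained Python simulation. It is not Hudi or Spark code: hash_partition is a deterministic CRC32 stand-in for Spark's HashPartitioner (not its actual hash function), the records and the values N = 8 and M = 10 are made up, and the "shuffle by partition path only" case is included purely for contrast, not as the fix adopted for this issue.

```python
import zlib


def hash_partition(key, num_partitions):
    """Deterministic stand-in for a hash partitioner (not Spark's real hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def count_task_partition_pairs(records, num_tasks, key_fn):
    """Count distinct (task, table partition) pairs.

    Each pair means that task writes at least one file into that
    table partition, so the count is a lower bound on files created.
    """
    pairs = set()
    for partition_path, record_key in records:
        task = hash_partition(key_fn(partition_path, record_key), num_tasks)
        pairs.add((task, partition_path))
    return len(pairs)


# Made-up input: M = 10 table partitions with 1000 records each.
records = [
    (f"dt=2023-03-{day:02d}", f"key-{day}-{i}")
    for day in range(1, 11)
    for i in range(1000)
]
num_tasks = 8  # N shuffle partitions

# Shuffle keyed by (partition-path, record-key), as in de-duplication:
# records of one table partition are scattered across many tasks.
files_after_dedup = count_task_partition_pairs(
    records, num_tasks, lambda path, key: f"{path}|{key}"
)

# Contrast case: shuffle keyed by partition path only, so every table
# partition is written by exactly one task.
files_by_partition_path = count_task_partition_pairs(
    records, num_tasks, lambda path, key: path
)

print("files when keyed by (partition-path, record-key):", files_after_dedup)
print("files when keyed by partition path only:", files_by_partition_path)
```

With the composite key the file count can approach N x M, while keying by partition path alone bounds it at M, which is the gap the issue describes.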





[jira] [Updated] (HUDI-5552) Too slow while using trino-hudi connector while querying partitioned tables.

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5552:
-
Sprint: 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 2023-01-31, 
Sprint 2023-02-14, Sprint 2023-02-28  (was: 0.13.0 Final Sprint 2, 0.13.0 Final 
Sprint 3, Sprint 2023-01-31, Sprint 2023-02-14)

> Too slow while using trino-hudi connector while querying partitioned tables.
> 
>
> Key: HUDI-5552
> URL: https://issues.apache.org/jira/browse/HUDI-5552
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: trino-presto
>Reporter: Danny Chen
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.14.0
>
>
> See the issue for details: [[SUPPORT] Too slow while using trino-hudi 
> connector while querying partitioned tables. · Issue #7643 · apache/hudi 
> (github.com)|https://github.com/apache/hudi/issues/7643]





[jira] [Updated] (HUDI-5423) Flaky test: ColumnStatsTestCase(MERGE_ON_READ,true,true)

2023-03-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5423:
-
Sprint: 0.13.0 Final Sprint, Sprint 2023-01-31, Sprint 2023-02-14, Sprint 
2023-02-28  (was: 0.13.0 Final Sprint, Sprint 2023-01-31, Sprint 2023-02-14)

> Flaky test: ColumnStatsTestCase(MERGE_ON_READ,true,true)
> 
>
> Key: HUDI-5423
> URL: https://issues.apache.org/jira/browse/HUDI-5423
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Raymond Xu
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.14.0
>
>
> {code}
> [ERROR] Tests run: 94, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 
> 1,729.267 s <<< FAILURE! - in JUnit Vintage
> [ERROR] [8] 
> ColumnStatsTestCase(MERGE_ON_READ,true,true)(testMetadataColumnStatsIndex(ColumnStatsTestCase))
>   Time elapsed: 23.246 s  <<< FAILURE!
> org.opentest4j.AssertionFailedError: 
> expected: 
> <{"c1_maxValue":101,"c1_minValue":101,"c1_nullCount":0,"c2_maxValue":" 
> 999sdc","c2_minValue":" 
> 999sdc","c2_nullCount":0,"c3_maxValue":10.329,"c3_minValue":10.329,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.179Z","c4_minValue":"2021-11-19T07:34:44.179Z","c4_nullCount":0,"c5_maxValue":99,"c5_minValue":99,"c5_nullCount":0,"c6_maxValue":"2020-03-28","c6_minValue":"2020-03-28","c6_nullCount":0,"c7_maxValue":"SA==","c7_minValue":"SA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":1}
> {"c1_maxValue":562,"c1_minValue":323,"c1_nullCount":0,"c2_maxValue":" 
> 984sdc","c2_minValue":" 
> 980sdc","c2_nullCount":0,"c3_maxValue":977.328,"c3_minValue":64.768,"c3_nullCount":1,"c4_maxValue":"2021-11-19T07:34:44.201Z","c4_minValue":"2021-11-19T07:34:44.181Z","c4_nullCount":0,"c5_maxValue":78,"c5_minValue":34,"c5_nullCount":0,"c6_maxValue":"2020-10-21","c6_minValue":"2020-01-15","c6_nullCount":0,"c7_maxValue":"SA==","c7_minValue":"qw==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":4}
> {"c1_maxValue":568,"c1_minValue":8,"c1_nullCount":0,"c2_maxValue":" 
> 8sdc","c2_minValue":" 
> 111sdc","c2_nullCount":0,"c3_maxValue":979.272,"c3_minValue":82.111,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.193Z","c4_minValue":"2021-11-19T07:34:44.159Z","c4_nullCount":0,"c5_maxValue":58,"c5_minValue":2,"c5_nullCount":0,"c6_maxValue":"2020-11-08","c6_minValue":"2020-01-01","c6_nullCount":0,"c7_maxValue":"9g==","c7_minValue":"Ag==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":15}
> {"c1_maxValue":619,"c1_minValue":619,"c1_nullCount":0,"c2_maxValue":" 
> 985sdc","c2_minValue":" 
> 985sdc","c2_nullCount":0,"c3_maxValue":230.320,"c3_minValue":230.320,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.180Z","c4_minValue":"2021-11-19T07:34:44.180Z","c4_nullCount":0,"c5_maxValue":33,"c5_minValue":33,"c5_nullCount":0,"c6_maxValue":"2020-02-13","c6_minValue":"2020-02-13","c6_nullCount":0,"c7_maxValue":"QA==","c7_minValue":"QA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":1}
> {"c1_maxValue":633,"c1_minValue":624,"c1_nullCount":0,"c2_maxValue":" 
> 987sdc","c2_minValue":" 
> 986sdc","c2_nullCount":0,"c3_maxValue":580.317,"c3_minValue":375.308,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.180Z","c4_minValue":"2021-11-19T07:34:44.180Z","c4_nullCount":0,"c5_maxValue":33,"c5_minValue":32,"c5_nullCount":0,"c6_maxValue":"2020-10-10","c6_minValue":"2020-01-01","c6_nullCount":0,"c7_maxValue":"PQ==","c7_minValue":"NA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":2}
> {"c1_maxValue":639,"c1_minValue":555,"c1_nullCount":0,"c2_maxValue":" 
> 989sdc","c2_minValue":" 
> 982sdc","c2_nullCount":0,"c3_maxValue":904.304,"c3_minValue":153.431,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.186Z","c4_minValue":"2021-11-19T07:34:44.179Z","c4_nullCount":0,"c5_maxValue":44,"c5_minValue":31,"c5_nullCount":0,"c6_maxValue":"2020-08-25","c6_minValue":"2020-03-12","c6_nullCount":0,"c7_maxValue":"MA==","c7_minValue":"rw==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":3}
> {"c1_maxValue":715,"c1_minValue":76,"c1_nullCount":0,"c2_maxValue":" 
> 76sdc","c2_minValue":" 
> 224sdc","c2_nullCount":0,"c3_maxValue":958.579,"c3_minValue":246.427,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.199Z","c4_minValue":"2021-11-19T07:34:44.166Z","c4_nullCount":0,"c5_maxValue":73,"c5_minValue":9,"c5_nullCount":0,"c6_maxValue":"2020-11-21","c6_minValue":"2020-01-16","c6_nullCount":0,"c7_maxValue":"+g==","c7_minValue":"LA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":12}
> {"c1_maxValue":768,"c1_minValue":59,"c1_nullCount":0,"c2_maxValue":" 
> 768sdc","c2_minValue":" 
> 118sdc","c2_nullCount":0,"c3_maxValu
