[jira] [Created] (HUDI-7800) Remove usages of instant time with HoodieRecordLocation

2024-05-25 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7800:
-

 Summary: Remove usages of instant time with HoodieRecordLocation
 Key: HUDI-7800
 URL: https://issues.apache.org/jira/browse/HUDI-7800
 Project: Apache Hudi
  Issue Type: Improvement
  Components: index
Reporter: sivabalan narayanan


HoodieRecordLocation has a reference to the instant time. Strictly speaking, the partition path and fileId are what matter, and the instant time should not. It is used in other places like the HBase index to account for rollbacks. At the very least, equals() in HoodieRecordLocation can stop accounting for the instant time.
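
As an illustration, an instant-time-agnostic equality could look roughly like the sketch below (a trimmed stand-in with assumed names, not the actual Hudi class):
{code:java}
// Trimmed stand-in for illustration only; the real class lives in
// org.apache.hudi.common.model and this is not its actual implementation.
public class RecordLocationSketch {
  private final String instantTime; // kept for bookkeeping (e.g. rollback handling in the HBase index)
  private final String fileId;

  public RecordLocationSketch(String instantTime, String fileId) {
    this.instantTime = instantTime;
    this.fileId = fileId;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof RecordLocationSketch)) {
      return false;
    }
    // instantTime is intentionally ignored: only the fileId decides equality.
    return fileId.equals(((RecordLocationSketch) o).fileId);
  }

  @Override
  public int hashCode() {
    return fileId.hashCode(); // must stay consistent with equals()
  }
}
{code}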

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-24 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7779:
--
Description: 
Archiving commits from the active timeline could lead to data consistency issues on rare occasions. We should come up with proper guards to ensure we do not perform such unintended archival.

 

The major gap we want to guard against is:

if someone disables the cleaner, archival should account for data consistency issues and ensure it bails out.

We have a base guarding condition, where archival will stop at the earliest commit to retain based on the latest clean commit metadata. But there are a few other scenarios that need to be accounted for.

 

a. Keeping replace commits aside, let's dive into the specifics for regular commits and delta commits.

Say the user configured the cleaner to retain 4 commits and the archival configs to 5 and 6. After t10, the cleaner is supposed to clean up all file versions created at or before t6. Say the cleaner did not run (for whatever reason) for the next 5 commits.

    Archival will certainly be guarded until the earliest commit to retain based on the latest clean commit metadata.

Corner case to consider: 

A savepoint was added at, say, t3 and later removed, and the cleaner was still never re-enabled. Even though archival would have been stopped at t3 (while the savepoint is present), once the savepoint is removed and archival is executed, it could archive commit t3. Which means the file versions tracked at t3 are still not cleaned up by the cleaner.

Reasoning: 

We are good here w.r.t. data consistency. Until the cleaner runs next, these older file versions might be exposed to the end user. But time travel queries are not intended for already cleaned up commits, so this is not an issue. None of snapshot, time travel or incremental queries will run into issues, as they are not supposed to poll for t3.

At any later point, if the cleaner is re-enabled, it will take care of cleaning up the file versions tracked at the t3 commit. Just that, for the interim period, some older file versions might still be exposed to readers.

 

b. The trickier part is when replace commits are involved. Since the replace commit metadata in the active timeline is what ensures the replaced file groups are ignored for reads, the cleaner is expected to clean them up fully before that metadata is archived. But are there chances that this could go wrong?

Corner case to consider: let's add onto the above scenario, where t3 has a savepoint and t4 is a replace commit which replaced file groups tracked in t3.

The cleaner will skip cleaning up the files tracked by t3 (due to the presence of the savepoint), but will clean up t4, t5 and t6. So the earliest commit to retain will be pointing to t6. Now say the savepoint for t3 is removed, but the cleaner is disabled. In this state of the timeline, if archival is executed (since t3.savepoint is removed), archival might archive t3 and t4.rc. This could lead to data duplicates, as both the replaced file groups and the new file groups from t4.rc would be exposed as valid file groups.

 

In other words, if we were to summarize the different scenarios: 

i. The replaced file group is never cleaned up.
    - ECTR (earliest commit to retain) is less than this.rc and we are good.
ii. The replaced file group is cleaned up.
    - ECTR is > this.rc and it is good to archive.
iii. Tricky: ECTR moved ahead of this.rc, but due to the savepoint, the full clean-up did not happen. After the savepoint is removed and archival is executed, we should avoid archiving the rc of interest. This is the gap we don't account for as of now (see the sketch below).
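
To make the missing guard concrete, a minimal sketch of the kind of check being discussed follows; every name in it (selectInstantsToArchive, replacedGroupsFullyCleaned, etc.) is an assumption for illustration, not an existing Hudi API:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical guard, not the actual Hudi archival code: stop archiving at the first
// replace commit whose replaced file groups are not yet fully cleaned, even when that
// commit is already older than the earliest commit to retain (ECTR).
public class ArchivalGuardSketch {

  public static List<String> selectInstantsToArchive(
      List<String> candidateInstants,                 // active timeline instants, oldest first
      String earliestCommitToRetain,                  // ECTR from the latest clean metadata
      Set<String> replaceCommits,                     // instants that are replace commits
      Predicate<String> replacedGroupsFullyCleaned) { // assumed lookup: are all replaced file groups deleted?
    List<String> toArchive = new ArrayList<>();
    for (String instant : candidateInstants) {
      if (instant.compareTo(earliestCommitToRetain) >= 0) {
        break; // base guard: never archive at or after ECTR
      }
      if (replaceCommits.contains(instant) && !replacedGroupsFullyCleaned.test(instant)) {
        break; // scenario iii: replaced file groups still on storage, keep this commit in the active timeline
      }
      toArchive.add(instant);
    }
    return toArchive;
  }
}
{code}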

 

We have 3 options to solve this.

Option A:

Let the savepoint deletion flow take care of cleaning up the files it is tracking.

Cons:

A savepoint's responsibility is not to remove data files, so from a single-responsibility standpoint this may not be right. Also, this clean-up might need to do what the clean planner would actually be doing, i.e. build the file system view, understand whether a file is already supposed to be cleaned up, and only then delete the files which are supposed to be cleaned up. For example, if a file group has only one file slice it should not be cleaned up, and other scenarios like that.

 

Option B:

Since archival is the one which might cause data consistency issues, why not have archival do the clean-up?

We need to account for concurrent cleans, failure and retry scenarios, etc. Also, we might need to build the file system view and then take a call on whether something needs to be cleaned up before archiving it.

Cons:

Again, the single-responsibility rule might be broken. It would be neat if the cleaner takes care of deleting data files and archival only takes care of deleting/archiving timeline files.

 

Option C:

Similar to how the cleaner maintains EarliestCommitToRetain, let the cleaner track another piece of metadata named "EarliestCommitToArchive". Strictly speaking, earliest commit to 


[jira] [Assigned] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-23 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7779:
-

Assignee: sivabalan narayanan

> Guarding archival to not archive unintended commits
> ---
>
> Key: HUDI-7779
> URL: https://issues.apache.org/jira/browse/HUDI-7779
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: archiving
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>


[jira] [Commented] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-23 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849139#comment-17849139
 ] 

sivabalan narayanan commented on HUDI-7779:
---

Hey Sagar,

     I updated the Jira description with more details. Can you check it out?

> Guarding archival to not archive unintended commits
> ---
>
> Key: HUDI-7779
> URL: https://issues.apache.org/jira/browse/HUDI-7779
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: archiving
>Reporter: sivabalan narayanan
>Priority: Major
>


[jira] [Created] (HUDI-7780) Avoid 0 record parquet files

2024-05-19 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7780:
-

 Summary: Avoid 0 record parquet files
 Key: HUDI-7780
 URL: https://issues.apache.org/jira/browse/HUDI-7780
 Project: Apache Hudi
  Issue Type: Improvement
  Components: writer-core
Reporter: sivabalan narayanan


There are occasions where Hudi could produce 0-record files.

For example:

a. The entire set of records is deleted in a log block and, due to small file handling, a new parquet file is created with HoodieMergeHandle.

b. During compaction, again there are chances that Hudi might produce 0-record parquet files.

We need to avoid such files if possible.
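
As an illustration of the kind of guard a write handle or compactor could apply at close time, here is a minimal sketch; the class and parameter names are hypothetical and this is not the actual HoodieMergeHandle or compaction code:
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical close-time guard, not actual Hudi code: if a handle ended up writing
// zero records, drop the empty parquet file instead of committing it.
public class ZeroRecordFileGuard {

  /** Returns true if the file should be kept; deletes it and returns false when it is empty. */
  public static boolean finalizeFile(Path filePath, long recordsWritten) throws IOException {
    if (recordsWritten == 0) {
      Files.deleteIfExists(filePath); // discard the empty file; the caller skips its write status
      return false;
    }
    return true;
  }
}
{code}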

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7779:
--
Description: 
Archiving commits from the active timeline could lead to data consistency issues on rare occasions. We should come up with proper guards to ensure we do not perform such unintended archival.

 

The major gap we want to guard against is:

if someone disables the cleaner, archival should account for data consistency issues and ensure it bails out.

We have a base guarding condition, where archival will stop at the earliest commit to retain based on the latest clean commit metadata. But there are a few other scenarios that need to be accounted for.

 

a. Keeping replace commits aside, let's dive into the specifics for regular commits and delta commits.

Say the user configured the cleaner to retain 4 commits and the archival configs to 5 and 6. After t10, the cleaner is supposed to clean up all file versions created at or before t6. Say the cleaner did not run (for whatever reason) for the next 5 commits.

    Archival will certainly be guarded until the earliest commit to retain based on the latest clean commit metadata.

Corner case to consider: 

A savepoint was added at, say, t3 and later removed, and the cleaner was still never re-enabled. Even though archival would have been stopped at t3 (while the savepoint is present), once the savepoint is removed and archival is executed, it could archive commit t3. Which means the file versions tracked at t3 are still not cleaned up by the cleaner.

Reasoning: 

We are good here w.r.t. data consistency. Until the cleaner runs next, these older file versions might be exposed to the end user. But time travel queries are not intended for already cleaned up commits, so this is not an issue. None of snapshot, time travel or incremental queries will run into issues, as they are not supposed to poll for t3.

At any later point, if the cleaner is re-enabled, it will take care of cleaning up the file versions tracked at the t3 commit.

 

b. The trickier part is when replace commits are involved. Since the replace commit metadata in the active timeline is what ensures the replaced file groups are ignored for reads, the cleaner is expected to clean them up fully before that metadata is archived. But are there chances that this could go wrong?

Corner case to consider: let's add onto the above scenario, where t3 has a savepoint and t4 is a replace commit which replaced file groups tracked in t3.

The cleaner will skip cleaning up the files tracked by t3, but will clean up t4, t5 and t6. So the earliest commit to retain will be pointing to t6. Now say the savepoint for t3 is removed, but the cleaner is disabled. In this state of the timeline, if archival is executed (since t3.savepoint is removed), archival might archive t3 and t4.rc. This could lead to data duplicates, as both the replaced file groups and the new file groups from t4.rc would be exposed as valid file groups.

 

In other words, if we were to summarize the different scenarios: 

i. The replaced file group is never cleaned up.
    - ECTR (earliest commit to retain) is less than this.rc and we are good.
ii. The replaced file group is cleaned up.
    - ECTR is > this.rc and it is good to archive.
iii. Tricky: ECTR moved ahead of this.rc, but due to the savepoint, the full clean-up did not happen. After the savepoint is removed and archival is executed, we should avoid archiving the rc of interest. This is the gap we don't account for as of now.

 

We have 2 options to solve this.

*Option A:*

Before the archiver archives any replace commit, let's explicitly check that all replaced file groups are fully deleted.

Cons: Might need FileSystemView polling, which might be costly.

*Option B:*

The cleaner also tracks an additional piece of metadata named "fully cleaned up file groups" at the end of clean planning and in the completed clean commit metadata.

So, instead of polling the FileSystemView (which might be costly), archival can check the clean commit metadata for the list of file groups and deduce whether all file groups replaced by X.rc are fully deleted.

Pros:

Since the clean planner polls the file system view anyway and already has all the file group info, no additional work might be required to deduce "fully cleaned up file groups". It just needs to add an additional piece of metadata; a minimal sketch of the check follows.
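
A minimal sketch of what the Option B check could look like, assuming the proposed "fully cleaned up file groups" set is available from the completed clean commit metadata (all names here are illustrative, not existing Hudi APIs):
{code:java}
import java.util.Set;

// Illustrative names only; "fully cleaned up file groups" is the proposed metadata,
// not something the clean commit metadata carries today.
public class ReplaceCommitArchivalCheck {

  /** True if every file group replaced by the given replace commit is already fully cleaned. */
  public static boolean safeToArchiveReplaceCommit(
      Set<String> fileGroupsReplacedByCommit,  // file group ids replaced by X.rc
      Set<String> fullyCleanedFileGroups) {    // proposed set written by the cleaner
    return fullyCleanedFileGroups.containsAll(fileGroupsReplacedByCommit);
  }
}
{code}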

 

 

 

 

 

 

 

 


[jira] [Created] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-18 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7779:
-

 Summary: Guarding archival to not archive unintended commits
 Key: HUDI-7779
 URL: https://issues.apache.org/jira/browse/HUDI-7779
 Project: Apache Hudi
  Issue Type: Bug
  Components: archiving
Reporter: sivabalan narayanan


Archiving commits from the active timeline could lead to data consistency issues on rare occasions. We should come up with proper guards to ensure we do not perform such unintended archival.

 

The major gap we want to guard against is:

if someone disables the cleaner, archival should account for data consistency issues and ensure it bails out.

We have a base guarding condition, where archival will stop at the earliest commit to retain based on the latest clean commit metadata. But there are a few other scenarios that need to be accounted for.

 

a. Keeping replace commits aside, let's dive into the specifics for regular commits and delta commits.

Say the user configured the cleaner to retain 4 commits and the archival configs to 5 and 6. After t10, the cleaner is supposed to clean up all file versions created at or before t6. Say the cleaner did not run (for whatever reason) for the next 5 commits.

    Archival will certainly be guarded until the earliest commit to retain based on the latest clean commit metadata.

Corner case to consider: 

A savepoint was added at, say, t3 and later removed, and the cleaner was still never re-enabled. Even though archival would have been stopped at t3 (while the savepoint is present), once the savepoint is removed and archival is executed, it could archive commit t3. Which means the file versions tracked at t3 are still not cleaned up by the cleaner.

Reasoning: 

We are good here w.r.t. data consistency. Until the cleaner runs next, these older file versions might be exposed to the end user. But time travel queries are not intended for already cleaned up commits, so this is not an issue. None of snapshot, time travel or incremental queries will run into issues, as they are not supposed to poll for t3.

At any later point, if the cleaner is re-enabled, it will take care of cleaning up the file versions tracked at the t3 commit.

 

b. The trickier part is when replace commits are involved. Since the replace commit metadata in the active timeline is what ensures the replaced file groups are ignored for reads, the cleaner is expected to clean them up fully before that metadata is archived. But are there chances that this could go wrong?

Corner case to consider: let's add onto the above scenario, where t3 has a savepoint and t4 is a replace commit which replaced file groups tracked in t3.

The cleaner will skip cleaning up the files tracked by t3, but will clean up t4, t5 and t6. So the earliest commit to retain will be pointing to t6. Now say the savepoint for t3 is removed, but the cleaner is disabled. In this state of the timeline, if archival is executed (since t3.savepoint is removed), archival might archive t3 and t4.rc. This could lead to data duplicates, as both the replaced file groups and the new file groups from t4.rc would be exposed as valid file groups.

 

In other words, if we were to summarize the different scenarios: 


i. The replaced file group is never cleaned up.
    - ECTR (earliest commit to retain) is less than this.rc and we are good.
ii. The replaced file group is cleaned up.
    - ECTR is > this.rc and it is good to archive.
iii. Tricky: ECTR moved ahead of this.rc, but due to the savepoint, the full clean-up did not happen. After the savepoint is removed and archival is executed, we should avoid archiving the rc of interest. This is the gap we don't account for as of now.

 

We have 2 options to solve this.

*Option A:*

Before the archiver archives any replace commit, let's explicitly check that all replaced file groups are fully deleted.

Cons: Might need FileSystemView polling, which might be costly.

*Option B:*

The cleaner also tracks an additional piece of metadata named "fully cleaned up file groups" at the end of clean planning and in the completed clean commit metadata.

So, instead of polling the FileSystemView (which might be costly), archival can check the clean commit metadata for the list of file groups and deduce whether all file groups replaced by X.rc are fully deleted.

Pros:

Since the clean planner polls the file system view anyway and already has all the file group info, no additional work might be required to deduce "fully cleaned up file groups". It just needs to add an additional piece of metadata.

 

 

 

 

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7778) Duplicate Key exception with RLI

2024-05-17 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7778:
-

 Summary: Duplicate Key exception with RLI 
 Key: HUDI-7778
 URL: https://issues.apache.org/jira/browse/HUDI-7778
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata
Reporter: sivabalan narayanan


We are occasionally hitting an exception like the one below, meaning two records were ingested into RLI for the same record key from the data table. This is not expected to happen.

 
{code:java}
Caused by: org.apache.hudi.exception.HoodieAppendException: Failed while 
appending records to 
file:/var/folders/ym/8yjkm3n90kq8tk4gfmvk7y14gn/T/junit2792173348364470678/.hoodie/metadata/record_index/.record-index-0009-0_00011.log.3_3-275-476
   at 
org.apache.hudi.io.HoodieAppendHandle.appendDataAndDeleteBlocks(HoodieAppendHandle.java:475)
 at org.apache.hudi.io.HoodieAppendHandle.doAppend(HoodieAppendHandle.java:439) 
 at 
org.apache.hudi.table.action.deltacommit.BaseSparkDeltaCommitActionExecutor.handleUpdate(BaseSparkDeltaCommitActionExecutor.java:90)
 at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:355)
  ... 28 more
Caused by: org.apache.hudi.exception.HoodieException: Writing multiple records with same key 1 not supported for org.apache.hudi.common.table.log.block.HoodieHFileDataBlock
 at 
org.apache.hudi.common.table.log.block.HoodieHFileDataBlock.serializeRecords(HoodieHFileDataBlock.java:146)
  at 
org.apache.hudi.common.table.log.block.HoodieDataBlock.getContentBytes(HoodieDataBlock.java:121)
 at 
org.apache.hudi.common.table.log.HoodieLogFormatWriter.appendBlocks(HoodieLogFormatWriter.java:166)
  at 
org.apache.hudi.io.HoodieAppendHandle.appendDataAndDeleteBlocks(HoodieAppendHandle.java:467)
 ... 31 more
Driver stacktrace:
51301 [main] INFO  org.apache.spark.scheduler.DAGScheduler [] - Job 78 failed: collect at HoodieJavaRDD.java:177, took 0.245313 s
51303 [main] INFO  org.apache.hudi.client.BaseHoodieClient [] - Stopping Timeline service !!
51303 [main] INFO  org.apache.hudi.client.embedded.EmbeddedTimelineService [] - Closing Timeline server
51303 [main] INFO  org.apache.hudi.timeline.service.TimelineService [] - Closing Timeline Service
51321 [main] INFO  org.apache.hudi.timeline.service.TimelineService [] - Closed Timeline Service
51321 [main] INFO  org.apache.hudi.client.embedded.EmbeddedTimelineService [] - Closed Timeline server
org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit 
time 197001012
at 
org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:80)
   at 
org.apache.hudi.table.action.deltacommit.SparkUpsertDeltaCommitActionExecutor.execute(SparkUpsertDeltaCommitActionExecutor.java:47)
  at 
org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsert(HoodieSparkMergeOnReadTable.java:98)
at 
org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsert(HoodieSparkMergeOnReadTable.java:88)
at 
org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:156) 
 at 
org.apache.hudi.functional.TestGlobalIndexEnableUpdatePartitions.testUdpateSubsetOfRecUpdates(TestGlobalIndexEnableUpdatePartitions.java:225)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)   
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
   at 
org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
  at 
org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
 at 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
   at 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
 at 
org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
 at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
 at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
 at 

[jira] [Assigned] (HUDI-7778) Duplicate Key exception with RLI

2024-05-17 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7778:
-

Assignee: sivabalan narayanan

> Duplicate Key exception with RLI 
> -
>
> Key: HUDI-7778
> URL: https://issues.apache.org/jira/browse/HUDI-7778
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>

[jira] [Assigned] (HUDI-7771) Make default hoodie record payload as OverwriteWithLatestPayload for 0.15.0

2024-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7771:
-

Assignee: sivabalan narayanan

> Make default hoodie record payload as OverwriteWithLatestPayload for 0.15.0
> ---
>
> Key: HUDI-7771
> URL: https://issues.apache.org/jira/browse/HUDI-7771
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> We made "DefaultHoodieRecordPayload" the default for 1.x, but let's keep it as
> OverwriteWithLatestAvroPayload for 0.15.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7771) Make default hoodie record payload as OverwriteWithLatestPayload for 0.15.0

2024-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7771:
--
Fix Version/s: 0.15.0

> Make default hoodie record payload as OverwriteWithLatestPayload for 0.15.0
> ---
>
> Key: HUDI-7771
> URL: https://issues.apache.org/jira/browse/HUDI-7771
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.15.0
>
>
> We made "DefaultHoodieRecordPayload" as default for 1.x. but lets keep it as 
> OverwriteWithLatestAvroPayload for 0.15.10 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7771) Make default hoodie record payload as OverwriteWithLatestPayload for 0.15.0

2024-05-15 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7771:
-

 Summary: Make default hoodie record payload as 
OverwriteWithLatestPayload for 0.15.0
 Key: HUDI-7771
 URL: https://issues.apache.org/jira/browse/HUDI-7771
 Project: Apache Hudi
  Issue Type: Improvement
  Components: writer-core
Reporter: sivabalan narayanan


We made "DefaultHoodieRecordPayload" as default for 1.x. but lets keep it as 
OverwriteWithLatestAvroPayload for 0.15.10 
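
For pipelines that should not depend on whichever default a given release ships, a minimal sketch (assuming the standard Spark datasource option name for the payload class) is to pin it explicitly on every write:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class PinPayloadClassExample {
  // Sketch only: pins OverwriteWithLatestAvroPayload regardless of the release default.
  static void writePinned(Dataset<Row> df, String basePath) {
    df.write()
      .format("hudi")
      .option("hoodie.table.name", "my_table")
      .option("hoodie.datasource.write.payload.class",
          "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload")
      .mode(SaveMode.Append)
      .save(basePath);
  }
}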



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7768) Fix failing tests for 0.15.0 release (async compaction and metadata num commits check)

2024-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7768:
-

Assignee: sivabalan narayanan

> Fix failing tests for 0.15.0 release (async compaction and metadata num 
> commits check)
> --
>
> Key: HUDI-7768
> URL: https://issues.apache.org/jira/browse/HUDI-7768
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
>  
>  
> [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=23953=logs=600e7de6-e133-5e69-e615-50ee129b3c08=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7]
> [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=23953=logs=7601efb9-4019-552e-11ba-eb31b66593b2=d4b4e11d-8e26-50e5-a0d9-bb2d5decfeb9]
>
> org.apache.hudi.exception.HoodieMetadataException: Metadata table's deltacommits exceeded 3: this is likely caused by a pending instant in the data table. Resolve the pending instant or adjust `hoodie.metadata.max.deltacommits.when_pending`, then restart the pipeline.
>   at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.checkNumDeltaCommits(HoodieBackedTableMetadataWriter.java:835)
>   at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.validateTimelineBeforeSchedulingCompaction(HoodieBackedTableMetadataWriter.java:1367)
>
> java.lang.IllegalArgumentException: Following instants have timestamps >= compactionInstant (002) Instants :[[004__deltacommit__COMPLETED__20240515123806398]]
>   at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42)
>   at org.apache.hudi.table.action.compact.ScheduleCompactionActionExecutor.execute(ScheduleCompactionActionExecutor.java:108)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7768) Fix failing tests for 0.15.0 release (async compaction and metadata num commits check)

2024-05-15 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7768:
-

 Summary: Fix failing tests for 0.15.0 release (async compaction 
and metadata num commits check)
 Key: HUDI-7768
 URL: https://issues.apache.org/jira/browse/HUDI-7768
 Project: Apache Hudi
  Issue Type: Improvement
  Components: tests-ci
Reporter: sivabalan narayanan


 

 
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=23953=logs=600e7de6-e133-5e69-e615-50ee129b3c08=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7]
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=23953=logs=7601efb9-4019-552e-11ba-eb31b66593b2=d4b4e11d-8e26-50e5-a0d9-bb2d5decfeb9]

org.apache.hudi.exception.HoodieMetadataException: Metadata table's deltacommits exceeded 3: this is likely caused by a pending instant in the data table. Resolve the pending instant or adjust `hoodie.metadata.max.deltacommits.when_pending`, then restart the pipeline.
  at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.checkNumDeltaCommits(HoodieBackedTableMetadataWriter.java:835)
  at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.validateTimelineBeforeSchedulingCompaction(HoodieBackedTableMetadataWriter.java:1367)

java.lang.IllegalArgumentException: Following instants have timestamps >= compactionInstant (002) Instants :[[004__deltacommit__COMPLETED__20240515123806398]]
  at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42)
  at org.apache.hudi.table.action.compact.ScheduleCompactionActionExecutor.execute(ScheduleCompactionActionExecutor.java:108)
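
For reference, a minimal sketch of how a test might relax the guard quoted above; the config key comes straight from the error message, while the numeric value is only illustrative:

import java.util.Properties;

public class MetadataDeltaCommitGuard {
  // Sketch only: a test that intentionally leaves a pending instant on the data table
  // may need a higher threshold than the default of 3 seen in the failure above.
  static Properties relaxedMetadataGuard() {
    Properties props = new Properties();
    props.setProperty("hoodie.metadata.max.deltacommits.when_pending", "1000");
    return props;
  }
}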



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7756) Audit all base file readers and replace w/ file slice readers

2024-05-14 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7756:
-

 Summary: Audit all base file readers and replace w/ file slice 
readers 
 Key: HUDI-7756
 URL: https://issues.apache.org/jira/browse/HUDI-7756
 Project: Apache Hudi
  Issue Type: Improvement
  Components: reader-core
Reporter: sivabalan narayanan


If the file slice reader is as performant as a base file reader when there are no 
log files, we should replace all base file readers with file slice readers.

That way we unify both the COW and MOR code paths.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7716) Add more logs around index lookup

2024-05-06 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7716:
-

 Summary: Add more logs around index lookup
 Key: HUDI-7716
 URL: https://issues.apache.org/jira/browse/HUDI-7716
 Project: Apache Hudi
  Issue Type: Improvement
  Components: index
Reporter: sivabalan narayanan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7712) Account for file slices instead of just base files while initializing RLI for MOR table

2024-05-05 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7712:
-

 Summary: Account for file slices instead of just base files while 
initializing RLI for MOR table
 Key: HUDI-7712
 URL: https://issues.apache.org/jira/browse/HUDI-7712
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata
Reporter: sivabalan narayanan


We could have deletes in log files, and hence we need to account for the entire 
file slice, instead of just base files, while initializing RLI for a MOR table.
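
A minimal sketch of the intended change (the helper inputs below are hypothetical, not Hudi APIs): the keys used to seed RLI come from the merged view of the latest file slice, so deletes that only exist in log files are honored.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RliBootstrapSketch {
  // Merges base-file keys with keys added and removed via log files for one file slice.
  static Set<String> liveKeysForSlice(List<String> baseFileKeys,
                                      List<String> logFileInsertKeys,
                                      List<String> logFileDeleteKeys) {
    Set<String> live = new HashSet<>(baseFileKeys);
    live.addAll(logFileInsertKeys);    // records that only exist in log files
    live.removeAll(logFileDeleteKeys); // deletes that only exist in log files
    return live;
  }
}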



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7673) Enhance RLI validation w/ MDT validator for false positives

2024-05-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7673:
-

Assignee: sivabalan narayanan

> Enhance RLI validation w/ MDT validator for false positives
> ---
>
> Key: HUDI-7673
> URL: https://issues.apache.org/jira/browse/HUDI-7673
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> There is a chance that we could see false positive failures w/ MDT validation 
> when RLI is validated. 
>  
> When FS based record key locations are polled, we could have a pending
> commit, and when MDT is polled for record locations, the commit could have
> been completed. And so, RLI validation could return additional record
> locations.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7687) Instant should not be archived until replaced file groups or older file versions are deleted

2024-04-29 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7687:
-

Assignee: sivabalan narayanan

> Instant should not be archived until replaced file groups or older file 
> versions are deleted
> 
>
> Key: HUDI-7687
> URL: https://issues.apache.org/jira/browse/HUDI-7687
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Krishen Bhan
>Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: archive, clean
>
> When archival runs it may consider an instant as a candidate for archival 
> even if the file groups said instant replaced/updated still need to undergo a 
> `clean`. For example, consider the following scenario with clean and archived 
> scheduled/executed independently in different jobs
>  # Insert at C1 creates file group f1 in partition
>  # Replacecommit at RC2 creates file group f2 in partition, and replaces f1
>  # Any reader of partition that calls HUDI API (with or without using MDT) 
> will recognize that f1 should be ignored, as it has been replaced. This is 
> since RC2 instant file is in active timeline
>  # Some more instants are added to timeline. RC2 is now eligible to be 
> cleaned (as per the table writers' clean policy). Assume though that the file 
> groups replaced by RC2 haven't been deleted yet, such as due to clean 
> repeatedly failing, async clean not being scheduled yet, or the clean failing 
> to delete said file groups.
>  # An archive job eventually is triggered, and archives C1 and RC2. Note that 
> f1 is still in partition
> Now the table has the same consistency issue as seen in 
> https://issues.apache.org/jira/browse/HUDI-7655 , where replaced file groups 
> are still in partition and readers may see inconsistent data. 
>  
> This situation can be avoided by ensuring that archival will "block" and not 
> go past an older instant time if it sees that said instant hasn't undergone a 
> clean yet. 
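
A rough sketch of that guard (instant times are treated as plain sortable strings; the inputs are placeholders, not Hudi APIs):

import java.util.List;

public class ArchivalCleanGuardSketch {
  // Returns the first candidate that the last completed clean has not moved past yet;
  // archival should stop before this instant instead of archiving it.
  static String firstBlockedInstant(List<String> sortedCandidates,
                                    String earliestCommitToRetainFromLastClean) {
    for (String instant : sortedCandidates) {
      if (instant.compareTo(earliestCommitToRetainFromLastClean) >= 0) {
        return instant;
      }
    }
    return null; // every candidate is already covered by a completed clean
  }
}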



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7655) Support configuration for clean to fail execution if there is at least one file is marked as a failed delete

2024-04-29 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7655:
-

Assignee: sivabalan narayanan

> Support configuration for clean to fail execution if there is at least one 
> file is marked as a failed delete
> 
>
> Key: HUDI-7655
> URL: https://issues.apache.org/jira/browse/HUDI-7655
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Krishen Bhan
>Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: clean
>
> When a HUDI clean plan is executed, any targeted file that was not confirmed 
> as deleted (or non-existing) will be marked as a "failed delete". Although 
> these failed deletes will be added to `.clean` metadata, if incremental clean 
> is used then these files might not ever be picked up again as a future clean 
> plan, unless a "full-scan" clean ends up being scheduled. In addition to 
> leading to more files unnecessarily taking up storage space for longer, this 
> can lead to the following dataset consistency issue for COW datasets:
>  # Insert at C1 creates file group f1 in partition
>  # Replacecommit at RC2 creates file group f2 in partition, and replaces f1
>  # Any reader of partition that calls HUDI API (with or without using MDT) 
> will recognize that f1 should be ignored, as it has been replaced. This is 
> since RC2 instant file is in active timeline
>  # Some completed instants later an incremental clean is scheduled. It moves 
> the "earliest commit to retain" to an time after instant time RC2, so it 
> targets f1 for deletion. But during execution of the plan, it fails to delete 
> f1.
>  # An archive job eventually is triggered, and archives C1 and RC2. Note that 
> f1 is still in partition
> At this point, any job/query that reads the aforementioned partition directly 
> from the DFS file system calls (without directly using MDT FILES partition) 
> will consider both f1 and f2 as valid file groups, since RC2 is no longer in 
> active timeline. This is a data consistency issue, and will only be resolved 
> if a "full-scan" clean is triggered and deletes f1.
> This specific scenario can be avoided if the user can configure HUDI clean to 
> fail execution of a clean plan unless all files are confirmed as deleted (or 
> not existing in DFS already), "blocking" the clean. The next clean attempt 
> will re-execute this existing plan, since clean plans cannot be "rolled 
> back". 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7673) Enhance RLI validation w/ MDT validator for false positives

2024-04-25 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7673:
-

 Summary: Enhance RLI validation w/ MDT validator for false 
positives
 Key: HUDI-7673
 URL: https://issues.apache.org/jira/browse/HUDI-7673
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: sivabalan narayanan


There is a chance that we could see false positive failures w/ MDT validation 
when RLI is validated. 

 

When FS based record key locations are polled, we could have a pending commit, 
and when MDT is polled for record locations, the commit could have been 
completed. And so, RLI validation could return additional record locations.
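
A minimal sketch of how the validator could tolerate that race (inputs are placeholders): an MDT-only key counts as a mismatch only when its commit did not complete inside the lookup window.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RliValidatorRaceSketch {
  static Set<String> trueMismatches(Map<String, String> mdtKeyToCommitTime,
                                    Set<String> keysFromFsListing,
                                    Set<String> commitsCompletedAfterFsSnapshot) {
    Set<String> mismatches = new HashSet<>();
    for (Map.Entry<String, String> entry : mdtKeyToCommitTime.entrySet()) {
      boolean extraInMdt = !keysFromFsListing.contains(entry.getKey());
      boolean explainedByRace = commitsCompletedAfterFsSnapshot.contains(entry.getValue());
      if (extraInMdt && !explainedByRace) {
        mismatches.add(entry.getKey()); // extra key not explained by the race window
      }
    }
    return mismatches;
  }
}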

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7659) Update 0.14.0 release docs to call out that row writer w/ clustering is enabled by default

2024-04-23 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7659:
-

 Summary: Update 0.14.0 release docs to call out that row writer w/ 
clustering is enabled by default
 Key: HUDI-7659
 URL: https://issues.apache.org/jira/browse/HUDI-7659
 Project: Apache Hudi
  Issue Type: Improvement
  Components: docs
Reporter: sivabalan narayanan


Update 0.14.0 release docs to call out that row writer w/ clustering is enabled 
by default

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7645) Optimize BQ sync tool for MDT

2024-04-20 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7645:
-

 Summary: Optimize BQ sync tool for MDT
 Key: HUDI-7645
 URL: https://issues.apache.org/jira/browse/HUDI-7645
 Project: Apache Hudi
  Issue Type: Improvement
  Components: meta-sync
Reporter: sivabalan narayanan


Looks like in BQ sync, we are polling the file system view for the latest files 
sequentially for every partition. 

 

When MDT is enabled, we could load all partitions in one call. 
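
A sketch of the intended shape of the optimisation (the view interface below is a placeholder, not the actual Hudi file system view API):

import java.util.List;
import java.util.Map;

public class BqSyncBatchLookupSketch {
  interface FileListingView {
    // One bulk lookup for all partitions, which MDT can serve with a single call.
    Map<String, List<String>> latestBaseFilesForPartitions(List<String> partitionPaths);
  }

  static Map<String, List<String>> latestFiles(FileListingView view, List<String> allPartitions) {
    return view.latestBaseFilesForPartitions(allPartitions);
  }
}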

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7644) Add record key info with RLI validation in MDT Validator

2024-04-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7644:
--
Fix Version/s: 1.0.0

> Add record key info with RLI validation in MDT Validator
> 
>
> Key: HUDI-7644
> URL: https://issues.apache.org/jira/browse/HUDI-7644
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata, tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>
> Add record key info with RLI validation in MDT Validator



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7644) Add record key info with RLI validation in MDT Validator

2024-04-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7644:
--
Fix Version/s: 0.15.0

> Add record key info with RLI validation in MDT Validator
> 
>
> Key: HUDI-7644
> URL: https://issues.apache.org/jira/browse/HUDI-7644
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata, tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.15.0
>
>
> Add record key info with RLI validation in MDT Validator



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7644) Add record key info with RLI validation in MDT Validator

2024-04-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7644:
-

Assignee: sivabalan narayanan

> Add record key info with RLI validation in MDT Validator
> 
>
> Key: HUDI-7644
> URL: https://issues.apache.org/jira/browse/HUDI-7644
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata, tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> Add record key info with RLI validation in MDT Validator



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7644) Add record key info with RLI validation in MDT Validator

2024-04-20 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7644:
-

 Summary: Add record key info with RLI validation in MDT Validator
 Key: HUDI-7644
 URL: https://issues.apache.org/jira/browse/HUDI-7644
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata, tests-ci
Reporter: sivabalan narayanan


Add record key info with RLI validation in MDT Validator



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7641) Add metrics to track what partitions are enabled in MDT

2024-04-19 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7641:
--
Fix Version/s: 0.15.0

> Add metrics to track what partitions are enabled in MDT
> ---
>
> Key: HUDI-7641
> URL: https://issues.apache.org/jira/browse/HUDI-7641
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7641) Add metrics to track what partitions are enabled in MDT

2024-04-19 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7641:
-

Assignee: sivabalan narayanan

> Add metrics to track what partitions are enabled in MDT
> ---
>
> Key: HUDI-7641
> URL: https://issues.apache.org/jira/browse/HUDI-7641
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7641) Add metrics to track what partitions are enabled in MDT

2024-04-18 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7641:
-

 Summary: Add metrics to track what partitions are enabled in MDT
 Key: HUDI-7641
 URL: https://issues.apache.org/jira/browse/HUDI-7641
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: sivabalan narayanan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7624) Fix index lookup duration to track tag location duration

2024-04-16 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7624:
-

 Summary: Fix index lookup duration to track tag location duration
 Key: HUDI-7624
 URL: https://issues.apache.org/jira/browse/HUDI-7624
 Project: Apache Hudi
  Issue Type: Bug
  Components: index
Reporter: sivabalan narayanan


With Spark's lazy evaluation, we can't just start a timer before the tagLocation 
call and stop it afterwards; that may not give us the right value for the tag 
location duration. So, we need to fix how the duration is measured.
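
A minimal sketch of the idea using only the Spark API: force the tagged RDD to materialise before the timer is stopped, so the measured duration actually covers the index lookup. In practice the RDD would need to be persisted so the extra action does not recompute the lookup.

import org.apache.spark.api.java.JavaRDD;

public class TagLocationTimingSketch {
  static <T> long timeTaggedRecords(JavaRDD<T> taggedRecords) {
    long start = System.currentTimeMillis();
    long count = taggedRecords.count(); // action: triggers the lazy tagLocation work
    long durationMs = System.currentTimeMillis() - start;
    System.out.println("Tagged " + count + " records in " + durationMs + " ms");
    return durationMs;
  }
}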



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-04-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7507:
--
Fix Version/s: 1.0.0

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Krishen Bhan
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
> Attachments: Flowchart (1).png, Flowchart.png
>
>
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x) , Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is ingestion commit and Job 1 is compaction/log compaction, then 
> when Job 1 runs before Job 2 and can create a compaction plan for all instant 
> times (up to (x) ) that doesn’t include instant time (x-1) .  Later Job 2 
> will create instant time (x-1), but timeline will be in a corrupted state 
> since compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job2 to be "skipped" by clean.
> [Edit] I added a diagram to visualize the issue, specifically the second 
> scenario with clean
> !Flowchart (1).png!
>  
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp ( that is 
> greater than C)
>  # Release table lock
> Unfortunately this has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan, even 
> if its an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this
> -An alternate approach (suggested by- [~pwason] -) was to instead have all 
> operations including table services perform conflict resolution checks before 
> committing. For example, clean and compaction would generate their plan as 
> usual. But when creating a transaction to write a .requested file, right 
> before creating the file they should check if another lower timestamp instant 
> has appeared in the timeline. And if so, they should fail/abort without 
> creating the plan. Commit operations would also be updated/verified to have 
> similar check, before creating a .requested file (during a transaction) the 
> commit operation will check if a table service plan (clean/compact) with a 
> greater instant time has been created. And if so, would abort/fail. This 
> avoids the drawbacks of the first approach, but will lead to more transient 
> failures that users have to handle.-
>  
> An alternate approach is to have every operation abort creating a .requested 
> file unless it has the latest timestamp. Specifically, for any instant type, 
> whenever an operation is about to create a .requested plan on timeline, it 
> should take the table lock and assert that there are no other instants on 
> timeline (inflight or otherwise) that are greater than it. If that assertion 
> fails, then throw a retry-able conflict resolution exception.
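
An abstract sketch of the first proposal, with a plain JVM lock standing in for the table-level lock provider:

import java.util.concurrent.locks.ReentrantLock;

public class ScheduleUnderLockSketch {
  private static final ReentrantLock TABLE_LOCK = new ReentrantLock(); // stand-in for the table lock

  // Generates the instant time and "creates" the requested plan in one critical section,
  // so no concurrently scheduled instant can end up with a smaller timestamp afterwards.
  static String scheduleInstant(long latestInstantOnTimeline) {
    TABLE_LOCK.lock();
    try {
      String newInstantTime = String.valueOf(latestInstantOnTimeline + 1);
      // ... compute the plan and write <newInstantTime>.requested here, still under the lock ...
      return newInstantTime;
    } finally {
      TABLE_LOCK.unlock();
    }
  }
}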



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-04-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7507:
--
Fix Version/s: 0.15.0

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Krishen Bhan
>Priority: Major
> Fix For: 0.15.0
>
> Attachments: Flowchart (1).png, Flowchart.png
>
>
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x) , Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is ingestion commit and Job 1 is compaction/log compaction, then 
> when Job 1 runs before Job 2 and can create a compaction plan for all instant 
> times (up to (x) ) that doesn’t include instant time (x-1) .  Later Job 2 
> will create instant time (x-1), but timeline will be in a corrupted state 
> since compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job2 to be "skipped" by clean.
> [Edit] I added a diagram to visualize the issue, specifically the second 
> scenario with clean
> !Flowchart (1).png!
>  
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp ( that is 
> greater than C)
>  # Release table lock
> Unfortunately this has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan, even 
> if its an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this
> -An alternate approach (suggested by- [~pwason] -) was to instead have all 
> operations including table services perform conflict resolution checks before 
> committing. For example, clean and compaction would generate their plan as 
> usual. But when creating a transaction to write a .requested file, right 
> before creating the file they should check if another lower timestamp instant 
> has appeared in the timeline. And if so, they should fail/abort without 
> creating the plan. Commit operations would also be updated/verified to have 
> similar check, before creating a .requested file (during a transaction) the 
> commit operation will check if a table service plan (clean/compact) with a 
> greater instant time has been created. And if so, would abort/fail. This 
> avoids the drawbacks of the first approach, but will lead to more transient 
> failures that users have to handle.-
>  
> An alternate approach is to have every operation abort creating a .requested 
> file unless it has the latest timestamp. Specifically, for any instant type, 
> whenever an operation is about to create a .requested plan on timeline, it 
> should take the table lock and assert that there are no other instants on 
> timeline (inflight or otherwise) that are greater than it. If that assertion 
> fails, then throw a retry-able conflict resolution exception.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4699) Primary key-less data model

2024-04-01 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan resolved HUDI-4699.
---

> Primary key-less data model
> ---
>
> Key: HUDI-4699
> URL: https://issues.apache.org/jira/browse/HUDI-4699
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: writer-core
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
>
> Hudi requires users to specify a primary key field. Can we do away with this 
> requirement? This epic tracks the work to support use cases which does not 
> require primary key based data modelling.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-4699) Primary key-less data model

2024-04-01 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-4699.
-
Fix Version/s: 0.14.0
   Resolution: Fixed

> Primary key-less data model
> ---
>
> Key: HUDI-4699
> URL: https://issues.apache.org/jira/browse/HUDI-4699
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: writer-core
>Reporter: Sagar Sumit
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Hudi requires users to specify a primary key field. Can we do away with this 
> requirement? This epic tracks the work to support use cases which does not 
> require primary key based data modelling.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HUDI-4699) Primary key-less data model

2024-04-01 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reopened HUDI-4699:
---
Assignee: sivabalan narayanan

> Primary key-less data model
> ---
>
> Key: HUDI-4699
> URL: https://issues.apache.org/jira/browse/HUDI-4699
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: writer-core
>Reporter: Sagar Sumit
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> Hudi requires users to specify a primary key field. Can we do away with this 
> requirement? This epic tracks the work to support use cases which does not 
> require primary key based data modelling.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7556) Fix MDT validator to account for additional partitions in MDT

2024-03-29 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7556:
-

 Summary: Fix MDT validator to account for additional partitions in 
MDT
 Key: HUDI-7556
 URL: https://issues.apache.org/jira/browse/HUDI-7556
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata
Reporter: sivabalan narayanan


There is a chance that MDT could list additional partitions when compared to FS 
based listing. 

The reason is: 

We load the active timeline from the metaclient and poll the FS based listing for 
completed commits. And then we poll MDT for the list of all partitions. In between 
these two calls, a commit could have completed, and hence MDT could be serving 
that as well. So, let's account for that in our validation tool.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7549) Data inconsistency issue w/ spurious log block detection

2024-03-25 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7549:
-

 Summary: Data inconsistency issue w/ spurious log block detection
 Key: HUDI-7549
 URL: https://issues.apache.org/jira/browse/HUDI-7549
 Project: Apache Hudi
  Issue Type: Bug
  Components: reader-core
Reporter: sivabalan narayanan


We added support to detect spurious log blocks in the log block reader 

[https://github.com/apache/hudi/pull/9545]

[https://github.com/apache/hudi/pull/9611] 

in 0.14.0. 

Apparently there are some cases where it could lead to data loss or data 
consistency issues. 

[https://github.com/apache/hudi/pull/9611#issuecomment-2016687160] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7532) Fix schedule compact to only consider DCs after last compaction commit

2024-03-22 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7532:
-

 Summary: Fix schedule compact to only consider DCs after last 
compaction commit 
 Key: HUDI-7532
 URL: https://issues.apache.org/jira/browse/HUDI-7532
 Project: Apache Hudi
  Issue Type: Bug
  Components: compaction
Reporter: sivabalan narayanan


Fix schedule compaction to only consider delta commits (DCs) after the last 
compaction commit. As of now, it also considers replace commits. 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7528) Fix RowCustomColumnsSortPartitioner to use repartition instead of coalesce

2024-03-22 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7528:
-

 Summary: Fix RowCustomColumnsSortPartitioner to use repartition 
instead of coalesce
 Key: HUDI-7528
 URL: https://issues.apache.org/jira/browse/HUDI-7528
 Project: Apache Hudi
  Issue Type: Bug
  Components: writer-core
Reporter: sivabalan narayanan


Fix RowCustomColumnsSortPartitioner to use repartition instead of coalesce
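
A sketch of the difference using only the Spark Dataset API: repartition(n) can raise the parallelism to n, whereas coalesce(n) can only reduce it, which is what the fix is about. The partitioner shape here is illustrative, not the actual Hudi class.

import java.util.Arrays;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

public class SortColumnsPartitionerSketch {
  static Dataset<Row> repartitionAndSort(Dataset<Row> rows, String[] sortColumns, int outputPartitions) {
    Column[] cols = Arrays.stream(sortColumns).map(functions::col).toArray(Column[]::new);
    return rows.repartition(outputPartitions)      // unlike coalesce, can increase the partition count
               .sortWithinPartitions(cols);
  }
}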



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7526) Fix constructors for all bulk insert sort partitioners to ensure we could use it as user defined partitioners

2024-03-21 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7526:
-

 Summary: Fix constructors for all bulk insert sort partitioners to 
ensure we could use it as user defined partitioners 
 Key: HUDI-7526
 URL: https://issues.apache.org/jira/browse/HUDI-7526
 Project: Apache Hudi
  Issue Type: Bug
  Components: writer-core
Reporter: sivabalan narayanan


Our constructor for a user defined sort partitioner takes in the write config, 
while some of the partitioners used in the out of the box sort modes do not 
account for it. 

Let's fix the sort partitioners to ensure any of them can be used as a user 
defined partitioner. 

For e.g., NoneSortMode does not have a constructor that takes in the write config.
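
A hypothetical shape of the fix (this is not the actual BulkInsertPartitioner interface, just an illustration of giving every partitioner a config-accepting constructor):

import java.util.Properties;

public class ConfigAwareNoneSortPartitioner {
  private final Properties writeConfig;

  // Kept for the built-in sort-mode code path.
  public ConfigAwareNoneSortPartitioner() {
    this(new Properties());
  }

  // Added so the class can also be loaded reflectively as a user defined partitioner.
  public ConfigAwareNoneSortPartitioner(Properties writeConfig) {
    this.writeConfig = writeConfig;
  }
}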



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-03-18 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828123#comment-17828123
 ] 

sivabalan narayanan edited comment on HUDI-7507 at 3/19/24 1:12 AM:


Just trying to replay the same scenario for data table, conflict resolution 
could have aborted job2. and hence we may not hit the same issue. 


was (Author: shivnarayan):
Just trying to replay the same scenario for data table, conflict resolution 
could have aborted job2. and hence we may not hit the same issue. 

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Krishen Bhan
>Priority: Major
>
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x) , Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is ingestion commit and Job 1 is compaction/log compaction, then 
> when Job 1 runs before Job 2 and can create a compaction plan for all instant 
> times (up to (x) ) that doesn’t include instant time (x-1) .  Later Job 2 
> will create instant time (x-1), but timeline will be in a corrupted state 
> since compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job2 to be "skipped" by clean.
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp ( that is 
> greater than C)
>  # Release table lock
> Unfortunately this has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan, even 
> if its an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this
> An alternate approach (suggested by [~pwason] ) was to instead have all 
> operations including table services perform conflict resolution checks before 
> committing. For example, clean and compaction would generate their plan as 
> usual. But when creating a transaction to write a .requested file, right 
> before creating the file they should check if another lower timestamp instant 
> has appeared in the timeline. And if so, they should fail/abort without 
> creating the plan. Commit operations would also be updated/verified to have 
> similar check, before creating a .requested file (during a transaction) the 
> commit operation will check if a table service plan (clean/compact) with a 
> greater instant time has been created. And if so, would abort/fail. This 
> avoids the drawbacks of the first approach, but will lead to more transient 
> failures that users have to handle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-03-18 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828142#comment-17828142
 ] 

sivabalan narayanan commented on HUDI-7507:
---

We have already fixed it w/ latest master (1.0) by generating the new commit 
times using locks. That should solve the issue. We can apply the same to 0.X 
branch. 

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Krishen Bhan
>Priority: Major
>
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x) , Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is ingestion commit and Job 1 is compaction/log compaction, then 
> when Job 1 runs before Job 2 and can create a compaction plan for all instant 
> times (up to (x) ) that doesn’t include instant time (x-1) .  Later Job 2 
> will create instant time (x-1), but timeline will be in a corrupted state 
> since compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job2 to be "skipped" by clean.
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp ( that is 
> greater than C)
>  # Release table lock
> Unfortunately this has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan, even 
> if its an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this
> An alternate approach (suggested by [~pwason] ) was to instead have all 
> operations including table services perform conflict resolution checks before 
> committing. For example, clean and compaction would generate their plan as 
> usual. But when creating a transaction to write a .requested file, right 
> before creating the file they should check if another lower timestamp instant 
> has appeared in the timeline. And if so, they should fail/abort without 
> creating the plan. Commit operations would also be updated/verified to have 
> similar check, before creating a .requested file (during a transaction) the 
> commit operation will check if a table service plan (clean/compact) with a 
> greater instant time has been created. And if so, would abort/fail. This 
> avoids the drawbacks of the first approach, but will lead to more transient 
> failures that users have to handle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-03-18 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828123#comment-17828123
 ] 

sivabalan narayanan commented on HUDI-7507:
---

Just trying to replay the same scenario for data table, conflict resolution 
could have aborted job2. and hence we may not hit the same issue. 

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Krishen Bhan
>Priority: Major
>
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x) , Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is ingestion commit and Job 1 is compaction/log compaction, then 
> when Job 1 runs before Job 2 and can create a compaction plan for all instant 
> times (up to (x) ) that doesn’t include instant time (x-1) .  Later Job 2 
> will create instant time (x-1), but timeline will be in a corrupted state 
> since compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job2 to be "skipped" by clean.
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp ( that is 
> greater than C)
>  # Release table lock
> Unfortunately this has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan, even 
> if its an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this
> An alternate approach (suggested by [~pwason] ) was to instead have all 
> operations including table services perform conflict resolution checks before 
> committing. For example, clean and compaction would generate their plan as 
> usual. But when creating a transaction to write a .requested file, right 
> before creating the file they should check if another lower timestamp instant 
> has appeared in the timeline. And if so, they should fail/abort without 
> creating the plan. Commit operations would also be updated/verified to have 
> similar check, before creating a .requested file (during a transaction) the 
> commit operation will check if a table service plan (clean/compact) with a 
> greater instant time has been created. And if so, would abort/fail. This 
> avoids the drawbacks of the first approach, but will lead to more transient 
> failures that users have to handle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7511) Offset range calculation in kafka should return all topic partitions

2024-03-17 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7511:
-

 Summary: Offset range calculation in kafka should return all topic 
partitions 
 Key: HUDI-7511
 URL: https://issues.apache.org/jira/browse/HUDI-7511
 Project: Apache Hudi
  Issue Type: Bug
  Components: deltastreamer
Reporter: sivabalan narayanan


After [https://github.com/apache/hudi/pull/10869] landed, we are not returning 
every topic partition in the final ranges. But for checkpointing purposes, we 
need to have every Kafka topic partition in the final ranges even if we are not 
consuming anything from it. 
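
A small sketch of the intended behaviour with the Spark Kafka OffsetRange type (the offset maps are placeholders): every partition gets a range, and an untouched partition simply gets an empty one so its checkpoint is preserved.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class OffsetRangesForAllPartitionsSketch {
  static List<OffsetRange> rangesForAllPartitions(String topic,
                                                  Map<Integer, Long> fromOffsets,
                                                  Map<Integer, Long> untilOffsets) {
    List<OffsetRange> ranges = new ArrayList<>();
    for (Map.Entry<Integer, Long> entry : fromOffsets.entrySet()) {
      long from = entry.getValue();
      long until = untilOffsets.getOrDefault(entry.getKey(), from); // empty range when nothing is consumed
      ranges.add(OffsetRange.create(topic, entry.getKey(), from, until));
    }
    return ranges;
  }
}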

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7491) Handle null extra metadata w/ clean commit metadata

2024-03-07 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7491:
-

 Summary: Handle null extra metadata w/ clean commit metadata
 Key: HUDI-7491
 URL: https://issues.apache.org/jira/browse/HUDI-7491
 Project: Apache Hudi
  Issue Type: Bug
  Components: cleaning
Reporter: sivabalan narayanan


[https://github.com/apache/hudi/pull/10651/] 

 

After this fix, older clean commits may not have any extra metadata. We need to 
handle null for the entire map. 
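
A defensive-read sketch (the lookup key is just a placeholder): treat a missing extra-metadata map the same as an empty one.

import java.util.Collections;
import java.util.Map;

public class CleanExtraMetadataReadSketch {
  static String readExtraMetadata(Map<String, String> extraMetadata, String key) {
    Map<String, String> safe =
        (extraMetadata == null) ? Collections.<String, String>emptyMap() : extraMetadata;
    return safe.getOrDefault(key, "");
  }
}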

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7490) Fix archival guarding data files not yet cleaned up by cleaner when savepoint is removed

2024-03-07 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7490:
--
Description: 
We added a fix recently where cleaner will take care of cleaning up savepointed 
files too w/o fail with 

[https://github.com/apache/hudi/pull/10651] 

Scenario the above patch fixes:

By default incremental cleaner is enabled. Cleaner during planning, will only 
account for partitions touched in recent commits (after earliest commit to 
retain from last completed clean). 

So, if there is a savepoint added and removed later on, the cleaner might miss 
cleaning those files. So, we fixed that gap in the above patch. 

Fix: Clean commit metadata will track savepointed commits. So, the next time the 
clean planner runs, we find the mismatch between the tracked savepointed commits 
and the current savepoints from the timeline, and if there is a difference, the 
cleaner will account for partitions touched by the savepointed commit.  

 

 

But we might have a gap wrt archival. 

If we ensure archival will run just after cleaning and not independently, we 
should be good.

But there is a chance we could expose duplicate data to readers with the below 
scenario. 

 

Let's say we have a savepoint at t5.commit, so the cleaner skipped deleting the 
files created at t5 and went past it. And say we have a replace commit at t10 
which replaced all data files that were created at t5. 

With this state, say we removed the savepoint. 

We will have data files created by t5.commit in the data directory. 

As long as t10 is in the active timeline, readers will only see files written by 
t10 and will ignore files written by t5. 

At this juncture, if we run archival (without the cleaner), archival might archive 
t5 to t10, in which case data files written by both t5 and t10 will be exposed to 
readers. 

In most common deployment models, where we recommend stopping the pipeline while 
doing a savepoint and restore or deleting a savepoint, this might be uncommon. But 
there is a chance that this could happen. 

 

So, we have to guard the archival in this case. 

Essentially, we need to ensure before archiving a replace commit, the fileIds 
that were replaced are cleaned by the cleaner. 

 

Probable fix:

We can follow similar approach we followed in 
[https://github.com/apache/hudi/pull/10651]  . 

Essentially, check the list of savepoints in the current timeline and compare it 
with the savepointed instants in the latest clean commit metadata. If they match, 
we do not need to block archival. But if there is a difference (which means a 
savepoint was deleted from the timeline and the cleaner has not had a chance to 
clean up yet), we should skip archiving anything and come back next time. 
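
A minimal sketch of that check (the inputs are placeholders for what the timeline and the latest clean commit metadata would provide):

import java.util.Set;

public class SavepointAwareArchivalGuardSketch {
  // Archival proceeds only when the savepoints on the timeline match the savepointed
  // instants tracked by the latest clean; a mismatch means the cleaner has not yet
  // caught up with a savepoint removal, so archiving could expose replaced files.
  static boolean archivalAllowed(Set<String> savepointsOnTimeline,
                                 Set<String> savepointsTrackedInLatestClean) {
    return savepointsOnTimeline.equals(savepointsTrackedInLatestClean);
  }
}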

 

 

 

 

  was:
We added a fix recently where cleaner will take care of cleaning up savepointed 
files too w/o fail with 

[https://github.com/apache/hudi/pull/10651] 

 

But we might have a gap wrt archival. 

If we ensure archival will run just after cleaning and not independently, we 
should be good.

but if there is a chance we could expose duplicate data to readers w/ below 
scenario. 

 

lets say we have a savepoint at t5.commit. So, cleaner skipped to delete the 
files created at t5 and went past it. and say we have a replace commit at t10 
which replaced all data files that were created at t5. 

w/ this state, say we removed the savepoint. 

we will have data files created by t5.commit in data directory. 

as long as t10 is in active timeline, readers will only see files written by 
t10 and will ignore files written by t5. 

at this juncture, if we run archival (w/o cleaner), archival might archive t5 
to t10. on which case both data files written by t5 and t10 will be exposed to 
readers. 

In most common deployment models, where we recommend to stop the pipeline while 
doing savepoint and restore or deleting savepoint, this might be uncommon. but 
there is a chance that this could happen. 

 

So, we have to guard the archival in this case. 

Essentially, we need to ensure before archiving a replace commit, the fileIds 
that were replaced are cleaned by the cleaner. 

 

Probable fix:

We can follow similar approach we followed in 
[https://github.com/apache/hudi/pull/10651]  . 

Essentially check for list of savepoints in current timeline and compare it w/ 
savepointed instants in latest clean commit metadata. If they match, we do not 
need to block archival. but if there is a difference (which means a savepoint 
was deleted in timeline and cleaner has not got a chance to cleanup yet), we 
should punt archiving anything and come back next time. 

 

 

 

 


> Fix archival guarding data files not yet cleaned up by cleaner when savepoint 
> is removed
> 
>
> Key: HUDI-7490
> URL: https://issues.apache.org/jira/browse/HUDI-7490
> Project: Apache Hudi
>  

[jira] [Updated] (HUDI-7490) Fix archival guarding data files not yet cleaned up by cleaner when savepoint is removed

2024-03-07 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7490:
--
Description: 
We recently added a fix where the cleaner will also take care of cleaning up 
savepointed files, without fail: 

[https://github.com/apache/hudi/pull/10651] 

 

But we might have a gap w.r.t. archival. 

If we ensure archival runs just after cleaning and not independently, we 
should be good.

But there is a chance we could expose duplicate data to readers with the below 
scenario. 

 

Let's say we have a savepoint at t5.commit. So, the cleaner skipped deleting the 
files created at t5 and went past it. And say we have a replace commit at t10 
which replaced all data files that were created at t5. 

With this state, say we remove the savepoint. 

We will still have the data files created by t5.commit in the data directory. 

As long as t10 is in the active timeline, readers will only see files written by 
t10 and will ignore files written by t5. 

At this juncture, if we run archival (without the cleaner), archival might archive t5 
to t10, in which case data files written by both t5 and t10 will be exposed to 
readers. 

In the most common deployment models, where we recommend stopping the pipeline while 
doing a savepoint and restore or deleting a savepoint, this might be uncommon, but 
there is a chance that it could happen. 

 

So, we have to guard the archival in this case. 

Essentially, we need to ensure that before archiving a replace commit, the fileIds 
that were replaced have been cleaned up by the cleaner. 

 

Probable fix:

We can follow an approach similar to the one in 
[https://github.com/apache/hudi/pull/10651]. 

Essentially, check the list of savepoints in the current timeline and compare it with 
the savepointed instants in the latest clean commit metadata. If they match, we do not 
need to block archival. But if there is a difference (which means a savepoint was 
deleted in the timeline and the cleaner has not had a chance to clean up yet), we 
should skip archiving anything and come back next time. 

 

 

 

 

  was:
We added a fix recently where cleaner will take care of cleaning up savepointed 
files too w/o fail with 

[https://github.com/apache/hudi/pull/10651] 

 

But we might have a gap wrt archival. 

If we ensure archival will run just after cleaning and not independently, we 
should be good.

but if there is a chance we could expose duplicate data to readers w/ below 
scenario. 

 

lets say we have a savepoint at t5.commit. So, cleaner skipped to delete the 
files created at t5 and went past it. and say we have a replace commit at t10 
which replaced all data files that were created at t5. 

w/ this state, say we removed the savepoint. 

we will have data files created by t5.commit in data directory. 

as long as t10 is in active timeline, readers will only see files written by 
t10 and will ignore files written by t5. 

at this juncture, if we run archival (w/o cleaner), archival might archive t5 
to t10. on which case both data files written by t5 and t10 will be exposed to 
readers. 

So, we have to guard the archival in this case. 

Essentially, we need to ensure before archiving a replace commit, the fileIds 
that were replaced are cleaned by the cleaner. 

 

Probable fix:

We can follow similar approach we followed in 
[https://github.com/apache/hudi/pull/10651]  . 

Essentially check for list of savepoints in current timeline and compare it w/ 
savepointed instants in latest clean commit metadata. If they match, we do not 
need to block archival. but if there is a difference (which means a savepoint 
was deleted in timeline and cleaner has not got a chance to cleanup yet), we 
should punt archiving anything and come back next time. 

 

 

 

 


> Fix archival guarding data files not yet cleaned up by cleaner when savepoint 
> is removed
> 
>
> Key: HUDI-7490
> URL: https://issues.apache.org/jira/browse/HUDI-7490
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: archiving, cleaning, clustering
>Reporter: sivabalan narayanan
>Priority: Major
>
> We added a fix recently where cleaner will take care of cleaning up 
> savepointed files too w/o fail with 
> [https://github.com/apache/hudi/pull/10651] 
>  
> But we might have a gap wrt archival. 
> If we ensure archival will run just after cleaning and not independently, we 
> should be good.
> but if there is a chance we could expose duplicate data to readers w/ below 
> scenario. 
>  
> lets say we have a savepoint at t5.commit. So, cleaner skipped to delete the 
> files created at t5 and went past it. and say we have a replace commit at t10 
> which replaced all data files that were created at t5. 
> w/ this state, say we removed the savepoint. 
> we will have data files created by t5.commit in 

[jira] [Created] (HUDI-7490) Fix archival guarding data files not yet cleaned up by cleaner when savepoint is removed

2024-03-07 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7490:
-

 Summary: Fix archival guarding data files not yet cleaned up by 
cleaner when savepoint is removed
 Key: HUDI-7490
 URL: https://issues.apache.org/jira/browse/HUDI-7490
 Project: Apache Hudi
  Issue Type: Bug
  Components: archiving, cleaning, clustering
Reporter: sivabalan narayanan


We recently added a fix where the cleaner will also take care of cleaning up 
savepointed files, without fail: 

[https://github.com/apache/hudi/pull/10651] 

 

But we might have a gap w.r.t. archival. 

If we ensure archival runs just after cleaning and not independently, we 
should be good.

But there is a chance we could expose duplicate data to readers with the below 
scenario. 

 

Let's say we have a savepoint at t5.commit. So, the cleaner skipped deleting the 
files created at t5 and went past it. And say we have a replace commit at t10 
which replaced all data files that were created at t5. 

With this state, say we remove the savepoint. 

We will still have the data files created by t5.commit in the data directory. 

As long as t10 is in the active timeline, readers will only see files written by 
t10 and will ignore files written by t5. 

At this juncture, if we run archival (without the cleaner), archival might archive t5 
to t10, in which case data files written by both t5 and t10 will be exposed to 
readers. 

So, we have to guard the archival in this case. 

Essentially, we need to ensure that before archiving a replace commit, the fileIds 
that were replaced have been cleaned up by the cleaner. 

 

Probable fix:

We can follow an approach similar to the one in 
[https://github.com/apache/hudi/pull/10651]. 

Essentially, check the list of savepoints in the current timeline and compare it with 
the savepointed instants in the latest clean commit metadata. If they match, we do not 
need to block archival. But if there is a difference (which means a savepoint was 
deleted in the timeline and the cleaner has not had a chance to clean up yet), we 
should skip archiving anything and come back next time. 

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7478) Fix max delta commits guard check w/ MDT

2024-03-04 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7478:
-

 Summary: Fix max delta commits guard check w/ MDT 
 Key: HUDI-7478
 URL: https://issues.apache.org/jira/browse/HUDI-7478
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata
Reporter: sivabalan narayanan


protected static void checkNumDeltaCommits(HoodieTableMetaClient metaClient, int maxNumDeltaCommitsWhenPending) {
  final HoodieActiveTimeline activeTimeline = metaClient.reloadActiveTimeline();
  Option<HoodieInstant> lastCompaction = activeTimeline.filterCompletedInstants()
      .filter(s -> s.getAction().equals(COMPACTION_ACTION)).lastInstant();
  int numDeltaCommits = lastCompaction.isPresent()
      ? activeTimeline.getDeltaCommitTimeline().findInstantsAfter(lastCompaction.get().getTimestamp()).countInstants()
      : activeTimeline.getDeltaCommitTimeline().countInstants();
  if (numDeltaCommits > maxNumDeltaCommitsWhenPending) {
    throw new HoodieMetadataException(String.format("Metadata table's deltacommits exceeded %d: "
        + "this is likely caused by a pending instant in the data table. Resolve the pending instant "
        + "or adjust `%s`, then restart the pipeline.",
        maxNumDeltaCommitsWhenPending, HoodieMetadataConfig.METADATA_MAX_NUM_DELTACOMMITS_WHEN_PENDING.key()));
  }
}






Here we account for the action type "compaction". But a completed compaction instant 
will have "commit" as its action, so this filter never matches. We need to fix it. 

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7460) Fix compaction schedule with pending delta commits

2024-02-29 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7460:
-

 Summary: Fix compaction schedule with pending delta commits
 Key: HUDI-7460
 URL: https://issues.apache.org/jira/browse/HUDI-7460
 Project: Apache Hudi
  Issue Type: Improvement
  Components: compaction
Reporter: sivabalan narayanan


Hudi has a constraint that a compaction can be scheduled only if there are no 
pending delta commits whose instant time is earlier than the compaction instant being 
scheduled. We were throwing an exception when this condition is not met. We should fix 
the user-facing behavior here so that, instead of throwing an exception, we return an 
empty plan when the condition is not met.
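
A rough sketch of the intended behavior; names like pendingDeltaCommitTimes, 
generateCompactionPlan and LOG are placeholders, not the actual scheduler code:
{code:java}
// Sketch only: if any pending delta commit is earlier than the proposed
// compaction instant, log and return an empty plan instead of throwing.
boolean earlierPendingDeltaCommit = pendingDeltaCommitTimes.stream()
    .anyMatch(ts -> ts.compareTo(compactionInstantTime) < 0);
if (earlierPendingDeltaCommit) {
  LOG.warn("Found pending delta commits before " + compactionInstantTime
      + "; returning an empty compaction plan instead of throwing");
  return Option.empty();
}
return Option.of(generateCompactionPlan()); // existing plan generation path
{code}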



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7429) Fix avg record size estimation for delta commits and replace commits

2024-02-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7429:
-

Assignee: sivabalan narayanan

> Fix avg record size estimation for delta commits and replace commits
> 
>
> Key: HUDI-7429
> URL: https://issues.apache.org/jira/browse/HUDI-7429
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> avg record size calculation only considers COMMIT for now. lets fix it to 
> include delta commit and replace commits as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7429) Fix avg record size estimation for delta commits and replace commits

2024-02-20 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7429:
-

 Summary: Fix avg record size estimation for delta commits and 
replace commits
 Key: HUDI-7429
 URL: https://issues.apache.org/jira/browse/HUDI-7429
 Project: Apache Hudi
  Issue Type: Improvement
  Components: writer-core
Reporter: sivabalan narayanan


The average record size calculation only considers COMMIT instants for now. Let's fix 
it to include delta commits and replace commits as well.
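
A hedged sketch of the direction, using the action constants from HoodieTimeline; the 
surrounding size-estimation logic is elided and the exact wiring may differ:
{code:java}
// Sketch only: pick instants for record size estimation from all completed
// write actions, not just "commit". (Set/Arrays are from java.util.)
Set<String> writeActions = new HashSet<>(Arrays.asList(
    HoodieTimeline.COMMIT_ACTION,
    HoodieTimeline.DELTA_COMMIT_ACTION,
    HoodieTimeline.REPLACE_COMMIT_ACTION));

HoodieTimeline estimationTimeline = activeTimeline.filterCompletedInstants()
    .filter(instant -> writeActions.contains(instant.getAction()));
{code}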



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7407) Add optional clean support to standalone compaction and clustering jobs

2024-02-13 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7407:
-

Assignee: sivabalan narayanan

> Add optional clean support to standalone compaction and clustering jobs
> ---
>
> Key: HUDI-7407
> URL: https://issues.apache.org/jira/browse/HUDI-7407
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> Lets add top level config to standalone compaction and clustering job to 
> optionally clean. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7407) Add optional clean support to standalone compaction and clustering jobs

2024-02-13 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7407:
-

 Summary: Add optional clean support to standalone compaction and 
clustering jobs
 Key: HUDI-7407
 URL: https://issues.apache.org/jira/browse/HUDI-7407
 Project: Apache Hudi
  Issue Type: Improvement
  Components: table-service
Reporter: sivabalan narayanan


Let's add a top-level config to the standalone compaction and clustering jobs to 
optionally run a clean. 
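
A sketch of what the opt-in could look like in a standalone job; the flag name and the 
wiring are illustrative only, and it assumes the job already holds a write client:
{code:java}
// Sketch only: an opt-in flag on the standalone job's JCommander config,
// plus a clean call after the table service finishes successfully.
public static class Config {
  @com.beust.jcommander.Parameter(names = {"--enable-clean"},
      description = "Trigger a clean after the compaction/clustering completes")
  public Boolean enableClean = false;
}

// after the compaction/clustering run succeeds:
if (cfg.enableClean) {
  writeClient.clean(); // assumes the job already holds a write client
}
{code}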



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7397) Add support to purge a clustering instant

2024-02-09 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7397:
-

 Summary: Add support to purge a clustering instant
 Key: HUDI-7397
 URL: https://issues.apache.org/jira/browse/HUDI-7397
 Project: Apache Hudi
  Issue Type: Improvement
  Components: clustering
Reporter: sivabalan narayanan


As of now, if a user made a mistake in the clustering params and wishes to 
completely purge a pending clustering, we do not have any support for that. It 
would be good to add that support.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7320) hive-sync unexpectedly loads archived timeline

2024-01-30 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812499#comment-17812499
 ] 

sivabalan narayanan commented on HUDI-7320:
---

We already fixed something along these lines. Can you check if it is reproducible 
with 0.14.0 as well? 

 

> hive-sync unexpectedly loads archived timeline
> --
>
> Key: HUDI-7320
> URL: https://issues.apache.org/jira/browse/HUDI-7320
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Affects Versions: 0.13.1
>Reporter: Raymond Xu
>Priority: Critical
> Attachments: Screenshot 2024-01-16 at 5.49.25 PM.png, Screenshot 
> 2024-01-16 at 5.49.30 PM.png
>
>
> Investigation shows that the hive-sync step loaded the archived timeline and caused 
> a long delay in the overall write process. And a full scan for changes in all 
> partitions is not used. We need to dig further.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7331) Test and certify col stats integration with MOR table

2024-01-24 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7331:
-

 Summary: Test and certify col stats integration with MOR table
 Key: HUDI-7331
 URL: https://issues.apache.org/jira/browse/HUDI-7331
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata
Reporter: sivabalan narayanan


Let's test and certify the col stats integration with MOR tables for all operations.

For example, any write operation (bulk insert, insert, upsert, insert overwrite) 
should add new entries to the col stats index in the metadata table. 

Rollback: 

data files that were deleted should have their entries removed from col stats; 

for log files added by the rollback, we should add new entries to col stats. 

 

Clean: 

any files deleted (data files and log files) should have their entries removed 
from col stats in the MDT. 

 

Similarly, let's do the same exercise with delete partition and the other 
operations we have in Hudi. 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7272) Cut docs for 0.14.1

2024-01-03 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7272:
-

 Summary: Cut docs for 0.14.1
 Key: HUDI-7272
 URL: https://issues.apache.org/jira/browse/HUDI-7272
 Project: Apache Hudi
  Issue Type: Improvement
  Components: docs
Reporter: sivabalan narayanan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6112) Improve Doc generation to generate config tables for basic and advanced configs

2024-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-6112:
--
Fix Version/s: 1.1.0
   (was: 0.14.1)

> Improve Doc generation to generate config tables for basic and advanced 
> configs
> 
>
> Key: HUDI-6112
> URL: https://issues.apache.org/jira/browse/HUDI-6112
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> The HoodieConfigDocGenerator will need to be modified such that:
>  * Each config group has two sections: basic configs and advanced configs
>  * Basic configs and advanced configs are laid out in a table instead of 
> serially like today.
>  * Among each of these tables the required configs are bubbled up to the top 
> of the table and highlighted.
> Add UI fixes to support a table layout



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6932) Fix batch size for delete partition for AWSGlueCatalogSyncClient

2024-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-6932.
-
Resolution: Fixed

> Fix batch size for delete partition for AWSGlueCatalogSyncClient
> 
>
> Key: HUDI-6932
> URL: https://issues.apache.org/jira/browse/HUDI-6932
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: Aditya Goenka
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Github Issue - [https://github.com/apache/hudi/issues/9806]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7100) Data loss when using insert_overwrite_table with insert.drop.duplicates

2024-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7100.
-
Resolution: Fixed

> Data loss when using insert_overwrite_table with insert.drop.duplicates
> ---
>
> Key: HUDI-7100
> URL: https://issues.apache.org/jira/browse/HUDI-7100
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.4, 0.14.1, 0.13.2
>
>
> Code to reproduce - 
> Github Issue - [https://github.com/apache/hudi/issues/9967]
> ```
> schema = StructType(
> [
> StructField("id", IntegerType(), True),
> StructField("name", StringType(), True)
> ]
> )
> data = [
> Row(1, "a"),
> Row(2, "a"),
> Row(3, "c"),
> ]
> hudi_configs = {
> "hoodie.table.name": TABLE_NAME,
> "hoodie.datasource.write.recordkey.field": "name",
> "hoodie.datasource.write.precombine.field": "id",
> "hoodie.datasource.write.operation":"insert_overwrite_table",
> "hoodie.table.keygenerator.class": 
> "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
> }
> df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.write.format("org.apache.hudi").options(**hudi_configs).mode("append").save(PATH)
> spark.read.format("hudi").load(PATH).show()
> -- Showing no records
> ```
> df.write.format("org.apache.hudi").options(**hudi_configs).option("hoodie.datasource.write.insert.drop.duplicates","true").mode("append").save(PATH)
> spark.read.format("hudi").load(PATH).show()



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7120) Performance improvements in deltastreamer executor code path

2024-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7120.
-
  Assignee: Lokesh Jain
Resolution: Fixed

> Performance improvements in deltastreamer executor code path
> 
>
> Key: HUDI-7120
> URL: https://issues.apache.org/jira/browse/HUDI-7120
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Makes improvements based on findings from CPU profiling for the executor code 
> path.
> 1. Fixes repetitive execution of string split operation
> 2. reduces number of validation calls



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6954) Corrupted column stats in metadata table in non-partitioned table

2023-12-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-6954.
-
  Assignee: sivabalan narayanan
Resolution: Fixed

> Corrupted column stats in metadata table in non-partitioned table
> -
>
> Key: HUDI-6954
> URL: https://issues.apache.org/jira/browse/HUDI-6954
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> After compaction in MDT, the column stats entries in metadata table for a 
> non-partitioned data table are corrupted, with wrong encoded part of 
> partition path in the key.  This makes some column stats entries not 
> searchable through the key based on column name, partition path, and the file 
> name as the key is wrong in the column stats partition in MDT. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7135) Spark reads hudi table error when flink creates the table without preCombine fields by catalog or factory

2023-12-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7135:
--
Fix Version/s: 0.14.1

> Spark reads hudi table error when flink creates the table without preCombine 
> fields by catalog or factory
> -
>
> Key: HUDI-7135
> URL: https://issues.apache.org/jira/browse/HUDI-7135
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: 陈磊
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 1.0.0
>
>
> Create a table through dfs catalog, hms catalog, or sink ddl, and then query 
> the data of the table through spark, and an exception occurs:
> java.util.NoSuchElementException: key not found: ts
> demo:
>  1. create a table through hms catalog:
> {panel:title=hms catalog create table}
> CREATE CATALOG hudi_catalog WITH(
> 'type' = 'hudi',
> 'mode' = 'hms'
> );
> CREATE TABLE hudi_catalog.`default`.ct1
> (
>   f1 string,
>   f2 string
> ) WITH (
>   'connector' = 'hudi',
>   'path' = 'file:///Users/x/x/others/data/hudi-warehouse/ct1',
>   'table.type' = 'COPY_ON_WRITE',
>   'write.operation' = 'insert'
> );
> {panel}
> 2. spark query
> {panel:title=spark query}
> select * from ct1
> {panel}
> 3. exception
> {panel:title=exception}
> java.util.NoSuchElementException: key not found: ts
> {panel}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6012) delete base path when failed to run bootstrap procedure

2023-12-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-6012.
-
Fix Version/s: 0.14.1
 Assignee: sivabalan narayanan
   Resolution: Fixed

> delete base path when failed to run bootstrap procedure
> ---
>
> Key: HUDI-6012
> URL: https://issues.apache.org/jira/browse/HUDI-6012
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: bootstrap
>Reporter: lvyanquan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> [run_bootstrap](https://hudi.apache.org/docs/next/procedures#run_bootstrap) 
> procedure is called like this 
> {code:java}
> call run_bootstrap(table => 'test_hudi_table', table_type => 'COPY_ON_WRITE', 
> bootstrap_path => 'hdfs://ns1/hive/warehouse/hudi.db/test_hudi_table', 
> base_path => 'hdfs://ns1//tmp/hoodie/test_hudi_table', rowKey_field => 'id', 
> partition_path_field => 'dt'); {code}
> In some exceptional cases this procedure will fail, for example, when bootstrap_path 
> does not exist or is empty. The `base_path` in HDFS is still left behind with a 
> `.hoodie` directory.
> Though we can still rerun the bootstrap procedure and pass the `bootstrap_overwrite` 
> parameter, it is better to clean up the path that we created, after the failure.
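
A standalone sketch of the proposed cleanup using the Hadoop FileSystem API; the wiring 
around the actual run_bootstrap procedure is an assumption:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BootstrapCleanupSketch {
  // Sketch only: if bootstrap fails partway, remove the base path this run
  // created so a retry does not need bootstrap_overwrite.
  public static void runBootstrapWithCleanup(String basePath, Runnable doBootstrap) throws Exception {
    Path base = new Path(basePath);
    FileSystem fs = base.getFileSystem(new Configuration());
    boolean existedBefore = fs.exists(base); // only delete what this run created
    try {
      doBootstrap.run();
    } catch (Exception e) {
      if (!existedBefore && fs.exists(base)) {
        fs.delete(base, true); // recursively remove the partial .hoodie directory
      }
      throw e;
    }
  }
}
{code}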



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6094) Make Kafka send record from async to sync

2023-12-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-6094.
-
Fix Version/s: 1.1.0
   Resolution: Fixed

> Make Kafka send record from async to sync
> -
>
> Key: HUDI-6094
> URL: https://issues.apache.org/jira/browse/HUDI-6094
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: DuBin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> In the call method of HoodieWriteCommitKafkaCallback in the hudi-utilities module, 
> the Kafka send is async. How about making the send synchronous, to ensure the 
> Kafka send call completes? There should be no performance degradation, because the 
> send call is in a try-with-resources block.
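
For reference, a minimal sketch of the difference between async and synchronous send with 
the plain Kafka producer API; this is not the actual HoodieWriteCommitKafkaCallback code, 
and the broker/topic values are placeholders:
{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SyncSendSketch {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      ProducerRecord<String, String> record =
          new ProducerRecord<>("hudi-commit-callback", "commitTime", "payload");
      // send() alone is asynchronous; blocking on the returned Future makes the
      // caller wait until the broker has acknowledged the record.
      RecordMetadata metadata = producer.send(record).get();
      System.out.println("acked at offset " + metadata.offset());
    } // try-with-resources closes (and flushes) the producer
  }
}
{code}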



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7190) Spark33LegacyHoodieParquetFileFormat failed to read parquet when nested type vectorized read enable

2023-12-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7190.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Spark33LegacyHoodieParquetFileFormat failed to read parquet when nested type 
> vectorized read enable
> ---
>
> Key: HUDI-7190
> URL: https://issues.apache.org/jira/browse/HUDI-7190
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Qijun Fu
>Assignee: Qijun Fu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
> Attachments: image-2023-12-07-15-41-29-452.png
>
>
> For Spark 3.3+, we can do vectorized reads for nested columns. However, 
> when 
> `spark.sql.parquet.enableNestedColumnVectorizedReader = true` and 
> `spark.sql.parquet.enableVectorizedReader = true` are set, Hudi will 
> throw the following exception: 
>  !image-2023-12-07-15-41-29-452.png! 
> We need to fix it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7223) Hudi Cleaner removing files still required for view N hours old

2023-12-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7223.
-
Fix Version/s: 0.14.1
 Assignee: Timothy Brown
   Resolution: Fixed

> Hudi Cleaner removing files still required for view N hours old
> ---
>
> Key: HUDI-7223
> URL: https://issues.apache.org/jira/browse/HUDI-7223
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> If a user is using a time-based cleaner policy, they will expect that they can 
> query the table state as of N hours ago. This means the cleaner should not simply 
> remove files that are older than N hours, but only files that are no longer 
> relevant to the table view as of N hours ago. 
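
A small plain-Java sketch of that retention rule over a single file group's commit times 
(not the actual cleaner code): keep every version newer than the cutoff, plus the newest 
version at or before the cutoff, since that is the version still visible N hours ago.
{code:java}
import java.util.ArrayList;
import java.util.List;

public class HoursRetainedSketch {
  // Sketch only: commitTimesAsc are one file group's version commit times in
  // ascending order; cutoff corresponds to "now minus N hours".
  public static List<String> versionsToRetain(List<String> commitTimesAsc, String cutoff) {
    List<String> retain = new ArrayList<>();
    String latestAtOrBeforeCutoff = null;
    for (String ts : commitTimesAsc) {
      if (ts.compareTo(cutoff) <= 0) {
        latestAtOrBeforeCutoff = ts; // still the visible version as of the cutoff
      } else {
        retain.add(ts);              // newer than the cutoff, always retained
      }
    }
    if (latestAtOrBeforeCutoff != null) {
      retain.add(0, latestAtOrBeforeCutoff);
    }
    return retain;
  }

  public static void main(String[] args) {
    // versions written at 01, 03, 05 and 09 o'clock; cutoff (N hours ago) = 06
    System.out.println(versionsToRetain(List.of("01", "03", "05", "09"), "06"));
    // prints [05, 09]: 05 is older than the cutoff but must not be cleaned,
    // because it is the version a query "as of 06" still reads.
  }
}
{code}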



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7215) Delete NewHoodieParquetFileFormat and all references

2023-12-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7215:
--
Status: Patch Available  (was: In Progress)

> Delete NewHoodieParquetFileFormat and all references
> 
>
> Key: HUDI-7215
> URL: https://issues.apache.org/jira/browse/HUDI-7215
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> HoodieFileGroupReaderBasedParquetFileFormat now has feature parity with 
> NewHoodieParquetFileFormat and no new work will be done on 
> NewHoodieParquetFileFormat. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7213) When using wrong table.type value in hudi catalog, NPE happens

2023-12-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7213.
-
Fix Version/s: 0.14.1
   (was: 1.0.0)
   Resolution: Fixed

>  When using wrong table.type value in hudi catalog, NPE happens
> --
>
> Key: HUDI-7213
> URL: https://issues.apache.org/jira/browse/HUDI-7213
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: jack Lei
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
>  
> {code:java}
> -- create hudi res table
> create table IF NOT EXISTS hudi_catalog.tmp.test_metric_hudi_metastore_mor_new
> (
>`id` string COMMENT 'id',
>`name` string COMMENT 'name',
>`pmerge` string COMMENT 'merge field',
> dt string
> ) PARTITIONED BY (dt) WITH 
> ('connector'='hudi',   
>  'table.type'='MERGE_ON_WRITE',
>  'write.operation'='insert'); {code}
> The table.type value is wrong, 
> and then the following appears:
> {code:java}
> Caused by: org.apache.hudi.exception.HoodieCatalogException: Failed to create 
> table tmp.test_metric_hudi_metastore_mor_newat 
> org.apache.hudi.table.catalog.HoodieHiveCatalog.createTable(HoodieHiveCatalog.java:480)
> at 
> org.apache.flink.table.catalog.CatalogManager.lambda$createTable$10(CatalogManager.java:661)
> at 
> org.apache.flink.table.catalog.CatalogManager.execute(CatalogManager.java:841)
> ... 22 moreCaused by: java.lang.NullPointerExceptionat 
> java.util.HashMap.merge(HashMap.java:1225)at 
> java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1320)at 
> java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
> at java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1699)
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
> at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)  
>   at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> at 
> org.apache.hudi.table.catalog.TableOptionProperties.translateFlinkTableProperties2Spark(TableOptionProperties.java:191)
> at 
> org.apache.hudi.table.catalog.HoodieHiveCatalog.instantiateHiveTable(HoodieHiveCatalog.java:610)
> at 
> org.apache.hudi.table.catalog.HoodieHiveCatalog.createTable(HoodieHiveCatalog.java:469)
> ... 24 more {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5760) Make sure DeleteBlock doesn't use Kryo for serialization to disk

2023-12-13 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-5760.
-
Fix Version/s: 0.14.0
   (was: 0.14.1)
   Resolution: Fixed

> Make sure DeleteBlock doesn't use Kryo for serialization to disk
> 
>
> Key: HUDI-5760
> URL: https://issues.apache.org/jira/browse/HUDI-5760
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 1.0.0-beta1
>Reporter: Alexey Kudinkin
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> The problem is that the serialization of `HoodieDeleteBlock` is generated 
> dynamically by Kryo and could change whenever any class comprising it 
> changes.
> We've been bitten by this already twice:
> HUDI-5758
> HUDI-4959
>  
> Instead, anything that is persisted on disk has to be serialized using 
> hard-coded methods (the same way HoodieDataBlock is serialized).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7228) Close LogFileReaders eagerly with LogRecordReader

2023-12-13 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7228:
-

 Summary: Close LogFileReaders eagerly with LogRecordReader
 Key: HUDI-7228
 URL: https://issues.apache.org/jira/browse/HUDI-7228
 Project: Apache Hudi
  Issue Type: Improvement
  Components: reader-core
Reporter: sivabalan narayanan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7206) Fix auto deletion of MDT

2023-12-10 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7206.
-
Fix Version/s: 0.14.1
 Assignee: sivabalan narayanan
   Resolution: Fixed

> Fix auto deletion of MDT
> 
>
> Key: HUDI-7206
> URL: https://issues.apache.org/jira/browse/HUDI-7206
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> With 014.0, we are triggering deletion of mdt and updating hoodie.properties 
> even if its already disabled and not present. 
>  
> {code:java}
> private boolean shouldExecuteMetadataTableDeletion() {
>   // Only execute metadata table deletion when all the following conditions 
> are met
>   // (1) This is data table
>   // (2) Metadata table is disabled in HoodieWriteConfig for the writer
>   return !metaClient.isMetadataTable()
>   && !config.isMetadataTableEnabled();
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7206) Fix auto deletion of MDT

2023-12-09 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7206:
-

 Summary: Fix auto deletion of MDT
 Key: HUDI-7206
 URL: https://issues.apache.org/jira/browse/HUDI-7206
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: sivabalan narayanan


With 0.14.0, we are triggering deletion of the MDT and updating hoodie.properties 
even if it is already disabled and not present. 

 
{code:java}
private boolean shouldExecuteMetadataTableDeletion() {
  // Only execute metadata table deletion when all the following conditions are 
met
  // (1) This is data table
  // (2) Metadata table is disabled in HoodieWriteConfig for the writer
  return !metaClient.isMetadataTable()
  && !config.isMetadataTableEnabled();
} {code}
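
A hedged sketch of a tighter guard; the table-config accessor used below is an assumption, 
and an explicit filesystem existence check on the MDT base path would serve the same purpose:
{code:java}
// Sketch only: additionally require that some metadata partition is still
// recorded as initialized, so an already-deleted MDT does not keep triggering
// deletion and rewriting hoodie.properties on every write.
private boolean shouldExecuteMetadataTableDeletion() {
  return !metaClient.isMetadataTable()
      && !config.isMetadataTableEnabled()
      && !metaClient.getTableConfig().getMetadataPartitions().isEmpty(); // assumed accessor
}
{code}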



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7205) Optimize MDT table deletion

2023-12-09 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7205:
-

 Summary: Optimize MDT table deletion
 Key: HUDI-7205
 URL: https://issues.apache.org/jira/browse/HUDI-7205
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: sivabalan narayanan


Hudi tries to honor MDT disablement as part of every write. But the deletion is 
triggered every time, even if the metadata table does not exist and all configs are 
already disabled. 

 

This results in updating hoodie.properties repeatedly and can run into 
concurrency issues. 

 
{code:java}
23/12/07 04:34:32 ERROR DagScheduler: Exception executing node
org.apache.hudi.exception.HoodieIOException: Error updating table configs.
        at 
org.apache.hudi.common.table.HoodieTableConfig.modify(HoodieTableConfig.java:445)
        at 
org.apache.hudi.common.table.HoodieTableConfig.update(HoodieTableConfig.java:454)
        at 
org.apache.hudi.common.table.HoodieTableConfig.setMetadataPartitionState(HoodieTableConfig.java:780)
        at 
org.apache.hudi.common.table.HoodieTableConfig.clearMetadataPartitions(HoodieTableConfig.java:811)
        at 
org.apache.hudi.metadata.HoodieTableMetadataUtil.deleteMetadataTable(HoodieTableMetadataUtil.java:1412)
        at 
org.apache.hudi.metadata.HoodieTableMetadataUtil.deleteMetadataTable(HoodieTableMetadataUtil.java:289)
        at 
org.apache.hudi.table.HoodieTable.maybeDeleteMetadataTable(HoodieTable.java:953)
        at 
org.apache.hudi.table.HoodieSparkTable.getMetadataWriter(HoodieSparkTable.java:116)
        at 
org.apache.hudi.table.HoodieTable.getMetadataWriter(HoodieTable.java:905)
        at 
org.apache.hudi.client.BaseHoodieWriteClient.writeTableMetadata(BaseHoodieWriteClient.java:360)
        at 
org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:286)
        at 
org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:236)
        at 
org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:104){code}
{code:java}
        at 
org.apache.hudi.utilities.streamer.StreamSync.writeToSinkAndDoMetaSync(StreamSync.java:826)
        at 
org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:450)
        at 
org.apache.hudi.integ.testsuite.HoodieDeltaStreamerWrapper.upsert(HoodieDeltaStreamerWrapper.java:48)
        at 
org.apache.hudi.integ.testsuite.HoodieDeltaStreamerWrapper.insert(HoodieDeltaStreamerWrapper.java:52)
        at 
org.apache.hudi.integ.testsuite.HoodieInlineTestSuiteWriter.insert(HoodieInlineTestSuiteWriter.java:111)
        at 
org.apache.hudi.integ.testsuite.dag.nodes.InsertNode.ingest(InsertNode.java:70)
        at 
org.apache.hudi.integ.testsuite.dag.nodes.InsertNode.execute(InsertNode.java:53)
        at 
org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:135)
        at 
org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:104)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
s3a://jenkins-infra-hudi/hudi/job-run/HudiIntegTestsDeltastreamerAsyncManualEKS/data/2023-12-07/30/MERGE_ON_READdeltastreamer-non-partitioned.yamltest-nonpartitioned.properties/91/output/.hoodie/hoodie.properties
 already exists
        at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:813)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1195){code}
{code:java}
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1175)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1064)
        at 
org.apache.hudi.common.fs.HoodieWrapperFileSystem.lambda$create$2(HoodieWrapperFileSystem.java:238)
        at 
org.apache.hudi.common.fs.HoodieWrapperFileSystem.executeFuncWithTimeMetrics(HoodieWrapperFileSystem.java:114)
        at 
org.apache.hudi.common.fs.HoodieWrapperFileSystem.create(HoodieWrapperFileSystem.java:237)
        at 
org.apache.hudi.common.table.HoodieTableConfig.recoverIfNeeded(HoodieTableConfig.java:389)
        at 
org.apache.hudi.common.table.HoodieTableConfig.modify(HoodieTableConfig.java:410)
        ... 26 more
23/12/07 04:34:32 INFO DagScheduler: Forcing shutdown of executor service, this 
might kill running tasks
23/12/07 04:34:32 ERROR HoodieTestSuiteJob: Failed to run Test Suite 
java.util.concurrent.ExecutionException: 
org.apache.hudi.exception.HoodieException: 
org.apache.hudi.exception.HoodieIOException: Error updating table configs.
        at 

[jira] [Created] (HUDI-7199) Optimize instantsAsStream in HoodieDefaultTimeline

2023-12-08 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7199:
-

 Summary: Optimize instantsAsStream in HoodieDefaultTimeline
 Key: HUDI-7199
 URL: https://issues.apache.org/jira/browse/HUDI-7199
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: sivabalan narayanan


Optimize instantsAsStream in HoodieDefaultTimeline



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7188) Master is failing due to test failure Dec 6, 2023

2023-12-06 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7188:
-

 Summary: Master is failing due to test failure Dec 6, 2023
 Key: HUDI-7188
 URL: https://issues.apache.org/jira/browse/HUDI-7188
 Project: Apache Hudi
  Issue Type: Improvement
  Components: tests-ci
Reporter: sivabalan narayanan


After this patch, master is broken: 

[https://github.com/apache/hudi/pull/9667]

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7187) Fix integ test props to honor new streamer properties

2023-12-06 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7187:
-

 Summary: Fix integ test props to honor new streamer properties 
 Key: HUDI-7187
 URL: https://issues.apache.org/jira/browse/HUDI-7187
 Project: Apache Hudi
  Issue Type: Improvement
  Components: tests-ci
Reporter: sivabalan narayanan


As of now, all integ test properties files are holding deltastreamer props. We 
need to change them to streamer props. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7051) Incorrect replace operation in compaction strategy filter

2023-12-05 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793494#comment-17793494
 ] 

sivabalan narayanan commented on HUDI-7051:
---

hey [~vmaster]: 
sorry, I am a bit confused. 

As per master, filterPartitionPaths in DayBasedCompactionStrategy is as below 



 
{code:java}
@Override
public List<String> filterPartitionPaths(HoodieWriteConfig writeConfig, List<String> allPartitionPaths) {
  return allPartitionPaths.stream().sorted(comparator)
      .collect(Collectors.toList())
      .subList(0, Math.min(allPartitionPaths.size(), writeConfig.getTargetPartitionsPerDayBasedCompaction()));
} {code}
 

 

Only in 

BoundedPartitionAwareCompactionStrategy.filterPartitionPaths do I see the replace 
operations. 

But can you help me understand what the issue is there? I understand 
"dllr_date=2023/10/10" may not be an actual partition present physically, but 
that is an interim state used for comparison, and later we switch it back. 

 

In other words, 

if the original partition is hyphenated: 

 

dllr_date=2023-10-10 gets converted to "dllr_date=2023/10/10", then 
comparisons are performed to sort them, and then it is converted back to 
dllr_date=2023-10-10. So, I am not sure where the bug is here. Can you throw some 
light please?
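
For reference, a standalone version of the stream under discussion, runnable outside Hudi. 
Whether the final replace of "-" with "/" restores the original value depends on whether 
the original partition path itself contains hyphens (hive-style date values like the one 
above do):
{code:java}
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class PartitionSortSketch {
  public static void main(String[] args) {
    List<String> partitions = List.of("dllr_date=2023-10-10", "dllr_date=2023-10-11");
    List<String> result = partitions.stream()
        .map(p -> p.replace("/", "-"))          // normalize separators before comparing
        .sorted(Comparator.reverseOrder())
        .map(p -> p.replace("-", "/"))          // convert back after sorting
        .collect(Collectors.toList());
    System.out.println(result);
    // prints [dllr_date=2023/10/11, dllr_date=2023/10/10]: the hyphens that were
    // part of the original hive-style values are rewritten too.
  }
}
{code}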

 

> Incorrect replace operation in compaction strategy filter
> -
>
> Key: HUDI-7051
> URL: https://issues.apache.org/jira/browse/HUDI-7051
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction
>Reporter: vmaster.cc
>Priority: Major
> Attachments: image-2023-11-08-16-01-46-166.png, 
> image-2023-11-08-16-02-39-291.png
>
>
> There are some incorrect replace operation to sort all partition paths.
> {code:java}
> return allPartitionPaths.stream().map(partition -> partition.replace("/", 
> "-"))
> .sorted(Comparator.reverseOrder()).map(partitionPath -> 
> partitionPath.replace("-", "/")) {code}
> the hive partition before the replace is dllr_date=2023-10-10; after it, it is 
> converted to dllr_date=2023/10/10, which is an incorrect partition.
>  # org.apache.hudi.table.action.compact.strategy.DayBasedCompactionStrategy
>  # 
> org.apache.hudi.table.action.compact.strategy.BoundedPartitionAwareCompactionStrategy
>  # 
> org.apache.hudi.table.action.compact.strategy.UnBoundedPartitionAwareCompactionStrategy
> !image-2023-11-08-16-02-39-291.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7154) Hudi Streamer with row writer enabled hits NPE with empty batch

2023-12-04 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7154:
-

Assignee: sivabalan narayanan

> Hudi Streamer with row writer enabled hits NPE with empty batch
> ---
>
> Key: HUDI-7154
> URL: https://issues.apache.org/jira/browse/HUDI-7154
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Hudi Streamer with row writer enabled hits NPE with empty batch (the 
> checkpoint has advanced)
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.getBulkInsertRowConfig(HoodieSparkSqlWriter.scala:1190)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter.getBulkInsertRowConfig(HoodieSparkSqlWriter.scala)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.prepareHoodieConfigForRowWriter(StreamSync.java:801)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.writeToSink(StreamSync.java:939)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.writeToSinkAndDoMetaSync(StreamSync.java:819)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:458)
>   at 
> org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.ingestOnce(HoodieStreamer.java:850)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:97) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7154) Hudi Streamer with row writer enabled hits NPE with empty batch

2023-12-04 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7154.
-
Resolution: Fixed

> Hudi Streamer with row writer enabled hits NPE with empty batch
> ---
>
> Key: HUDI-7154
> URL: https://issues.apache.org/jira/browse/HUDI-7154
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Hudi Streamer with row writer enabled hits NPE with empty batch (the 
> checkpoint has advanced)
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.getBulkInsertRowConfig(HoodieSparkSqlWriter.scala:1190)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter.getBulkInsertRowConfig(HoodieSparkSqlWriter.scala)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.prepareHoodieConfigForRowWriter(StreamSync.java:801)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.writeToSink(StreamSync.java:939)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.writeToSinkAndDoMetaSync(StreamSync.java:819)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:458)
>   at 
> org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.ingestOnce(HoodieStreamer.java:850)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:97) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6980) Spark job stuck after completion, due to some non daemon threads still running

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-6980:
-

Assignee: sivabalan narayanan

> Spark job stuck after completion, due to some non daemon threads still 
> running 
> ---
>
> Key: HUDI-6980
> URL: https://issues.apache.org/jira/browse/HUDI-6980
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Github Issue - [https://github.com/apache/hudi/issues/9826]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6958) Update Schema Evolution Documentation

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-6958.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Update Schema Evolution Documentation
> -
>
> Key: HUDI-6958
> URL: https://issues.apache.org/jira/browse/HUDI-6958
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, spark-sql
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Update the schema evolution page to document the new changes



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6961) Deletes with custom delete field not working with DefaultHoodieRecordPayload

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-6961.
-
Resolution: Fixed

> Deletes with custom delete field not working with DefaultHoodieRecordPayload
> 
>
> Key: HUDI-6961
> URL: https://issues.apache.org/jira/browse/HUDI-6961
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> When configuring a custom delete key and delete marker with 
> DefaultHoodieRecordPayload, writing fails when there are deletes in the batch:
> {code:java}
> Error for key:HoodieKey { recordKey=0 partitionPath=} is 
> java.util.NoSuchElementException: No value present in Option
>   at org.apache.hudi.common.util.Option.get(Option.java:89)
>   at 
> org.apache.hudi.common.model.HoodieAvroRecord.prependMetaFields(HoodieAvroRecord.java:132)
>   at 
> org.apache.hudi.io.HoodieCreateHandle.doWrite(HoodieCreateHandle.java:144)
>   at 
> org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:180)
>   at 
> org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:98)
>   at 
> org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:42)
>   at 
> org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:69)
>   at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:80)
>   at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:39)
>   at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:352)
>   at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1508)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1418)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1482)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1305)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6979) support EventTimeBasedCompactionStrategy

2023-11-30 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791711#comment-17791711
 ] 

sivabalan narayanan commented on HUDI-6979:
---

this will definitely be a good addition

 

> support EventTimeBasedCompactionStrategy
> 
>
> Key: HUDI-6979
> URL: https://issues.apache.org/jira/browse/HUDI-6979
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: compaction
>Reporter: Kong Wei
>Assignee: Kong Wei
>Priority: Major
>
> The current compaction strategies are based on the log file size, the number 
> of log files, etc. The data time of the RO table generated by these 
> strategies is uncontrollable. Hudi also has a DayBased strategy, but it 
> relies on a day-based partition path and the time granularity is coarse.
> The *EventTimeBasedCompactionStrategy* strategy can generate event 
> time-friendly RO tables, whether it is day based partition or not. For 
> example, the strategy can select all logfiles whose data time is before 3 am 
> for compaction, so that the generated RO table data is before 3 am. If we 
> just want to query data before 3 am, we can just query the RO table which is 
> much faster.
> With the strategy, I think we can expand the application scenarios of RO 
> tables.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6999) Add row writer support to Deltastreamer

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-6999.
-
Fix Version/s: 1.1.0
 Assignee: sivabalan narayanan
   Resolution: Fixed

> Add row writer support to Deltastreamer
> ---
>
> Key: HUDI-6999
> URL: https://issues.apache.org/jira/browse/HUDI-6999
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> We have not yet leveraged row writer support in Deltastreamer. We can benefit 
> from a perf improvement if we integrate it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7000) Fix HoodieActiveTimeline::deleteInstantFileIfExists not show the file path when occur delete not success

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7000.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Fix HoodieActiveTimeline::deleteInstantFileIfExists not show the file path 
> when occur delete not success
> 
>
> Key: HUDI-7000
> URL: https://issues.apache.org/jira/browse/HUDI-7000
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Fix HoodieActiveTimeline::deleteInstantFileIfExists not show the file path 
> when occur delete not success
> When deletion of some instants fails, we only report the failed instant 
> without the path, but users need the path to get more details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7002) Add support for non-partitioned dataset w/ RLI

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7002.
-
Fix Version/s: 0.14.1
 Assignee: sivabalan narayanan
   Resolution: Fixed

> Add support for non-partitioned dataset w/ RLI
> --
>
> Key: HUDI-7002
> URL: https://issues.apache.org/jira/browse/HUDI-7002
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> We need to support RLI with non-partitioned datasets as well: 
> both initialization of RLI for an existing table and for new tables.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7006) reduce unnecessary isEmpty checks in StreamSync

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7006.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> reduce unnecessary isEmpty checks in StreamSync
> ---
>
> Key: HUDI-7006
> URL: https://issues.apache.org/jira/browse/HUDI-7006
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Harshal Patil
>Assignee: Harshal Patil
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7004) Add support of snapshotLoadQuerySplitter in s3/gcs sources

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7004.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

> Add support of snapshotLoadQuerySplitter in s3/gcs sources
> --
>
> Key: HUDI-7004
> URL: https://issues.apache.org/jira/browse/HUDI-7004
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Harshal Patil
>Assignee: Harshal Patil
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7003) Add option to fallback to full table scan for s3/gcs sources

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7003.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Add option to fallback to full table scan for s3/gcs sources
> 
>
> Key: HUDI-7003
> URL: https://issues.apache.org/jira/browse/HUDI-7003
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Harshal Patil
>Assignee: Harshal Patil
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7009) Filter out null value records from avro kafka source

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7009.
-
Resolution: Fixed

> Filter out null value records from avro kafka source
> 
>
> Key: HUDI-7009
> URL: https://issues.apache.org/jira/browse/HUDI-7009
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7014) Follow up HUDI-6975, optimize the code of BoundedPartitionAwareCompactionStrategy

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7014.
-
Fix Version/s: 0.14.1
 Assignee: sivabalan narayanan
   Resolution: Not A Problem

> Follow up HUDI-6975, optimize the code of 
> BoundedPartitionAwareCompactionStrategy
> -
>
> Key: HUDI-7014
> URL: https://issues.apache.org/jira/browse/HUDI-7014
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: kwang
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: compaction, pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7054) ShowPartitionsCommand should consider lazy delete_partitions

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7054.
-
Fix Version/s: 0.14.1
 Assignee: Hui An
   Resolution: Fixed

> ShowPartitionsCommand should consider lazy delete_partitions
> 
>
> Key: HUDI-7054
> URL: https://issues.apache.org/jira/browse/HUDI-7054
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7052) Fix partition key validation for key generators.

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7052.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Fix partition key validation for key generators.
> 
>
> Key: HUDI-7052
> URL: https://issues.apache.org/jira/browse/HUDI-7052
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

