[jira] [Created] (HUDI-5523) Support force rollback to a history instant

2023-01-09 Thread Danny Chen (Jira)
Danny Chen created HUDI-5523:


 Summary: Support force rollback to a history instant
 Key: HUDI-5523
 URL: https://issues.apache.org/jira/browse/HUDI-5523
 Project: Apache Hudi
  Issue Type: New Feature
  Components: hudi-utilities
Reporter: Danny Chen


Currently, instants can only be rolled back one by one. Say we have 4 instants 
here:

instant1, instant2, instant3, instant4

 

If we want to roll back to instant1, instant4 must be rolled back first, 
then instant3, and so forth.

The request here is to support rolling back to instant1 directly.
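For illustration, a minimal sketch of today's one-by-one workflow using the 
existing write client's rollback call (BaseHoodieWriteClient#rollback); the 
helper, client variable, and instant names are illustrative, not the proposed API:

{code:java}
import org.apache.hudi.client.SparkRDDWriteClient;

public class RollbackToInstantSketch {
  // Roll back instants one by one, newest first, until only instant1 remains.
  // The requested feature would collapse this loop into a single operation.
  static void rollbackTo(SparkRDDWriteClient<?> client, String... newestFirst) {
    for (String instant : newestFirst) {
      client.rollback(instant); // rollbacks must proceed from newest to oldest
    }
  }
  // usage: rollbackTo(client, "instant4", "instant3", "instant2");
}
{code}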



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #7633: Fix Deletes issued without any prior commits

2023-01-09 Thread GitBox


hudi-bot commented on PR #7633:
URL: https://github.com/apache/hudi/pull/7633#issuecomment-1376862712

   
   ## CI report:
   
   * d10474141c5493f9ec6712da8c8f4fff595e94cf UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7631: [MINOR] Remove useless RollbackTimeline

2023-01-09 Thread GitBox


hudi-bot commented on PR #7631:
URL: https://github.com/apache/hudi/pull/7631#issuecomment-1376862673

   
   ## CI report:
   
   * 4e4887acea4e6197e6822eafd0aabf48a9378d78 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14209)
 
   * 0497dd4ecacc4f85c7c0fb138e075d89d1296a8b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7615: [HUDI-5510] Reload active timeline when getInstantsToArchive

2023-01-09 Thread GitBox


hudi-bot commented on PR #7615:
URL: https://github.com/apache/hudi/pull/7615#issuecomment-1376862564

   
   ## CI report:
   
   * c93470f2891c96f71c859677ad132c24f2eb373e Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14213)
 
   * 1b42230e664a1f4554bc072e3198ee4f2ec8f32e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14215)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7372: [HUDI-5326] Fix clustering group building in SparkSizeBasedClusteringPlanStrategy

2023-01-09 Thread GitBox


hudi-bot commented on PR #7372:
URL: https://github.com/apache/hudi/pull/7372#issuecomment-1376862113

   
   ## CI report:
   
   * 46b644c42fca50efc48bde48a4a4e32c05aa50c9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14208)
 
   * 40f9031fcc3215124deffc83d2b0f8eb55440374 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7365: [HUDI-5317] Fix insert overwrite table for partitioned table

2023-01-09 Thread GitBox


hudi-bot commented on PR #7365:
URL: https://github.com/apache/hudi/pull/7365#issuecomment-1376862030

   
   ## CI report:
   
   * b8793478965fff04d0df199741ce28909d6695e7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14186)
 
   * 8225ee2bab3f4d4a208a763c9df124eb7a85c0ce Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14212)
 
   * b4a9d1ac22e37b14012e4c85925347bafe8bbda8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7615: [HUDI-5510] Reload active timeline when getInstantsToArchive

2023-01-09 Thread GitBox


hudi-bot commented on PR #7615:
URL: https://github.com/apache/hudi/pull/7615#issuecomment-1376856826

   
   ## CI report:
   
   * 6d63669cac8191009b6fc7df2e9ff768463f00d1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14145)
 
   * c93470f2891c96f71c859677ad132c24f2eb373e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14213)
 
   * 1b42230e664a1f4554bc072e3198ee4f2ec8f32e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7372: [HUDI-5326] Fix clustering group building in SparkSizeBasedClusteringPlanStrategy

2023-01-09 Thread GitBox


hudi-bot commented on PR #7372:
URL: https://github.com/apache/hudi/pull/7372#issuecomment-1376856211

   
   ## CI report:
   
   * 46b644c42fca50efc48bde48a4a4e32c05aa50c9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14208)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on issue #7487: [SUPPORT] S3 Buckets reached quota limit when reading from hudi tables

2023-01-09 Thread GitBox


yihua commented on issue #7487:
URL: https://github.com/apache/hudi/issues/7487#issuecomment-1376842456

   503 errors mean that the S3 request throttling limit has been hit, causing 
backlog or timeouts and making jobs fail easily.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on issue #7487: [SUPPORT] S3 Buckets reached quota limit when reading from hudi tables

2023-01-09 Thread GitBox


yihua commented on issue #7487:
URL: https://github.com/apache/hudi/issues/7487#issuecomment-1376840871

   Hi @AdarshKadameriTR, to fully understand where these S3 requests / API calls 
come from, you should enable S3 request logs by setting 
`log4j.logger.com.amazonaws.request=DEBUG` in the log4j properties file and 
adding the following Spark configs:
   ```
   --conf 
spark.driver.extraJavaOptions="-Dlog4j.configuration=file://s3-debug.log4j.properties"
   --conf 
spark.executor.extraJavaOptions="-Dlog4j.configuration=file://s3-debug.log4j.properties"
   ```
   Then in the driver and executor logs you'll see request logs like:
   ```
   DEBUG request: Sending Request: GET 
https://.s3.us-east-2.amazonaws.com / Parameters: 
({"list-type":["2"],"delimiter":["/"],"max-keys":["5000"],"prefix":["table/.hoodie/metadata/.hoodie/"],"fetch-owner":["false"]}Headers:
 ... 
   ```
   This helps you understand which prefix / directory triggers the most S3 
requests, and then we can dig deeper into why that's happening.
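   For reference, a minimal sketch of what the `s3-debug.log4j.properties` file 
could look like, assuming Log4j 1.x as implied by `-Dlog4j.configuration` above; 
the appender setup is illustrative:
   ```
   # Minimal Log4j 1.x config (illustrative); only the last line is the one
   # the comment above actually requires.
   log4j.rootLogger=INFO, console
   log4j.appender.console=org.apache.log4j.ConsoleAppender
   log4j.appender.console.layout=org.apache.log4j.PatternLayout
   log4j.appender.console.layout.ConversionPattern=%d %p %c: %m%n
   log4j.logger.com.amazonaws.request=DEBUG
   ```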


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] pratyakshsharma commented on pull request #6926: [HUDI-3676] Enhance tests for trigger clean every Nth commit

2023-01-09 Thread GitBox


pratyakshsharma commented on PR #6926:
URL: https://github.com/apache/hudi/pull/6926#issuecomment-1376839387

   This is good for another pass @yihua 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #7607: [HUDI-5499] Fixing Spark SQL configs not being properly propagated for CTAS and other commands

2023-01-09 Thread GitBox


yihua commented on code in PR #7607:
URL: https://github.com/apache/hudi/pull/7607#discussion_r1065408280


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala:
##
@@ -81,10 +80,8 @@ trait ProvidesHoodieConfig extends Logging {
 HoodieSyncConfig.META_SYNC_PARTITION_FIELDS.key -> 
tableConfig.getPartitionFieldProp,
 HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key -> 
hiveSyncConfig.getStringOrDefault(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS),
 HiveSyncConfigHolder.HIVE_SUPPORT_TIMESTAMP_TYPE.key -> 
hiveSyncConfig.getBoolean(HiveSyncConfigHolder.HIVE_SUPPORT_TIMESTAMP_TYPE).toString,
-HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key -> 
hoodieProps.getString(HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key, "200"),

Review Comment:
   Definitely, we should not hardcode any defaults here.  As long as the 
configs take effect, I'm good.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] liaotian1005 opened a new pull request, #7633: Deletes issued without any prior commits

2023-01-09 Thread GitBox


liaotian1005 opened a new pull request, #7633:
URL: https://github.com/apache/hudi/pull/7633

   ```sql
   create table hudi_cow_nonpcf_tbl (
     uuid int,
     name string,
     price double
   ) using hudi;

   delete from hudi_cow_nonpcf_tbl where uuid = 1;
   ```
   This throws:
   ```
   org.apache.hudi.exception.HoodieIOException: Deletes issued without any prior commits
       at org.apache.hudi.client.BaseHoodieWriteClient.setWriteSchemaForDeletes(BaseHoodieWriteClient.java:1509)
   ```
   An error occurs when the table does not have any instants for delete 
statements.
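   For context, a paraphrased sketch (simplified from memory, not the exact 
source; see the stack trace above for the real location) of the code path in 
`BaseHoodieWriteClient#setWriteSchemaForDeletes` that raises this exception: 
with no completed instant on the timeline, there is no commit metadata from 
which to derive a write schema for the delete.
   ```java
   import org.apache.hudi.common.table.timeline.HoodieInstant;
   import org.apache.hudi.common.util.Option;
   import org.apache.hudi.exception.HoodieIOException;
   import org.apache.hudi.table.HoodieTable;

   class DeleteSchemaSketch {
     static void requirePriorCommit(HoodieTable<?, ?, ?, ?> table) {
       Option<HoodieInstant> lastInstant = table.getActiveTimeline()
           .getCommitsTimeline().filterCompletedInstants().lastInstant();
       if (!lastInstant.isPresent()) {
         // the table has no completed commits, so no schema can be derived
         throw new HoodieIOException("Deletes issued without any prior commits");
       }
     }
   }
   ```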


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on issue #7545: [SUPPORT]How to sync data from Kafka to Hudi when use Flink SQL canal-json format

2023-01-09 Thread GitBox


yihua commented on issue #7545:
URL: https://github.com/apache/hudi/issues/7545#issuecomment-1376827572

   @With-winds I'm curious, did you find a solution?  If so, would you mind 
sharing it so others can benefit from it too?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5522) Improve docs for disaster recovery

2023-01-09 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5522:

Fix Version/s: 0.13.0

> Improve docs for disaster recovery
> --
>
> Key: HUDI-5522
> URL: https://issues.apache.org/jira/browse/HUDI-5522
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.13.0
>
>
> Related issue: [https://github.com/apache/hudi/issues/7589]
> [https://hudi.apache.org/docs/disaster_recovery]
>  
> For savepointing, users can use the Hudi CLI, Spark SQL procedures, or the 
> write client API.  We should give examples in the Disaster Recovery page.
> Also, we need to update the docs of the create_savepoints procedure, as the 
> information there is not accurate. 
> https://hudi.apache.org/docs/procedures#create_savepoints



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] yihua commented on issue #7589: Keep only clustered file(all) after cleaning

2023-01-09 Thread GitBox


yihua commented on issue #7589:
URL: https://github.com/apache/hudi/issues/7589#issuecomment-1376825649

   Hi @maheshguptags you can also create savepoints using [Spark SQL 
procedures](https://hudi.apache.org/docs/0.11.1/procedures#create_savepoints): 
`call create_savepoints(table => 'hudi_trips_cow', commit_Time => 
'20230110224424600')` .  We'll update the docs 
([HUDI-5522](https://issues.apache.org/jira/browse/HUDI-5522)).
   
   If you're comfortable using Hudi APIs, you may also call 
[SparkRDDWriteClient#savepoint](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java#L716)
 to programmatically create savepoints.
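   For reference, a minimal sketch of the programmatic route, assuming an 
already initialized write client; the user and comment strings are illustrative:
   ```java
   import org.apache.hudi.client.SparkRDDWriteClient;

   class SavepointSketch {
     static void createSavepoint(SparkRDDWriteClient<?> client, String instantTime) {
       // Marks the instant as a savepoint so the cleaner retains the
       // files needed to restore the table to it.
       client.savepoint(instantTime, "admin", "savepoint before risky change");
     }
   }
   ```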


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5522) Improve docs for disaster recovery

2023-01-09 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5522:

Description: 
Related issue: [https://github.com/apache/hudi/issues/7589]

[https://hudi.apache.org/docs/disaster_recovery]

 

For savepointing, users can use the Hudi CLI, Spark SQL procedures, or the write 
client API.  We should give examples in the Disaster Recovery page.

Also, we need to update the docs of the create_savepoints procedure, as the 
information there is not accurate. https://hudi.apache.org/docs/procedures#create_savepoints

> Improve docs for disaster recovery
> --
>
> Key: HUDI-5522
> URL: https://issues.apache.org/jira/browse/HUDI-5522
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>
> Related issue: [https://github.com/apache/hudi/issues/7589]
> [https://hudi.apache.org/docs/disaster_recovery]
>  
> For savepointing, users can use the Hudi CLI, Spark SQL procedures, or the 
> write client API.  We should give examples in the Disaster Recovery page.
> Also, we need to update the docs of the create_savepoints procedure, as the 
> information there is not accurate. 
> https://hudi.apache.org/docs/procedures#create_savepoints



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5522) Improve docs for disaster recovery

2023-01-09 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-5522:
---

 Summary: Improve docs for disaster recovery
 Key: HUDI-5522
 URL: https://issues.apache.org/jira/browse/HUDI-5522
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] JerryYue-M commented on pull request #6134: WIP:refactor hoodie stream source based flip-27 and support watermark

2023-01-09 Thread GitBox


JerryYue-M commented on PR #6134:
URL: https://github.com/apache/hudi/pull/6134#issuecomment-1376800563

   > what's the status of this pr right now?
   This is ongoing; it will be rebased onto master and some unit tests will be 
added later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7631: [MINOR] Remove useless RollbackTimeline

2023-01-09 Thread GitBox


hudi-bot commented on PR #7631:
URL: https://github.com/apache/hudi/pull/7631#issuecomment-137671

   
   ## CI report:
   
   * 4e4887acea4e6197e6822eafd0aabf48a9378d78 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14209)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6133: [HUDI-1575] Early Conflict Detection For Multi-writer

2023-01-09 Thread GitBox


hudi-bot commented on PR #6133:
URL: https://github.com/apache/hudi/pull/6133#issuecomment-1376790832

   
   ## CI report:
   
   * dbe3db845908d261baa5a1aa71d19e0db55816de UNKNOWN
   * 678cce4a9748cb54a90a559384a0cb0443082535 UNKNOWN
   * 6fc5bf1ce7921bf25acc3659565457264d8b9dc2 UNKNOWN
   * 0b74647767677a4cc1193295b493dc0537dd4c96 UNKNOWN
   * 3369e5e8770cf9eb4c4d272f7c3af54933c992aa UNKNOWN
   * 1ccecb4fa727cc254cf4780012c28bab24e6afde UNKNOWN
   * 6fdf901df1086d6ecc07c7987b6a3212b08eaefb UNKNOWN
   * 1b837ec65943331f6c805f723b088240ae66dfaf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14207)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14214)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #7627: [HUDI-5517] HoodieTimeline support filter instants by state transition time

2023-01-09 Thread GitBox


danny0405 commented on code in PR #7627:
URL: https://github.com/apache/hudi/pull/7627#discussion_r1065348765


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstant.java:
##
@@ -46,14 +55,35 @@ public class HoodieInstant implements Serializable, 
Comparable<HoodieInstant> {
   public static final Comparator<HoodieInstant> COMPARATOR = 
Comparator.comparing(HoodieInstant::getTimestamp)
   .thenComparing(ACTION_COMPARATOR).thenComparing(HoodieInstant::getState);
 
+  public static final Comparator<HoodieInstant> STATE_TRANSITION_COMPARATOR =
+  Comparator.comparing(HoodieInstant::getStateTransitionTime)
+  .thenComparing(HoodieInstant::getTimestamp)
+  
.thenComparing(ACTION_COMPARATOR).thenComparing(HoodieInstant::getState);
+
   public static String getComparableAction(String action) {
 return COMPARABLE_ACTIONS.getOrDefault(action, action);
   }
 
+  public static String extractTimestamp(String fileName) throws 
IllegalArgumentException {
+Objects.requireNonNull(fileName);
+
+Matcher matcher = NAME_FORMAT.matcher(fileName);
+if (matcher.find()) {
+  return matcher.group(1);
+}
+
+throw new IllegalArgumentException(String.format(FILE_NAME_FORMAT_ERROR, 
fileName));
+  }
+
   public static String getTimelineFileExtension(String fileName) {
 Objects.requireNonNull(fileName);
-int dotIndex = fileName.indexOf('.');
-return dotIndex == -1 ? "" : fileName.substring(dotIndex);
+
+Matcher matcher = NAME_FORMAT.matcher(fileName);

Review Comment:
   Is the regex expr match efficient enough here ?
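   For anyone weighing this, a rough micro-benchmark sketch comparing the two 
approaches; the pattern string is illustrative and not the PR's actual 
`NAME_FORMAT`, while the `indexOf`-based extraction mirrors the removed lines 
above:
   ```java
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;

   public class ExtensionBench {
     // Illustrative pattern: leading timestamp digits, optional dotted suffix
     static final Pattern NAME_FORMAT = Pattern.compile("^(\\d+)(\\..*)?$");

     public static void main(String[] args) {
       String name = "20230109153000123.commit";
       long t0 = System.nanoTime();
       for (int i = 0; i < 1_000_000; i++) {
         Matcher m = NAME_FORMAT.matcher(name);
         if (m.find()) {
           m.group(2); // extension via regex group
         }
       }
       long t1 = System.nanoTime();
       for (int i = 0; i < 1_000_000; i++) {
         int dot = name.indexOf('.'); // the pre-PR approach
         String ext = dot == -1 ? "" : name.substring(dot);
       }
       long t2 = System.nanoTime();
       System.out.printf("regex: %d ms, indexOf: %d ms%n",
           (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
     }
   }
   ```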



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhangyue19921010 commented on pull request #6133: [HUDI-1575] Early Conflict Detection For Multi-writer

2023-01-09 Thread GitBox


zhangyue19921010 commented on PR #6133:
URL: https://github.com/apache/hudi/pull/6133#issuecomment-1376756472

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] racc commented on issue #6808: [SUPPORT] Cannot sync to spark embedded derby hive meta store (the default one)

2023-01-09 Thread GitBox


racc commented on issue #6808:
URL: https://github.com/apache/hudi/issues/6808#issuecomment-1376748977

   ```java
   // Imports assumed here (package names as in the Hive 3.x metastore APIs)
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.hive.conf.HiveConf;
   import org.apache.hadoop.hive.metastore.MetaStoreInitContext;
   import org.apache.hadoop.hive.metastore.MetaStoreInitListener;
   import org.apache.hadoop.hive.metastore.txn.TxnDbUtil;

   public class MetaStoreTxnDbUtilPrep extends MetaStoreInitListener {
       public MetaStoreTxnDbUtilPrep(Configuration config) {
           super(config);
       }

       @Override
       public void onInit(MetaStoreInitContext metaStoreInitContext) {
           try {
               // Pre-create the transaction-related metastore tables
               TxnDbUtil.prepDb(new HiveConf(this.getConf(), HiveConf.class));
           } catch (Exception e) {
               throw new RuntimeException(e);
           }
       }
   }

   // Pseudocode for setting Spark conf
   .config("javax.jdo.option.ConnectionURL", "jdbc:derby:;databaseName=APP;create=true")
   .config("datanucleus.schema.autoCreateTables", "true")
   .config("datanucleus.autoStartMechanism", "SchemaTable")
   .config("hive.metastore.schema.verification", "false")
   .config("hive.txn.strict.locking.mode", "false")
   .config("hive.metastore.init.hooks", MetaStoreTxnDbUtilPrep.class.getCanonicalName())
   .config("spark.sql.warehouse.dir", )

   System.setProperty("derby.system.home", )
   ```
   FYI I was able to have some success getting Iceberg working with the 
above settings - I used a MetaStore init hook to ensure the transaction-related 
tables were created...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6133: [HUDI-1575] Early Conflict Detection For Multi-writer

2023-01-09 Thread GitBox


hudi-bot commented on PR #6133:
URL: https://github.com/apache/hudi/pull/6133#issuecomment-1376737665

   
   ## CI report:
   
   * dbe3db845908d261baa5a1aa71d19e0db55816de UNKNOWN
   * 678cce4a9748cb54a90a559384a0cb0443082535 UNKNOWN
   * 6fc5bf1ce7921bf25acc3659565457264d8b9dc2 UNKNOWN
   * 0b74647767677a4cc1193295b493dc0537dd4c96 UNKNOWN
   * 3369e5e8770cf9eb4c4d272f7c3af54933c992aa UNKNOWN
   * 1ccecb4fa727cc254cf4780012c28bab24e6afde UNKNOWN
   * 6fdf901df1086d6ecc07c7987b6a3212b08eaefb UNKNOWN
   * 1b837ec65943331f6c805f723b088240ae66dfaf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14207)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7605: [HUDI-5349] Clean up partially failed restore

2023-01-09 Thread GitBox


hudi-bot commented on PR #7605:
URL: https://github.com/apache/hudi/pull/7605#issuecomment-1376735097

   
   ## CI report:
   
   * 5c77de2450baf4d3dcb153ee53a57e006926a612 UNKNOWN
   * 6eacbe50e932dfef39439f3f025cdb946326c180 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14206)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] kepplertreet commented on issue #7628: [SUPPORT] Hudi Metadata Column Stats Fail

2023-01-09 Thread GitBox


kepplertreet commented on issue #7628:
URL: https://github.com/apache/hudi/issues/7628#issuecomment-1376709063

   Hi @alexeykudinkin  
   
   We are using an integer id column as the 
`hoodie.datasource.write.recordkey.field`. 
   Listing a few sample values:
   ```
   1263633528
   1263633530
   1263633531 
   ``` 
   
   As for the Hudi configurations, we are using:
   ```
   hoodie.table.version                                5
   hoodie.datasource.write.operation                   upsert
   hoodie.datasource.write.hive_style_partitioning     true
   hoodie.datasource.write.precombine.field            _commit_time_ms
   hoodie.datasource.write.commitmeta.key.prefix       _
   hoodie.datasource.write.insert.drop.duplicates      true
   hoodie.datasource.hive_sync.enable                  true
   hoodie.datasource.hive_sync.use_jdbc                true
   hoodie.datasource.hive_sync.auto_create_database    true
   hoodie.datasource.hive_sync.support_timestamp       false
   hoodie.datasource.hive_sync.skip_ro_suffix          true
   hoodie.parquet.compression.codec                    snappy
   hoodie.metrics.on                                   false
   hoodie.metrics.reporter.type                        PROMETHEUS_PUSHGATEWAY
   hoodie.metrics.pushgateway.host
   hoodie.metrics.pushgateway.port
   hoodie.metrics.pushgateway.random.job.name.suffix   false
   hoodie.metrics.pushgateway.report.period.seconds    30
   hoodie.metadata.enable                              true
   hoodie.metadata.metrics.enable                      true
   hoodie.metadata.clean.async                         true
   hoodie.metadata.index.column.stats.enable           true
   hoodie.metadata.index.bloom.filter.enable           true
   hoodie.metadata.index.async                         true
   hoodie.write.concurrency.mode                       OPTIMISTIC_CONCURRENCY_CONTROL
   hoodie.write.lock.provider                          org.apache.hudi.client.transaction.lock.FileSy...
   hoodie.datasource.compaction.async.enable           true
   hoodie.compact.schedule.inline                      false
   hoodie.compact.inline.trigger.strategy              NUM_COMMITS
   hoodie.compact.inline.max.delta.commits             2
   hoodie.index.type                                   BLOOM
   hoodie.cleaner.policy.failed.writes                 LAZY
   hoodie.clean.automatic                              true
   hoodie.clean.async                                  true
   hoodie.cleaner.commits.retained                     4
   hoodie.write.lock.client.num_retries                10
   hoodie.write.lock.wait_time_ms_between_retry        1000
   hoodie.write.lock.num_retries                       15
   hoodie.write.lock.wait_time_ms                      6
   hoodie.write.lock.zookeeper.connection_timeout_ms   15000
   hoodie.bloom.index.use.metadata                     true
   hoodie.archive.async                                true
   hoodie.parquet.max.file.size
   ```

[GitHub] [hudi] davidshtian commented on issue #7591: [SUPPORT] Kinesis Data Analytics Flink1.13 to HUDI

2023-01-09 Thread GitBox


davidshtian commented on issue #7591:
URL: https://github.com/apache/hudi/issues/7591#issuecomment-1376707180

   From the logs _java.util.concurrent.CompletionException: 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 Connection refused: zeppelin-flink/172.20.189.165:8082_, it seems that either 
the Flink cluster was not started at all, or it might be a network issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] boneanxs commented on pull request #7582: [HUDI-5488]Make sure Disrupt queue start first, then insert records

2023-01-09 Thread GitBox


boneanxs commented on PR #7582:
URL: https://github.com/apache/hudi/pull/7582#issuecomment-1376703720

   Gentle ping @alexeykudinkin 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] davidshtian commented on issue #7591: [SUPPORT] Kinesis Data Analytics Flink1.13 to HUDI

2023-01-09 Thread GitBox


davidshtian commented on issue #7591:
URL: https://github.com/apache/hudi/issues/7591#issuecomment-1376687802

   @soumilshah1995 I tried again and it worked, as shown below, for your 
reference. Thanks~
   
   **Step 1 – Kinesis Stream**
   https://user-images.githubusercontent.com/14228056/211456528-eb52b665-05aa-4cca-a44f-30b0d1948ab8.png
   
   **Step 2 – Jars**
   https://user-images.githubusercontent.com/14228056/211456562-92dcd98e-9964-4db9-be71-5db6e9b3ef40.png
   https://user-images.githubusercontent.com/14228056/211456572-986a2a4a-8b2e-47bd-a205-72e265d359ae.png
   
   **Step 3 – KDA Studio**
   https://user-images.githubusercontent.com/14228056/211456602-8813ca24-4b81-4b41-bcf6-c4cf31f64556.png
   
   **Step 4 – Executing the code**
   https://user-images.githubusercontent.com/14228056/211456628-08f24462-54f1-453a-b46c-80c767752c25.png
   https://user-images.githubusercontent.com/14228056/211456638-d1f6d477-4ec2-4d6d-bcfd-4d53581e1ad4.png
   https://user-images.githubusercontent.com/14228056/211456650-7e0f8a22-83d0-456f-a309-2d46cadecd89.png
   https://user-images.githubusercontent.com/14228056/211456661-f2f95408-6bf8-4361-9e3d-d50974728b78.png
   https://user-images.githubusercontent.com/14228056/211456668-9a1ec0c1-bce5-4cbe-8607-2a98d4527741.png
   https://user-images.githubusercontent.com/14228056/211456684-dfb720af-1cd5-4da9-ac2a-12c28283d4ee.png
   https://user-images.githubusercontent.com/14228056/211456694-6b23055c-bba0-4077-b2de-bb21013448ce.png
   https://user-images.githubusercontent.com/14228056/211456708-bb8bc494-a0c8-438d-9d92-8e537023c6d3.png
   https://user-images.githubusercontent.com/14228056/211456716-9c4825e7-05a7-4d35-a4ef-8f966d39056e.png
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7615: [HUDI-5510] Reload active timeline when getInstantsToArchive

2023-01-09 Thread GitBox


hudi-bot commented on PR #7615:
URL: https://github.com/apache/hudi/pull/7615#issuecomment-1376686175

   
   ## CI report:
   
   * 6d63669cac8191009b6fc7df2e9ff768463f00d1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14145)
 
   * c93470f2891c96f71c859677ad132c24f2eb373e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14213)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7365: [HUDI-5317] Fix insert overwrite table for partitioned table

2023-01-09 Thread GitBox


hudi-bot commented on PR #7365:
URL: https://github.com/apache/hudi/pull/7365#issuecomment-1376685879

   
   ## CI report:
   
   * b8793478965fff04d0df199741ce28909d6695e7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14186)
 
   * 8225ee2bab3f4d4a208a763c9df124eb7a85c0ce Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14212)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7615: [HUDI-5510] Reload active timeline when getInstantsToArchive

2023-01-09 Thread GitBox


hudi-bot commented on PR #7615:
URL: https://github.com/apache/hudi/pull/7615#issuecomment-1376682774

   
   ## CI report:
   
   * 6d63669cac8191009b6fc7df2e9ff768463f00d1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14145)
 
   * c93470f2891c96f71c859677ad132c24f2eb373e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7365: [HUDI-5317] Fix insert overwrite table for partitioned table

2023-01-09 Thread GitBox


hudi-bot commented on PR #7365:
URL: https://github.com/apache/hudi/pull/7365#issuecomment-1376682421

   
   ## CI report:
   
   * b8793478965fff04d0df199741ce28909d6695e7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14186)
 
   * 8225ee2bab3f4d4a208a763c9df124eb7a85c0ce UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7607: [HUDI-5499] Fixing Spark SQL configs not being properly propagated for CTAS and other commands

2023-01-09 Thread GitBox


hudi-bot commented on PR #7607:
URL: https://github.com/apache/hudi/pull/7607#issuecomment-1376679068

   
   ## CI report:
   
   * 32033e4a4ed91005a237aa88afa2c6adcb51169f UNKNOWN
   * 05cbda8ddcca0944c7965bd7c32448e29f97 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14205)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] trushev commented on a diff in pull request #7626: [HUDI-5516] Reduce memory footprint on workload with thousand active partitions

2023-01-09 Thread GitBox


trushev commented on code in PR #7626:
URL: https://github.com/apache/hudi/pull/7626#discussion_r1065293780


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java:
##
@@ -449,6 +450,7 @@ private boolean flushBucket(DataBucket bucket) {
 
 this.eventGateway.sendEventToCoordinator(event);
 writeStatuses.addAll(writeStatus);
+writeClient.cleanHandle(bucket.fileID);
 return true;
   }

Review Comment:
   You mean, do we need to clean all handles via `this.writeClient.cleanHandles()` 
in `flushRemaining`, since we have already cleaned the handle here?
   
   To be honest, I'm not sure. The handle map would hold an unbounded number of 
handles, and even lightweight closed handles could eventually exceed the heap 
size. I guess an LRU cache solves this problem, but I'm not sure about the 
benefits of such an approach here.
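   For concreteness, a minimal sketch of the LRU idea mentioned above, using 
`LinkedHashMap`'s access-order mode; the class name and capacity handling are 
illustrative:
   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;

   public class HandleLruCache<K, V> extends LinkedHashMap<K, V> {
     private final int maxEntries;

     public HandleLruCache(int maxEntries) {
       super(16, 0.75f, true); // access-order: gets refresh entry recency
       this.maxEntries = maxEntries;
     }

     @Override
     protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
       return size() > maxEntries; // evict the least recently used handle
     }
   }
   ```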
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] trushev commented on a diff in pull request #7626: [HUDI-5516] Reduce memory footprint on workload with thousand active partitions

2023-01-09 Thread GitBox


trushev commented on code in PR #7626:
URL: https://github.com/apache/hudi/pull/7626#discussion_r1065294170


##
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkWriteClient.java:
##
@@ -94,7 +94,7 @@
* FileID to write handle mapping in order to record the write handles for 
each file group,
* so that we can append the mini-batch data buffer incrementally.
*/
-  private final Map> bucketToHandles;
+  private final Map bucketToHandles;
 

Review Comment:
   Thanks, I'll try to rework it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-5430) Fix multi-writer handling w/ rollback blocks in MOR table (log record reader)

2023-01-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin reassigned HUDI-5430:
-

Assignee: sivabalan narayanan  (was: Alexey Kudinkin)

> Fix multi-writer handling w/ rollback blocks in MOR table (log record reader)
> -
>
> Key: HUDI-5430
> URL: https://issues.apache.org/jira/browse/HUDI-5430
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Sample log blocks and commits
>  
> lb1_c1, lb2_c2, lb3_c3, lb4_c4, lb5_rb_c3, lb6_c5.
>  
> lb3 is expected to be considered invalid, but our current scan does not treat 
> it as invalid. While parsing the rollback block, we just check the previous 
> log block for a matching timestamp. Since it does not match, we treat the 
> lb5_rb block as invalid and move on.
>  
> In the multi-writer case this is definitely feasible, so we should fix this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-5464) Fix instantiation of a new partition in MDT re-using the same instant time as a regular commit

2023-01-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin reassigned HUDI-5464:
-

Assignee: sivabalan narayanan  (was: Alexey Kudinkin)

> Fix instantiation of a new partition in MDT re-using the same instant time as 
> a regular commit
> --
>
> Key: HUDI-5464
> URL: https://issues.apache.org/jira/browse/HUDI-5464
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.13.0
>
>
> We re-use the same instant time as the commit being applied to MDT while 
> instantiating a new partition in MDT. This needs to be fixed. 
>  
> For example, let's say we have 10 commits w/ FILES already enabled. 
> For C11, we are enabling col-stats. 
> After the data table business, when we enter metadata writer instantiation, we 
> deduce that col-stats has to be instantiated and then instantiate it using DC11. 
> In the MDT timeline, we see dc11.req, dc11.inflight and dc11.complete. Then 
> we go ahead and apply the actual C11 from DT to MDT (dc11.inflight and 
> dc11.complete are updated). Here, we overwrite the same DC11 w/ records 
> pertaining to C11, 
> which is buggy. We definitely need to fix this. 
> We can add a suffix to C11 (say C11_003 or C11_001), as we do for compaction 
> and clean in MDT, so that any additional operation in MDT has a distinct commit 
> time format. For everything else, it should match w/ DT 1 on 1. 
>  
>  
> Impact:
> We are overriding the same DC for two purposes, which is bad. If there is a 
> crash after initializing col-stats and before applying the actual C11 (in the 
> above context), we might mistakenly roll back the col-stats initialization, yet 
> the table config could still say that col-stats is fully ready to be served. 
> But while reading MDT, we may not read DC11 since it's a failed commit. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5521) Make sure MT Bloom Index partition is covered by tests for Bloom Index

2023-01-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5521:
--
Sprint: 0.13.0 Final Sprint 2

> Make sure MT Bloom Index partition is covered by tests for Bloom Index
> --
>
> Key: HUDI-5521
> URL: https://issues.apache.org/jira/browse/HUDI-5521
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> We need to make sure existing tests for Bloom Index do cover MT Bloom Index 
> pathway



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5423) Flaky test: ColumnStatsTestCase(MERGE_ON_READ,true,true)

2023-01-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5423:
--
Sprint: 0.13.0 Final Sprint  (was: 0.13.0 Final Sprint, 0.13.0 Final Sprint 
2)

> Flaky test: ColumnStatsTestCase(MERGE_ON_READ,true,true)
> 
>
> Key: HUDI-5423
> URL: https://issues.apache.org/jira/browse/HUDI-5423
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Raymond Xu
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> {code}
> [ERROR] Tests run: 94, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 
> 1,729.267 s <<< FAILURE! - in JUnit Vintage
> [ERROR] [8] 
> ColumnStatsTestCase(MERGE_ON_READ,true,true)(testMetadataColumnStatsIndex(ColumnStatsTestCase))
>   Time elapsed: 23.246 s  <<< FAILURE!
> org.opentest4j.AssertionFailedError: 
> expected: 
> <{"c1_maxValue":101,"c1_minValue":101,"c1_nullCount":0,"c2_maxValue":" 
> 999sdc","c2_minValue":" 
> 999sdc","c2_nullCount":0,"c3_maxValue":10.329,"c3_minValue":10.329,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.179Z","c4_minValue":"2021-11-19T07:34:44.179Z","c4_nullCount":0,"c5_maxValue":99,"c5_minValue":99,"c5_nullCount":0,"c6_maxValue":"2020-03-28","c6_minValue":"2020-03-28","c6_nullCount":0,"c7_maxValue":"SA==","c7_minValue":"SA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":1}
> {"c1_maxValue":562,"c1_minValue":323,"c1_nullCount":0,"c2_maxValue":" 
> 984sdc","c2_minValue":" 
> 980sdc","c2_nullCount":0,"c3_maxValue":977.328,"c3_minValue":64.768,"c3_nullCount":1,"c4_maxValue":"2021-11-19T07:34:44.201Z","c4_minValue":"2021-11-19T07:34:44.181Z","c4_nullCount":0,"c5_maxValue":78,"c5_minValue":34,"c5_nullCount":0,"c6_maxValue":"2020-10-21","c6_minValue":"2020-01-15","c6_nullCount":0,"c7_maxValue":"SA==","c7_minValue":"qw==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":4}
> {"c1_maxValue":568,"c1_minValue":8,"c1_nullCount":0,"c2_maxValue":" 
> 8sdc","c2_minValue":" 
> 111sdc","c2_nullCount":0,"c3_maxValue":979.272,"c3_minValue":82.111,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.193Z","c4_minValue":"2021-11-19T07:34:44.159Z","c4_nullCount":0,"c5_maxValue":58,"c5_minValue":2,"c5_nullCount":0,"c6_maxValue":"2020-11-08","c6_minValue":"2020-01-01","c6_nullCount":0,"c7_maxValue":"9g==","c7_minValue":"Ag==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":15}
> {"c1_maxValue":619,"c1_minValue":619,"c1_nullCount":0,"c2_maxValue":" 
> 985sdc","c2_minValue":" 
> 985sdc","c2_nullCount":0,"c3_maxValue":230.320,"c3_minValue":230.320,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.180Z","c4_minValue":"2021-11-19T07:34:44.180Z","c4_nullCount":0,"c5_maxValue":33,"c5_minValue":33,"c5_nullCount":0,"c6_maxValue":"2020-02-13","c6_minValue":"2020-02-13","c6_nullCount":0,"c7_maxValue":"QA==","c7_minValue":"QA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":1}
> {"c1_maxValue":633,"c1_minValue":624,"c1_nullCount":0,"c2_maxValue":" 
> 987sdc","c2_minValue":" 
> 986sdc","c2_nullCount":0,"c3_maxValue":580.317,"c3_minValue":375.308,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.180Z","c4_minValue":"2021-11-19T07:34:44.180Z","c4_nullCount":0,"c5_maxValue":33,"c5_minValue":32,"c5_nullCount":0,"c6_maxValue":"2020-10-10","c6_minValue":"2020-01-01","c6_nullCount":0,"c7_maxValue":"PQ==","c7_minValue":"NA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":2}
> {"c1_maxValue":639,"c1_minValue":555,"c1_nullCount":0,"c2_maxValue":" 
> 989sdc","c2_minValue":" 
> 982sdc","c2_nullCount":0,"c3_maxValue":904.304,"c3_minValue":153.431,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.186Z","c4_minValue":"2021-11-19T07:34:44.179Z","c4_nullCount":0,"c5_maxValue":44,"c5_minValue":31,"c5_nullCount":0,"c6_maxValue":"2020-08-25","c6_minValue":"2020-03-12","c6_nullCount":0,"c7_maxValue":"MA==","c7_minValue":"rw==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":3}
> {"c1_maxValue":715,"c1_minValue":76,"c1_nullCount":0,"c2_maxValue":" 
> 76sdc","c2_minValue":" 
> 224sdc","c2_nullCount":0,"c3_maxValue":958.579,"c3_minValue":246.427,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.199Z","c4_minValue":"2021-11-19T07:34:44.166Z","c4_nullCount":0,"c5_maxValue":73,"c5_minValue":9,"c5_nullCount":0,"c6_maxValue":"2020-11-21","c6_minValue":"2020-01-16","c6_nullCount":0,"c7_maxValue":"+g==","c7_minValue":"LA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":12}
> {"c1_maxValue":768,"c1_minValue":59,"c1_nullCount":0,"c2_maxValue":" 
> 768sdc","c2_minValue":" 
> 118sdc","c2_nullCount":0,"c3_maxValue":959.131,"c3_minValue":64.768,"c3_nullCount":0,"c4_maxValue"

[GitHub] [hudi] trushev commented on pull request #7626: [HUDI-5516] Reduce memory footprint on workload with thousand active partitions

2023-01-09 Thread GitBox


trushev commented on PR #7626:
URL: https://github.com/apache/hudi/pull/7626#issuecomment-1376652385

   > Nice catch, @trushev , curious why the closed handle is also taking huge 
resource, we may need to figure it out first.
   > 
   > But I still think the change is valid.
   
   Thank you for the review
   
   Here is the create handle layout with the writer nulled out. My bad for 
calling it "a huge object even with released writer"; speaking more precisely, 
it is just bigger than we need: 14 KB versus 522 bytes.
   
   https://user-images.githubusercontent.com/42293632/211450420-b6fd1e92-6552-4b9f-af78-2244fa22d41e.png
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-5521) Make sure MT Bloom Index partition is covered by tests for Bloom Index

2023-01-09 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5521:
-

 Summary: Make sure MT Bloom Index partition is covered by tests 
for Bloom Index
 Key: HUDI-5521
 URL: https://issues.apache.org/jira/browse/HUDI-5521
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.13.0


We need to make sure existing tests for Bloom Index do cover MT Bloom Index 
pathway



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5463) Apply rollback commits from data table as rollbacks in MDT instead of Delta commit

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5463:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Apply rollback commits from data table as rollbacks in MDT instead of Delta 
> commit
> --
>
> Key: HUDI-5463
> URL: https://issues.apache.org/jira/browse/HUDI-5463
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.13.0
>
>
> As of now, any rollback in DT is another DC in MDT. This may not scale for the 
> record level index in MDT, since we have to add 1000s of delete records and 
> finally have to resolve all valid and invalid records. So, it's better to 
> roll back the commit in MDT as well instead of doing a DC. 
>  
> Impact: 
> The record level index is unusable w/o this change. While fixing other rollback 
> related tickets, do consider this as a possible option if it simplifies 
> other fixes. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5276) Hudi getAllQueryPartitionPaths use regular match caused Invalid input path add

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5276:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Hudi getAllQueryPartitionPaths use regular match caused Invalid input path 
> add 
> ---
>
> Key: HUDI-5276
> URL: https://issues.apache.org/jira/browse/HUDI-5276
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: yuehanwang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When we query sql in hive like:
> select mainwaybillno,
> zonecode,
> accountantcode,
> baroprcode,
> opcode,
> row_number() over(PARTITION BY mainwaybillno, zonecode, opcode ORDER BY 
> barscantm) sn from dm_kafka_rdmp_dw.fvp_core_fact_route_op_hudi_op_new_rt 
> WHERE opcode IN ('50') and inc_day='20221120' limit 10;
> In MapReduce Job the config 
> mapreduce.input.fileinputformat.inputdir=hdfs://dw/hive/warehouse/dm/dm_kafka_rdmp_dw/fvp_core_fact_route_op_hudi_op_new/inc_day=20221120/opcode=50
> But this file split 
> hdfs://dw/hive/warehouse/dm/dm_kafka_rdmp_dw/fvp_core_fact_route_op_hudi_op_new/inc_day=20221120/opcode=5000
>  was added to the job.
> This job was failed and throw exception :
> 2022-11-21 18:11:33,895 INFO [IPC Server handler 1 on 45077] 
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from 
> attempt_1668750926041_1011874_m_000110_0: Error: java.lang.RuntimeException: 
> java.lang.IllegalStateException: Invalid input path 
> hdfs://dw/hive/warehouse/dm/dm_kafka_rdmp_dw/fvp_core_fact_route_op_hudi_op_new/inc_day=20221120/opcode=501/.0006-2d6e-4d26-93ea-1026632abb67_20221119235956333.log.1_44-150-2
> at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:169)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.IllegalStateException: Invalid input path 
> hdfs://dw/hive/warehouse/dm/dm_kafka_rdmp_dw/fvp_core_fact_route_op_hudi_op_new/inc_day=20221120/opcode=501/.0006-2d6e-4d26-93ea-1026632abb67_20221119235956333.log.1_44-150-2
> at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:119)
> at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:452)
> at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1106)
> at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:482)
> at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:160)
> ... 8 more
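> 
> A minimal sketch of the suspected issue (illustrative helper names, not Hudi's or Hadoop's actual code): if the split-to-input-directory check matches by plain string prefix, opcode=50 wrongly claims splits under opcode=500/5000, while a component-aware check does not.
> {code:java}
> public class PathPrefixCheck {
>   // Buggy variant: any path that merely starts with the input-dir string is
>   // treated as belonging to it, so ".../opcode=50" matches ".../opcode=5000".
>   static boolean buggyMatch(String inputDir, String splitPath) {
>     return splitPath.startsWith(inputDir);
>   }
> 
>   // Component-aware variant: the input dir must be followed by a path
>   // separator (or match exactly), so only real children of opcode=50 match.
>   static boolean componentAwareMatch(String inputDir, String splitPath) {
>     String dir = inputDir.endsWith("/") ? inputDir : inputDir + "/";
>     return splitPath.equals(inputDir) || splitPath.startsWith(dir);
>   }
> 
>   public static void main(String[] args) {
>     String inputDir = "hdfs://dw/warehouse/inc_day=20221120/opcode=50";
>     String split = "hdfs://dw/warehouse/inc_day=20221120/opcode=5000/f1.log";
>     System.out.println(buggyMatch(inputDir, split));          // true: wrong split added
>     System.out.println(componentAwareMatch(inputDir, split)); // false: split rejected
>   }
> }
> {code}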

[jira] [Updated] (HUDI-5503) Optimize flink table factory option check

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5503:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Optimize flink table factory option check
> -
>
> Key: HUDI-5503
> URL: https://issues.apache.org/jira/browse/HUDI-5503
> Project: Apache Hudi
>  Issue Type: Task
>  Components: flink
>Reporter: HBG
>Assignee: Danny Chen
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> 1. Remove the pk and pre-combine field check for sources and append-mode sinks.
> 2. Fall back to the table config if the pk or pre-combine field is not set.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3601) Support multi-arch builds in docker setup

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3601:
-
Sprint: 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 2022/11/01, 
2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  
(was: 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 
2022/11/29, 2022/12/12, 0.13.0 Final Sprint)

> Support multi-arch builds in docker setup
> -
>
> Key: HUDI-3601
> URL: https://issues.apache.org/jira/browse/HUDI-3601
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: dependencies
>Reporter: Sagar Sumit
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Refer [https://github.com/apache/hudi/issues/4985]
> Essentially, our current docker demo runs for linux/amd64 platform but not 
> for arm64. We should support multi-arch builds in a fully automated manner. 
> Ideally, the setup script would simply accept a platform parameter:
> {code:java}
> docker/setup_demo.sh --platform linux/arm64
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5498) Update docs for reading Hudi tables on Databricks runtime

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5498:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Update docs for reading Hudi tables on Databricks runtime
> -
>
> Key: HUDI-5498
> URL: https://issues.apache.org/jira/browse/HUDI-5498
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> We need to document how users can read Hudi tables on Databricks Spark 
> runtime. 
> Relevant fix: [https://github.com/apache/hudi/pull/7088]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3249) Performance Improvements

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3249:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2  (was: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 
2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint)

> Performance Improvements
> 
>
> Key: HUDI-3249
> URL: https://issues.apache.org/jira/browse/HUDI-3249
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: writer-core
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4937) Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT readers

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4937:
-
Sprint: 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 
2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 2022/10/04, 
2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint)

> Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT 
> readers
> -
>
> Key: HUDI-4937
> URL: https://issues.apache.org/jira/browse/HUDI-4937
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently, `HoodieTable` holds a `HoodieBackedTableMetadata` that is set up 
> not to reuse the actual LogScanner and HFileReader used to read the MT itself.
> This has proven wasteful on a number of occasions already, including 
> (not an exhaustive list):
> https://github.com/apache/hudi/issues/6373
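> 
> A minimal sketch of the reuse idea (hypothetical `reuse` flag and cache; the actual reader types and constructor signatures in Hudi differ): cache the underlying readers per file slice and hand out the cached instance instead of reopening on every MT lookup.
> {code:java}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> 
> class CachingMetadataReaders {
>   private final boolean reuse;
>   private final Map<String, AutoCloseable> readersBySlice = new ConcurrentHashMap<>();
> 
>   CachingMetadataReaders(boolean reuse) {
>     this.reuse = reuse;
>   }
> 
>   AutoCloseable getReader(String fileSliceId) {
>     if (!reuse) {
>       return openReader(fileSliceId); // current behavior: open (and close) per lookup
>     }
>     // reused across lookups; closed once when the metadata reader is closed
>     return readersBySlice.computeIfAbsent(fileSliceId, this::openReader);
>   }
> 
>   private AutoCloseable openReader(String fileSliceId) {
>     return () -> { }; // placeholder for the actual HFile reader / log scanner
>   }
> }
> {code}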



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5499) Make sure CTAS always uses Bulk Insert

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5499:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Make sure CTAS always uses Bulk Insert
> --
>
> Key: HUDI-5499
> URL: https://issues.apache.org/jira/browse/HUDI-5499
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.12.2
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> There's been a 
> [regression|https://github.com/apache/hudi/pull/5178/files#diff-560283e494c8ba8da102fc217a2201220dd4db731ec23d80884e0f001a7cc0bcR117]
>  where we're not propagating configuration properly b/w 
> {{CreateHoodieTableAsSelectCommand}} and {{InsertIntoHoodieTableCommand}}, 
> resulting in CTAS essentially doing a plain insert when it could just do a 
> Bulk Insert.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5465) Fix compaction and rollback handling in MDT for multi-writer scenarios in DT

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5465:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Fix compaction and rollback handling in MDT for multi-writer scenarios in DT
> 
>
> Key: HUDI-5465
> URL: https://issues.apache.org/jira/browse/HUDI-5465
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Let's say c50 is the latest delta commit in MDT and c49 from DT comes through 
> (multi-writer). This triggers compaction in MDT (since, ignoring c49, there are 
> no other pending instants in DT); the new base instant time is c50, and we add 
> 49.deltacommit to MDT. During the process we crash. 
> Rollback for 49 then kicks in on the DT side. When applying the rollback of 49 
> to MDT, we detect that 49 has already been compacted (the last compaction time 
> is 50) and {*}fail the rollback when we try to apply it to MDT{*}.
>  
> We need to fix this entire flow for rollback- and compaction-related 
> multi-writer scenarios. 
>  
> Impact: 
> writes to MDT might fail at some point, and users have to disable MDT to make 
> progress



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5430) Fix multi-writer handling w/ rollback blocks in MOR table (log record reader)

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5430:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Fix multi-writer handling w/ rollback blocks in MOR table (log record reader)
> -
>
> Key: HUDI-5430
> URL: https://issues.apache.org/jira/browse/HUDI-5430
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: sivabalan narayanan
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Sample log blocks and commits
>  
> lb1_c1, lb2_c2, lb3_c3, lb4_c4, lb5_rb_c3, lb6_c5.
>  
> lb3 is expected to be considered invalid, but our current scan does not treat 
> it as invalid. While parsing the rollback block, we only check the immediately 
> preceding log block for a matching timestamp; since it does not match, we treat 
> the lb5_rb block as invalid and move on.
>  
> In the multi-writer case this is definitely feasible, so we should fix it.
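> 
> A minimal sketch of the fix direction (illustrative types, not Hudi's actual reader): first collect the target instants of all rollback blocks, then invalidate every data block carrying such an instant, rather than comparing only against the immediately preceding block.
> {code:java}
> import java.util.ArrayList;
> import java.util.HashSet;
> import java.util.List;
> import java.util.Set;
> 
> class LogBlockScan {
>   record Block(String instant, boolean isRollback, String rollbackTarget) {}
> 
>   // Note: this simplification also drops later blocks re-added with the same
>   // instant time; see HUDI-5432 for the ordering-aware handling needed there.
>   static List<Block> validBlocks(List<Block> blocks) {
>     Set<String> rolledBack = new HashSet<>();
>     for (Block b : blocks) {
>       if (b.isRollback()) {
>         rolledBack.add(b.rollbackTarget()); // remember every rolled-back instant
>       }
>     }
>     List<Block> valid = new ArrayList<>();
>     for (Block b : blocks) {
>       if (!b.isRollback() && !rolledBack.contains(b.instant())) {
>         valid.add(b);
>       }
>     }
>     return valid;
>   }
> 
>   public static void main(String[] args) {
>     List<Block> blocks = List.of(
>         new Block("c1", false, null),
>         new Block("c2", false, null),
>         new Block("c3", false, null),  // lb3: must be invalidated
>         new Block("c4", false, null),  // lb4
>         new Block("rb1", true, "c3"),  // lb5: rollback of c3, not adjacent to lb3
>         new Block("c5", false, null)); // lb6
>     System.out.println(validBlocks(blocks)); // c1, c2, c4, c5 survive
>   }
> }
> {code}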



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5433) Fix the way we deduce the pending instants for MDT writes

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5433:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Fix the way we deduce the pending instants for MDT writes
> -
>
> Key: HUDI-5433
> URL: https://issues.apache.org/jira/browse/HUDI-5433
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> We trigger compaction in MDT only when there are no pending inflights apart 
> from the one that's currently updating the MDT, using the code snippet below. 
>  
> {code:java}
>  List<HoodieInstant> pendingInstants = 
> dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
> .findInstantsBefore(instantTime).getInstants(); {code}
> As you can see, we use "findInstantsBefore", which may not yield the right 
> results at all times.
>  
> So, we need to find all inflight instants and see if there are any except the 
> current commit that's updating the MDT. If there are any, we should defer 
> compaction (see the sketch below).
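> 
> A minimal sketch of that check (illustrative instant model, not Hudi's actual API): look at all inflight/requested instants regardless of timestamp ordering and defer if anything besides the current one is pending.
> {code:java}
> import java.util.List;
> 
> class CompactionGate {
>   record Instant(String timestamp, boolean isCompleted) {}
> 
>   // Defer MDT compaction if any instant other than the one currently
>   // updating the MDT is still pending, even if its timestamp is higher.
>   static boolean shouldDeferCompaction(List<Instant> activeTimeline, String currentInstant) {
>     return activeTimeline.stream()
>         .filter(i -> !i.isCompleted())                         // inflight or requested
>         .anyMatch(i -> !i.timestamp().equals(currentInstant)); // anyone but us
>   }
> }
> {code}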
> Impact:
> writes to MDT might fail if there was any missed inflight that was later 
> rolled back. Users have to disable MDT to make progress.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4613) Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4613:
-
Sprint: 2022/09/05, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  
(was: 2022/09/05, 2022/12/12, 0.13.0 Final Sprint)

> Avoid the use of regex expressions when call hoodieFileGroup#addLogFile 
> function
> 
>
> Key: HUDI-4613
> URL: https://issues.apache.org/jira/browse/HUDI-4613
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: lei w
>Assignee: lei w
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When the number of log files exceeds a certain threshold, constructing the 
> file-system view becomes very time-consuming. The reason is that the 
> LogFileComparator#compare method is called frequently when constructing a 
> file group, and regular expressions are used in this method (a sketch of the 
> fix direction follows the log excerpt below).
> {panel:title=build FileSystemView Log }
>  INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=60801, 
> NumFileGroups=200, FileGroupsCreationTime=34036, StoreTimeTaken=2
> {panel}
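> 
> A minimal sketch of the optimization (illustrative name parsing; Hudi's HoodieLogFile has its own scheme): parse each log-file name once into a comparable key, so compare() does plain field comparisons instead of running regexes.
> {code:java}
> import java.util.Comparator;
> import java.util.List;
> 
> class LogFileOrdering {
>   // Pre-parsed sort key: extracted once per file, not on every comparison.
>   record LogFileKey(String baseInstant, int version, String fileName) {}
> 
>   // Illustrative parse of names like ".<fileId>_<instant>.log.<version>_<token>".
>   static LogFileKey parse(String name) {
>     int underscore = name.indexOf('_');
>     int logIdx = name.indexOf(".log.");
>     String instant = name.substring(underscore + 1, logIdx);
>     String rest = name.substring(logIdx + 5); // "<version>_<token>"
>     int version = Integer.parseInt(rest.substring(0, rest.indexOf('_')));
>     return new LogFileKey(instant, version, name);
>   }
> 
>   static final Comparator<LogFileKey> ORDER =
>       Comparator.comparing(LogFileKey::baseInstant).thenComparingInt(LogFileKey::version);
> 
>   public static void main(String[] args) {
>     List<LogFileKey> keys = new java.util.ArrayList<>(List.of(
>         parse(".f1_20230101.log.2_w1"),
>         parse(".f1_20230101.log.1_w1")));
>     keys.sort(ORDER); // no string parsing inside the comparator
>     keys.forEach(k -> System.out.println(k.fileName()));
>   }
> }
> {code}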



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3775) Allow for offline compaction of MOR tables via spark streaming

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3775:
-
Sprint: 2022/09/05, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 
2022/09/05, 0.13.0 Final Sprint)

> Allow for offline compaction of MOR tables via spark streaming
> --
>
> Key: HUDI-3775
> URL: https://issues.apache.org/jira/browse/HUDI-3775
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, spark
>Reporter: Rajesh
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: easyfix, pull-request-available
> Fix For: 0.13.0
>
> Attachments: impressions.avro, run_stuff.txt, scala_commands.txt
>
>
> Currently there is no way to avoid compaction taking up a lot of resources 
> when run inline or async for MOR tables via Spark Streaming. Delta Streamer 
> has ways to assign resources between ingestion and async compaction but Spark 
> Streaming does not have that option. 
> Introducing a flag to turn off automatic compaction and allowing users to run 
> compaction in a separate process will decouple both concerns.
> This will also allow users to size the cluster just for ingestion and 
> deal with compaction separately without blocking. We will need to look into 
> documenting best practices for running offline compaction.
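> 
> A minimal sketch of the intended flow (config keys are Hudi's documented inline/async compaction switches to the best of my knowledge; verify against your version): turn off automatic compaction on the streaming writer and run compaction in a separate job.
> {code:java}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> 
> public class StreamingWithoutCompaction {
>   public static void start(Dataset<Row> streamingDf, String basePath)
>       throws java.util.concurrent.TimeoutException {
>     streamingDf.writeStream()
>         .format("hudi")
>         .option("hoodie.table.name", "impressions")
>         .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
>         // Disable automatic compaction in the streaming writer; a separate
>         // offline job (e.g. the HoodieCompactor utility) takes care of it.
>         .option("hoodie.compact.inline", "false")
>         .option("hoodie.datasource.compaction.async.enable", "false")
>         .option("checkpointLocation", basePath + "/.checkpoints")
>         .start(basePath);
>   }
> }
> {code}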



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5023) Add new Executor avoiding Queueing in the write-path

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5023:
-
Sprint: 2022/11/15, 2022/11/29, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  
(was: 2022/11/15, 2022/11/29, 0.13.0 Final Sprint)

> Add new Executor avoiding Queueing in the write-path
> 
>
> Key: HUDI-5023
> URL: https://issues.apache.org/jira/browse/HUDI-5023
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> We should evaluate removing _any queueing_ (BoundedInMemoryQueue, 
> DisruptorQueue) on the write path for multiple reasons:
> *It breaks up the vertical chain of transformations applied to data*
> Spark (like other engines) relies on the notion of _iteration_ to vertically 
> compose all transformations applied to a single record, allowing for effective 
> _stream_ processing where all transformations are applied to an _Iterator 
> yielding records_ from the source. That way:
>  # The chain of transformations* is applied to every record one by one, 
> effectively limiting the amount of memory used to the number of records being 
> read and processed simultaneously (if the reading is not batched, it'd be 
> just a single record), which in turn allows us
>  # To limit the number of memory allocations required to process a single record. 
> Consider the opposite: if we did it breadth-wise, applying the first 
> transformation to _all_ of the records, we would have to store all of the 
> transformed records in memory, which is costly from both GC-overhead and 
> pure object-churn perspectives.
>  
> Enqueueing essentially violates both of these invariants, breaking up the 
> {_}stream{_}-like processing model and forcing records to be kept in memory 
> for no good reason.
>  
> * This chain is broken up at shuffling points (the collections of tasks executed 
> b/w these shuffling points are called stages in Spark)
>  
> *It requires data to be allocated on the heap*
> As called out in the previous paragraph, enqueueing raw data read from 
> the source breaks up the _stream_ processing paradigm and forces records to be 
> persisted on the heap.
> Consider the following example: the plain ParquetReader from Spark actually uses 
> a *mutable* `ColumnarBatchRow` providing a Row-based view into the batch of 
> data being read from the file.
> Now, since it's a mutable object we can use it to _iterate_ over all of the 
> records (while doing stream processing), ultimately producing some "output" 
> (written into another file, a shuffle block, etc.), but we +can't keep a 
> reference to it+ (for example, by +enqueueing+ it) since the object is mutable. 
> Instead we are forced to make a *copy* of it, which obviously requires us 
> to allocate it on the heap.
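> 
> A small self-contained illustration of that last point (plain Java, not Spark's actual classes): iterating over a reused mutable row is safe, while buffering it in a queue without copying silently aliases the same object.
> {code:java}
> import java.util.ArrayDeque;
> import java.util.Iterator;
> import java.util.List;
> import java.util.Queue;
> 
> class MutableRowDemo {
>   // Mimics a reader exposing one mutable row view over a columnar batch.
>   static class MutableRow { long value; }
> 
>   public static void main(String[] args) {
>     List<Long> source = List.of(1L, 2L, 3L);
>     MutableRow shared = new MutableRow();
> 
>     // Stream-style: each record is consumed before the row is mutated again.
>     Iterator<Long> it = source.iterator();
>     long sum = 0;
>     while (it.hasNext()) {
>       shared.value = it.next();
>       sum += shared.value; // consume immediately
>     }
>     System.out.println(sum); // 6
> 
>     // Queue-style: buffering without copying is broken -- every element
>     // aliases the same object; a correct version must copy (heap churn).
>     Queue<MutableRow> queue = new ArrayDeque<>();
>     for (long v : source) {
>       shared.value = v;
>       queue.add(shared);
>     }
>     queue.forEach(r -> System.out.print(r.value + " ")); // prints "3 3 3"
>   }
> }
> {code}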



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5520) Fail MDT when list of log files grow > 1000

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5520:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Fail MDT when list of log files grow > 1000
> ---
>
> Key: HUDI-5520
> URL: https://issues.apache.org/jira/browse/HUDI-5520
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Blocker
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5349) Clean up partially failed restore if any

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5349:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Clean up partially failed restore if any
> 
>
> Key: HUDI-5349
> URL: https://issues.apache.org/jira/browse/HUDI-5349
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> If a "restore" operation was attempted on a table and failed mid-way, the 
> partial restore could still be lying around. When re-attempted, a new instant time 
> is allotted and the restore starts from scratch, but this may thwart 
> compaction progression in MDT. So we need to ensure that for a given savepoint we 
> always re-use the existing restore instant, if any. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-83) Map Timestamp type in spark to corresponding Timestamp type in Hive during Hive sync

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-83?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-83:
---
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Map Timestamp type in spark to corresponding Timestamp type in Hive during 
> Hive sync
> 
>
> Key: HUDI-83
> URL: https://issues.apache.org/jira/browse/HUDI-83
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive, meta-sync, Usability
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: cdmikechen
>Priority: Blocker
>  Labels: pull-request-available, query-eng, sev:critical, 
> user-support-issues
> Fix For: 0.13.0
>
>
> [https://github.com/apache/incubator-hudi/issues/543] and related issues 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5407) Rollbacks in MDT is not effective

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5407:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Rollbacks in MDT is not effective
> -
>
> Key: HUDI-5407
> URL: https://issues.apache.org/jira/browse/HUDI-5407
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Under rare conditions, rollbacks in MDT are not effective. Apparently, we have 
> set the cleaning policy to lazy, hence rollbacks happen only when the cleaner 
> kicks in and not when we start a new commit. Given MDT is a single-writer 
> table, rollback blocks are effective only when the commit to roll back is 
> immediately prior to the rollback block. 
>  
> Scenarios where this could fail w/ inline compaction. 
>  
> {code:java}
> Data table timeline
> t1.dc   t2.comp.req.  |Crash   t3.dc   t2.comp.inflight   t2.commit
> MDT timeline
> t1.dc.  t2.comp.inflight |Crash  t3.dc  t4.rb(t2)   t2.dc
> {code}
>  
> The first attempt of t2 in MDT should be rolled back since it crashed 
> mid-way. In other words, any log blocks written by t2 in MDT 
> should be deemed invalid. 
>  
> But here is how the log blocks are actually laid out: 
> log1(t1).  log2(t2 first attempt) crash log3 (t3) log4(t4.rb rolling back 
> t2) ... log5 (t2)
>  
> So, when we read the log blocks via AbstractLogRecordReader, ideally we want 
> to ignore log2. But when we encounter the rollback block log4, we only 
> check the previous log block for a matching commit to roll back. Since it does 
> not match t2, we assume log4 is a duplicate rollback and hence still deem 
> log2 a valid log block. 
> Hence MDT could serve extra data files which are not valid from an FS-based 
> listing standpoint. 
>  
> Impact:
> Log blocks that should be ignored are considered valid without this fix. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5401) Hivemetastore URI set in hudi conf not respected.

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5401:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Hivemetastore URI set in hudi conf not respected.
> -
>
> Key: HUDI-5401
> URL: https://issues.apache.org/jira/browse/HUDI-5401
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vamshi Gudavarthi
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When the Hive metastore URI is set as a hoodie config but not as a hadoop-hive 
> conf, sync happens to a local metastore rather than the actual Hive metastore URI.
>  
> The problem is here: 
> (https://github.com/apache/hudi/blob/master/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java#L81)
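> 
> A minimal sketch of the kind of fix implied above (hypothetical helper; the actual change belongs in HMSDDLExecutor): copy the hoodie-level metastore URI into the Hadoop/Hive configuration before the metastore client is created, so it cannot silently fall back to a local metastore.
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> 
> class MetastoreUriPropagation {
>   static Configuration withMetastoreUri(Configuration hadoopConf, String hoodieMetastoreUris) {
>     Configuration conf = new Configuration(hadoopConf);
>     if (hoodieMetastoreUris != null && !hoodieMetastoreUris.isEmpty()) {
>       // standard HMS client key; hoodie's value wins over a missing/empty one
>       conf.set("hive.metastore.uris", hoodieMetastoreUris);
>     }
>     return conf;
>   }
> }
> {code}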



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5475) not able to generate utilities-slim bundle dependency tree

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5475:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> not able to generate utilities-slim bundle dependency tree
> --
>
> Key: HUDI-5475
> URL: https://issues.apache.org/jira/browse/HUDI-5475
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies
>Reporter: Raymond Xu
>Assignee: Lokesh Jain
>Priority: Blocker
>  Labels: pull-request-available
>
> run command
> {code:bash}
> mvn com.github.ferstl:depgraph-maven-plugin:4.0.2:for-artifact \
>   -DgraphFormat=text -DshowGroupIds=true -DshowVersions=true 
> -DrepeatTransitiveDependenciesInTextGraph \
>   -DgroupId=org.apache.hudi -DartifactId=hudi-utilities-slim-bundle_2.12 
> -Dversion=0.12.1
> {code}
> No tree is printed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5432) Fix adding back a log block w/ same commit time as previously rolled back one

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5432:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Fix adding back a log block w/ same commit time as previously rolled back one
> -
>
> Key: HUDI-5432
> URL: https://issues.apache.org/jira/browse/HUDI-5432
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Sample scenario:
>  
> lb1_c1, lb2_c2, lb3_c3, lb4_rb_c3, lb5_v4, lb6_c3.
>  
> Here we are adding commit c3 and then rolling back c3 (i.e. lb3 should be 
> considered invalid).
> After some time, we add another valid log block with the same c3 commit time. 
> So, we should ensure lb6 is valid here and only lb3 is invalid.
>  
> Especially with repeated attempts of compaction/clustering in DT, this is very 
> likely to happen in MDT while trying to update it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5464) Fix instantiation of a new partition in MDT re-using the same instant time as a regular commit

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5464:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Fix instantiation of a new partition in MDT re-using the same instant time as 
> a regular commit
> --
>
> Key: HUDI-5464
> URL: https://issues.apache.org/jira/browse/HUDI-5464
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> We re-use the same instant time as the commit being applied to MDT while 
> instantiating a new partition in MDT. This needs to be fixed. 
>  
> For example, let's say we have 10 commits with FILES already enabled, and 
> for C11 we are enabling col-stats. 
> After the data table business, when we enter metadata writer instantiation, we 
> deduce that col-stats has to be instantiated and then instantiate it using DC11. 
> In the MDT timeline, we see dc11.req, dc11.inflight and dc11.complete. Then 
> we go ahead and apply the actual C11 from DT to MDT (dc11.inflight and 
> dc11.complete are updated). Here, we overwrite the same DC11 with records 
> pertaining to C11, 
> which is buggy. We definitely need to fix this. 
> We can add a suffix to C11 (say C11_003 or C11_001), as we do for compaction 
> and clean in MDT, so that any additional operation in MDT has a different commit 
> time format. For everything else, it should match DT one to one. 
>  
>  
> Impact:
> We are overwriting the same DC for two purposes, which is bad. If there is a 
> crash after initializing col-stats and before applying the actual C11 (in the 
> above context), we might mistakenly roll back the col-stats initialization, while 
> the table config could still say that col-stats is fully ready to be served. But 
> while reading MDT, we may not read DC11 since it's a failed commit. 
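> 
> A minimal sketch of the suffixing idea (the suffix value is illustrative; Hudi already uses similar fixed suffixes for MDT-internal compaction/clean instants):
> {code:java}
> class MdtInstantTimes {
>   // MDT-internal operations append a fixed suffix to the DT instant, so the
>   // partition initialization under C11 can never collide with the delta
>   // commit that applies C11 itself.
>   static final String PARTITION_INIT_SUFFIX = "001"; // illustrative value
> 
>   static String partitionInitInstant(String dataTableInstant) {
>     return dataTableInstant + PARTITION_INIT_SUFFIX; // e.g. "C11" -> "C11001"
>   }
> }
> {code}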



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5408) Partially failed commits in MDT have to be rolled back in all cases

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5408:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Partially failed commits in MDT have to be rolled back in all cases
> ---
>
> Key: HUDI-5408
> URL: https://issues.apache.org/jira/browse/HUDI-5408
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When compaction fails after completing in MDT but before completing in DT, 
> and we later re-attempt to apply the same compaction instant to MDT, we 
> might miss rolling back any partially failed commit in MDT. 
> Code of interest in SparkHoodieBackedTableMetadataWriter:
> {code:java}
> if (!metadataMetaClient.getActiveTimeline().containsInstant(instantTime)) {
>   // if this is a new commit being applied to metadata for the first time
>   writeClient.startCommitWithTime(instantTime);
> } else {
>   Option<HoodieInstant> alreadyCompletedInstant = 
> metadataMetaClient.getActiveTimeline().filterCompletedInstants().filter(entry 
> -> entry.getTimestamp().equals(instantTime)).lastInstant();
>   if (alreadyCompletedInstant.isPresent()) {
> // this code path refers to a re-attempted commit that got committed to 
> metadata table, but failed in datatable.
> // for eg, lets say compaction c1 on 1st attempt succeeded in metadata 
> table and failed before committing to datatable.
> // when retried again, data table will first rollback pending compaction. 
> these will be applied to metadata table, but all changes
> // are upserts to metadata table and so only a new delta commit will be 
> created.
> // once rollback is complete, compaction will be retried again, which 
> will eventually hit this code block where the respective commit is
> // already part of completed commit. So, we have to manually remove the 
> completed instant and proceed.
> // and it is for the same reason we enabled 
> withAllowMultiWriteOnSameInstant for metadata table.
> HoodieActiveTimeline.deleteInstantFile(metadataMetaClient.getFs(), 
> metadataMetaClient.getMetaPath(), alreadyCompletedInstant.get());
> metadataMetaClient.reloadActiveTimeline();
>   }
>   // If the alreadyCompletedInstant is empty, that means there is a requested 
> or inflight
>   // instant with the same instant time.  This happens for data table clean 
> action which
>   // reuses the same instant time without rollback first.  It is a no-op here 
> as the
>   // clean plan is the same, so we don't need to delete the requested and 
> inflight instant
>   // files in the active timeline.
> } {code}
> In the else block, if there happens to be a partially failed commit in MDT, 
> we may miss rolling it back. 
> We might need to fix the flow. 
>  
> Important to consider: 
> even before attempting compaction, we should ensure there are no partially 
> failed commits in MDT. If not, we need to ensure we consider the list of valid 
> instants while executing the compaction. 
>  
> Impact:
> Some invalid data blocks will be considered valid since we fail to do eager 
> rollbacks. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5485) Improve performance of savepoint with MDT

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5485:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Improve performance of savepoint with MDT
> -
>
> Key: HUDI-5485
> URL: https://issues.apache.org/jira/browse/HUDI-5485
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> [https://github.com/apache/hudi/issues/7541]
> When the metadata table is enabled, the savepoint operation is slow for a large 
> number of partitions (e.g., 75k). The root cause is that the metadata table is 
> scanned once per partition, which is unnecessary.
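> 
> A minimal sketch of the fix direction (assuming a batched listing API along the lines of Hudi's getAllFilesInPartitions; exact signatures may differ): fetch the listings for all savepointed partitions in one metadata scan instead of one scan per partition.
> {code:java}
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
> 
> class SavepointListing {
>   interface TableMetadata {
>     List<String> getAllFilesInPartition(String partitionPath);
>     Map<String, List<String>> getAllFilesInPartitions(List<String> partitionPaths);
>   }
> 
>   // Slow: one metadata-table scan per partition (75k partitions -> 75k scans).
>   static Map<String, List<String>> perPartition(TableMetadata mdt, List<String> partitions) {
>     Map<String, List<String>> files = new HashMap<>();
>     for (String p : partitions) {
>       files.put(p, mdt.getAllFilesInPartition(p));
>     }
>     return files;
>   }
> 
>   // Fast: a single batched lookup scanning the metadata table once.
>   static Map<String, List<String>> batched(TableMetadata mdt, List<String> partitions) {
>     return mdt.getAllFilesInPartitions(partitions);
>   }
> }
> {code}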



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3967) Automatic savepoint in Hudi

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3967:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2  (was: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 
2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint)

> Automatic savepoint in Hudi
> ---
>
> Key: HUDI-3967
> URL: https://issues.apache.org/jira/browse/HUDI-3967
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: table-service
>Reporter: Raymond Xu
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5319) NPE in Bloom Filter Index

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5319:
-
Sprint: 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 
2022/12/12, 0.13.0 Final Sprint)

> NPE in Bloom Filter Index
> -
>
> Key: HUDI-5319
> URL: https://issues.apache.org/jira/browse/HUDI-5319
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> {code:java}
> /12/02 11:05:49 WARN TaskSetManager: Lost task 3.0 in stage 1098.0 (TID 
> 1300185) (ip-172-31-23-246.us-east-2.compute.internal executor 10): 
> java.lang.RuntimeException: org.apache.hudi.exception.HoodieIndexException: 
> Error checking bloom filter index.
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>         at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>         at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
>         at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:183)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>         at org.apache.spark.scheduler.Task.run(Task.scala:138)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.hudi.exception.HoodieIndexException: Error checking 
> bloom filter index.
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:110)
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:60)
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>         ... 16 more
> Caused by: java.lang.NullPointerException
>         at 
> org.apache.hudi.io.HoodieKeyLookupHandle.addKey(HoodieKeyLookupHandle.java:87)
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:92)
>         ... 18 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3529) Improve dependency management and bundling

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3529:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2  (was: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 
2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint)

> Improve dependency management and bundling
> --
>
> Key: HUDI-3529
> URL: https://issues.apache.org/jira/browse/HUDI-3529
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: dependencies
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5442) Fix HiveHoodieTableFileIndex to use lazy listing

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5442:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Fix HiveHoodieTableFileIndex to use lazy listing
> 
>
> Key: HUDI-5442
> URL: https://issues.apache.org/jira/browse/HUDI-5442
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, trino-presto
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.13.0
>
>
> Currently, HiveHoodieTableFileIndex hard-codes the shouldListLazily to false, 
> using eager listing only.  This leads to scanning all table partitions in the 
> file index, regardless of the queryPaths provided (for Trino Hive connector, 
> only one partition is passed in).
> {code:java}
> public HiveHoodieTableFileIndex(HoodieEngineContext engineContext,
> HoodieTableMetaClient metaClient,
> TypedProperties configProperties,
> HoodieTableQueryType queryType,
> List<Path> queryPaths,
> Option<String> specifiedQueryInstant,
> boolean shouldIncludePendingCommits
> ) {
>   super(engineContext,
>   metaClient,
>   configProperties,
>   queryType,
>   queryPaths,
>   specifiedQueryInstant,
>   shouldIncludePendingCommits,
>   true,
>   new NoopCache(),
>   false);
> } {code}
> After flipping it to true for testing, the following exception is thrown.
> {code:java}
> io.trino.spi.TrinoException: Failed to parse partition column values from the 
> partition-path: likely non-encoded slashes being used in partition column's 
> values. You can try to work this around by switching listing mode to eager
>     at 
> io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:284)
>     at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
>     at io.trino.$gen.Trino_39220221217_092723_2.run(Unknown Source)
>     at 
> io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>     at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: org.apache.hudi.exception.HoodieException: Failed to parse 
> partition column values from the partition-path: likely non-encoded slashes 
> being used in partition column's values. You can try to work this around by 
> switching listing mode to eager
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.parsePartitionColumnValues(BaseHoodieTableFileIndex.java:317)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.lambda$listPartitionPaths$6(BaseHoodieTableFileIndex.java:288)
>     at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
>     at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
>     at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
>     at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
>     at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
>     at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPaths(BaseHoodieTableFileIndex.java:291)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:205)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllInputFileSlices(BaseHoodieTableFileIndex.java:216)
>     at 
> org.apache.hudi.hadoop.HiveHoodieTableFileIndex.listFileSlices(HiveHoodieTableFileIndex.java:71)
>     at 
> org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:263)
>     at 
> org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:158)
>     at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
>     at 
> org.apache.hudi.hadoop.HoodieParquetInputFormatBase.getSplits(HoodieParquetInputFormatBase.java:68)
>     at 
> io.trino.plugin.hive.BackgroundHiveSplitLoader.lambda$loadPartition$2(BackgroundHiveSplitLoader.java:493)
>     at 
> io.trino.plugin.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:25)
>     at io.trino.plugin.hive.Hdfs

[jira] [Updated] (HUDI-5075) Add support to rollback residual clustering after disabling clustering

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5075:
-
Sprint: 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 
Final Sprint, 0.13.0 Final Sprint 2  (was: 2022/10/18, 2022/11/01, 2022/11/15, 
2022/11/29, 2022/12/12, 0.13.0 Final Sprint)

> Add support to rollback residual clustering after disabling clustering
> --
>
> Key: HUDI-5075
> URL: https://issues.apache.org/jira/browse/HUDI-5075
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: clustering
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> If a user enabled clustering and after some time disabled it for whatever 
> reason, there is a chance that a pending clustering is left in the 
> timeline. Once clustering is disabled, this could just be lying around, 
> but it could affect metadata table compaction, which in turn might affect 
> data table archival. 
> So, we need a way to fix this. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1574) Trim existing unit tests to finish in much shorter amount of time

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1574:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2  (was: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 
2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint)

> Trim existing unit tests to finish in much shorter amount of time
> -
>
> Key: HUDI-1574
> URL: https://issues.apache.org/jira/browse/HUDI-1574
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: Testing, tests-ci
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Critical
> Fix For: 0.13.0
>
>
> spark-client-tests
> 278.165 s - in org.apache.hudi.table.TestHoodieMergeOnReadTable
> 201.628 s - in org.apache.hudi.metadata.TestHoodieBackedMetadata
> 185.716 s - in org.apache.hudi.client.TestHoodieClientOnCopyOnWriteStorage
> 158.361 s - in org.apache.hudi.index.TestHoodieIndex
> 156.196 s - in org.apache.hudi.table.TestCleaner
> 132.369 s - in 
> org.apache.hudi.table.action.commit.TestCopyOnWriteActionExecutor
> 93.307 s - in org.apache.hudi.table.action.compact.TestAsyncCompaction
> 67.301 s - in org.apache.hudi.table.upgrade.TestUpgradeDowngrade
> 45.794 s - in org.apache.hudi.client.TestHoodieReadClient
> 38.615 s - in org.apache.hudi.index.bloom.TestHoodieBloomIndex
> 31.181 s - in org.apache.hudi.client.TestTableSchemaEvolution
> 20.072 s - in org.apache.hudi.table.action.compact.TestInlineCompaction
> grep " Time elapsed" hudi-client/hudi-spark-client/target/surefire-reports/* 
> | awk -F',' ' { print $5 } ' | awk -F':' ' { print $2 } ' | sort -nr | less
> hudi-utilities
> 209.936 s - in org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer
> 204.653 s - in 
> org.apache.hudi.utilities.functional.TestHoodieMultiTableDeltaStreamer
> 34.116 s - in org.apache.hudi.utilities.sources.TestKafkaSource
> 29.865 s - in org.apache.hudi.utilities.sources.TestParquetDFSSource
> 26.189 s - in 
> org.apache.hudi.utilities.sources.helpers.TestDatePartitionPathSelector
> Other Tests
> 42.595 s - in org.apache.hudi.common.functional.TestHoodieLogFormat
> 38.918 s - in org.apache.hudi.common.bootstrap.TestBootstrapIndex
> 22.046 s - in 
> org.apache.hudi.common.functional.TestHoodieLogFormatAppendFailure



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5352) Jackson fails to serialize LocalDate when updating Delta Commit metadata

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5352:
-
Sprint: 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 
2022/12/12, 0.13.0 Final Sprint)

> Jackson fails to serialize LocalDate when updating Delta Commit metadata
> 
>
> Key: HUDI-5352
> URL: https://issues.apache.org/jira/browse/HUDI-5352
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Alexey Kudinkin
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently, running TestColumnStatsIndex on Spark 3.3 fails the MOR tests due 
> to Jackson not being able to serialize LocalDate as is, requiring an 
> additional JSR310 dependency.
>  
>  
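> For reference, the standard remedy is to put the jackson-datatype-jsr310 module on the classpath and register it with the mapper, roughly:
> {code:java}
> import com.fasterxml.jackson.databind.ObjectMapper;
> import com.fasterxml.jackson.databind.SerializationFeature;
> import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
> import java.time.LocalDate;
> 
> public class LocalDateJson {
>   public static void main(String[] args) throws Exception {
>     ObjectMapper mapper = new ObjectMapper()
>         .registerModule(new JavaTimeModule())                     // JSR-310 types
>         .disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS); // "2023-01-09", not [2023,1,9]
>     System.out.println(mapper.writeValueAsString(LocalDate.of(2023, 1, 9)));
>   }
> }
> {code}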



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5434) Fix archival in MDT to not rely on rollbacks/clean in DT

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5434:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Fix archival in MDT to not rely on rollbacks/clean in DT
> 
>
> Key: HUDI-5434
> URL: https://issues.apache.org/jira/browse/HUDI-5434
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> As of now, archival in MDT is guarded by the first entry in DT's active 
> timeline, but DT could contain a rollback dating back a few days or even 
> weeks. So we need to fix that to check for the first write action in DT (commit, 
> delta commit, replace commit) and guard MDT archival based on that. 
>  
> Impact:
> Could result in a huge number of entries in the MDT active timeline, which might 
> hamper performance or cause throttling in cloud stores.
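> 
> A minimal sketch of the proposed guard (illustrative timeline model, not Hudi's actual API): anchor MDT archival on the earliest write instant in DT, skipping rollbacks and cleans that may date back weeks.
> {code:java}
> import java.util.List;
> import java.util.Optional;
> import java.util.Set;
> 
> class MdtArchivalGuard {
>   record Instant(String action, String timestamp) {}
> 
>   static final Set<String> WRITE_ACTIONS = Set.of("commit", "deltacommit", "replacecommit");
> 
>   // Earliest write instant in the DT active timeline; rollback/clean entries
>   // are ignored so an old rollback cannot block MDT archival indefinitely.
>   static Optional<String> archivalBoundary(List<Instant> dtActiveTimeline) {
>     return dtActiveTimeline.stream()
>         .filter(i -> WRITE_ACTIONS.contains(i.action()))
>         .map(Instant::timestamp)
>         .min(String::compareTo);
>   }
> }
> {code}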



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4586) Address S3 timeouts in Bloom Index with metadata table

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4586:
-
Sprint: 2022/08/08, 2022/08/22, 2022/09/05, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2  (was: 2022/08/08, 2022/08/22, 2022/09/05, 0.13.0 Final Sprint)

> Address S3 timeouts in Bloom Index with metadata table
> --
>
> Key: HUDI-4586
> URL: https://issues.apache.org/jira/browse/HUDI-4586
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-08-15 at 17.39.01.png
>
>
> For a partitioned table, a significant number of S3 requests time out, 
> causing upserts to fail when using the Bloom Index with the metadata table.
> {code:java}
> Load meta index key ranges for file slices: hudi
> collect at HoodieSparkEngineContext.java:137+details
> org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
> org.apache.hudi.client.common.HoodieSparkEngineContext.flatMap(HoodieSparkEngineContext.java:137)
> org.apache.hudi.index.bloom.HoodieBloomIndex.loadColumnRangesFromMetaIndex(HoodieBloomIndex.java:213)
> org.apache.hudi.index.bloom.HoodieBloomIndex.getBloomIndexFileInfoForPartitions(HoodieBloomIndex.java:145)
> org.apache.hudi.index.bloom.HoodieBloomIndex.lookupIndex(HoodieBloomIndex.java:123)
> org.apache.hudi.index.bloom.HoodieBloomIndex.tagLocation(HoodieBloomIndex.java:89)
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:49)
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:32)
> org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:53)
> org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:45)
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:113)
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:97)
> org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:155)
> org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:329)
> org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:183)
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>  {code}
> {code:java}
> org.apache.hudi.exception.HoodieException: Exception when reading log file 
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:352)
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:196)
>     at 
> org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.getRecordsByKeys(HoodieMetadataMergedLogRecordReader.java:124)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.readLogRecords(HoodieBackedTableMetadata.java:266)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$getRecordsByKeys$1(HoodieBackedTableMetadata.java:222)
>     at java.util.HashMap.forEach(HashMap.java:1290)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:209)
>     at 
> org.apache.hudi.metadata.BaseTableMetadata.getColumnStats(BaseTableMetadata.java:253)
>     at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadColumnRangesFromMetaIndex$cc8e7ca2$1(HoodieBloomIndex.java:224)
>     at 
> org.apache.hudi.client.common.HoodieSparkEngineContext.lambda$flatMap$7d470b86$1(HoodieSparkEngineContext.java:137)
>     at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
>     at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>     at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuf

[jira] [Updated] (HUDI-5443) Fix exception when querying MOR table after applying NestedSchemaPruning optimization

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5443:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Fix exception when querying MOR table after applying NestedSchemaPruning 
> optimization
> -
>
> Key: HUDI-5443
> URL: https://issues.apache.org/jira/browse/HUDI-5443
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
>
> This was discovered while working on HUDI-5384.
> After NestedSchemaPruning has been applied successfully, reading from a MOR 
> table could encounter the following exception when the actual delta-log file 
> merging is performed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5392) Fix Bootstrap files reader to configure arrays to be read in the new format

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5392:
-
Sprint: 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 
2022/12/12, 0.13.0 Final Sprint)

> Fix Bootstrap files reader to configure arrays to be read in the new format
> ---
>
> Key: HUDI-5392
> URL: https://issues.apache.org/jira/browse/HUDI-5392
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When writing the bootstrap file we're using the Spark writer, which writes arrays in 
> the new format, while Hudi reads them in the old (Avro-compatible) format:
> {code:java}
>  // Old
>  optional group tip_history (LIST) {
> repeated group array {
>   optional double amount;
>   optional binary currency (UTF8);
> }
>   }
>  // new
>  optional group tip_history (LIST) {
> repeated group list {
>   optional group element {
> optional double amount;
> optional binary currency (UTF8);
>   }
> }
>   } {code}
>  
> To fix that we need to make sure that bootstrap files are *always* read in the 
> new format (Spark's default), unlike Hudi's Parquet files.
> We also need to fix TestDataSourceForBootstrap, as it currently doesn't 
> actually assert that the records are written correctly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3673) Add a common hudi-hbase-shaded for shaded hbase dependencies

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3673:
-
Sprint: 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  
(was: 2022/11/29, 2022/12/12, 0.13.0 Final Sprint)

> Add a common hudi-hbase-shaded for shaded hbase dependencies
> 
>
> Key: HUDI-3673
> URL: https://issues.apache.org/jira/browse/HUDI-3673
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> [https://github.com/apache/hudi/pull/5004]
> A follow-up of HUDI-1180.  Right now, the shading rules for HBase-related 
> dependencies are repeated in different bundles.  We can extract the common 
> shading rules and wrap them in a common hudi-hbase-shaded module.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5384) Make sure predicates are appropriately pushed down to HoodieFileIndex when lazy listing

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5384:
-
Sprint: 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 
2022/12/12, 0.13.0 Final Sprint)

> Make sure predicates are appropriately pushed down to HoodieFileIndex when 
> lazy listing
> ---
>
> Key: HUDI-5384
> URL: https://issues.apache.org/jira/browse/HUDI-5384
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Affects Versions: 0.13.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> The introduction of the lazy-listing capability in HUDI-4812 exposed an 
> issue in Spark's design, where predicates are pushed down into generic 
> FileIndex implementations only during the execution phase.
> This poses the following issues:
>  # HoodieFileIndex isn't listing the table until the `listFiles` method is invoked
>  # Listing is actually performed only during execution, in the 
> `FileSourceScanExec` node
>  # Since listing isn't performed until the actual execution, table statistics 
> are initialized w/ bogus values (of 1 byte), and Cost-Based Optimization 
> (CBO) will take incorrect decisions based on them



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4911) Make sure LogRecordReader doesn't flush the cache before each lookup

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4911:
-
Sprint: 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2  (was: 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint)

> Make sure LogRecordReader doesn't flush the cache before each lookup
> 
>
> Key: HUDI-4911
> URL: https://issues.apache.org/jira/browse/HUDI-4911
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently {{HoodieMetadataMergedLogRecordReader}} will flush its internal record 
> cache before each lookup, which makes every lookup essentially re-process 
> the whole log-blocks stack again.
> We should avoid that and only do the re-parsing incrementally (for the keys 
> that aren't already cached), as in the sketch below.
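>
> A hypothetical sketch of that behaviour (names are illustrative, not the 
> actual Hudi API): consult the cache first and scan the log blocks only for 
> the keys that are still missing:
> {code:java}
> import java.util.ArrayList;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
> import java.util.function.Function;
>
> // Illustrative only: a cache-aware lookup that re-parses log blocks just
> // for the keys that are not cached yet, instead of flushing the cache.
> class CacheAwareLookup<R> {
>   private final Map<String, R> cache = new HashMap<>();
>
>   Map<String, R> lookup(List<String> keys, Function<List<String>, Map<String, R>> scanLogBlocksForKeys) {
>     List<String> missing = new ArrayList<>();
>     for (String k : keys) {
>       if (!cache.containsKey(k)) {
>         missing.add(k);
>       }
>     }
>     if (!missing.isEmpty()) {
>       // Incremental re-parse: only the uncached keys hit the log blocks.
>       cache.putAll(scanLogBlocksForKeys.apply(missing));
>     }
>     Map<String, R> result = new HashMap<>();
>     for (String k : keys) {
>       if (cache.containsKey(k)) {
>         result.put(k, cache.get(k));
>       }
>     }
>     return result;
>   }
> }
> {code}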



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3636) Clustering fails due to marker creation failure

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3636:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  
(was: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 2022/11/01, 
2022/11/29, 2022/12/12, 0.13.0 Final Sprint)

> Clustering fails due to marker creation failure
> ---
>
> Key: HUDI-3636
> URL: https://issues.apache.org/jira/browse/HUDI-3636
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: multi-writer
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Scenario: multi-writer test, one writer ingesting with Deltastreamer in 
> continuous mode, COW, inserts, async clustering and cleaning (partitions 
> under 2022/1, 2022/2), another writer with Spark datasource doing backfills 
> to different partitions (2021/12).  
> 0.10.0 without MT, clustering instant is inflight (failed in the middle before 
> the upgrade) ➝ 0.11 with MT, with the same multi-writer configuration as before.
> The clustering/replace instant cannot make progress due to marker creation 
> failure, failing the DS ingestion as well.  We need to investigate whether this 
> is related to the timeline-server-based markers or to MT.
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
> stage 46.0 failed 1 times, most recent failure: Lost task 2.0 in stage 46.0 
> (TID 277) (192.168.70.231 executor driver): java.lang.RuntimeException: 
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE
> Connect to localhost:26754 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] 
> failed: Connection refused (Connection refused)
>     at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>     at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>     at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>     at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>     at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
>     at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
>     at scala.collection.AbstractIterator.to(Iterator.scala:1431)
>     at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
>     at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
>     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
>     at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
>     at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
>     at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>     at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE
> Connect to localhost:26754 [

[jira] [Updated] (HUDI-4991) Make sure DeltaStreamer passes SSL key/truststore configs connecting to Schema Registry

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4991:
-
Sprint: 2022/10/04, 2022/10/18, 2022/11/01, 2022/12/12, 0.13.0 Final 
Sprint, 0.13.0 Final Sprint 2  (was: 2022/10/04, 2022/10/18, 2022/11/01, 
2022/12/12, 0.13.0 Final Sprint)

> Make sure DeltaStreamer passes SSL key/truststore configs connecting to 
> Schema Registry
> ---
>
> Key: HUDI-4991
> URL: https://issues.apache.org/jira/browse/HUDI-4991
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Alexey Kudinkin
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Originally reported at:
> [https://github.com/apache/hudi/issues/6842]
>  
> Whenever the Schema Registry setup requires passing keystore/truststore params 
> to access SSL certificates (like below), DeltaStreamer fails:
> {code:java}
> mode.hoodie.deltastreamer.schemaprovider.registry.url=https://schemaregistry.com
> schema.registry.ssl.keystore.location=/artifacts/topics/certs/keystore.jks
> schema.registry.ssl.keystore.password=
> schema.registry.ssl.truststore.location=/artifacts/topics/certs/truststore.jks
> schema.registry.ssl.truststore.password=
> schema.registry.ssl.key.password= {code}
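>
> For illustration, wiring such keystore/truststore params into an SSLContext 
> with plain JSSE looks roughly like the sketch below (paths reuse the ones 
> above, passwords are placeholders; the actual fix is about DeltaStreamer 
> forwarding these properties to the Schema Registry client):
> {code:java}
> import java.io.FileInputStream;
> import java.security.KeyStore;
> import javax.net.ssl.KeyManagerFactory;
> import javax.net.ssl.SSLContext;
> import javax.net.ssl.TrustManagerFactory;
>
> public class RegistrySslContextSketch {
>   public static SSLContext build() throws Exception {
>     char[] keystorePass = "changeit".toCharArray();   // placeholder
>     char[] truststorePass = "changeit".toCharArray(); // placeholder
>     char[] keyPass = "changeit".toCharArray();        // placeholder
>
>     // Trust material: without this, the client fails with the PKIX
>     // "unable to find valid certification path" error in the stacktrace below.
>     KeyStore trustStore = KeyStore.getInstance("JKS");
>     try (FileInputStream in = new FileInputStream("/artifacts/topics/certs/truststore.jks")) {
>       trustStore.load(in, truststorePass);
>     }
>     TrustManagerFactory tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
>     tmf.init(trustStore);
>
>     // Key material, in case the registry requires mutual TLS.
>     KeyStore keyStore = KeyStore.getInstance("JKS");
>     try (FileInputStream in = new FileInputStream("/artifacts/topics/certs/keystore.jks")) {
>       keyStore.load(in, keystorePass);
>     }
>     KeyManagerFactory kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
>     kmf.init(keyStore, keyPass);
>
>     SSLContext ctx = SSLContext.getInstance("TLS");
>     ctx.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
>     return ctx;
>   }
> }
> {code}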
> {code:java}
> at 
> org.apache.hudi.utilities.schema.SchemaRegistryProvider.getSourceSchema(SchemaRegistryProvider.java:109)
> at 
> org.apache.hudi.utilities.schema.SchemaProviderWithPostProcessor.lambda$getSourceSchema$0(SchemaProviderWithPostProcessor.java:41)
> at org.apache.hudi.common.util.Option.map(Option.java:108)
> at 
> org.apache.hudi.utilities.schema.SchemaProviderWithPostProcessor.getSourceSchema(SchemaProviderWithPostProcessor.java:41)
> at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.registerAvroSchemas(DeltaSync.java:839)
> at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.(DeltaSync.java:233)
> at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.(HoodieDeltaStreamer.java:646)
> at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.(HoodieDeltaStreamer.java:142)
> at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.(HoodieDeltaStreamer.java:115)
> at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:549)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1000)
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1089)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1098)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: javax.net.ssl.SSLHandshakeException: PKIX path building failed: 
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
> valid certification path to requested target
> at sun.security.ssl.Alert.createSSLException(Alert.java:131)
> at sun.security.ssl.TransportContext.fatal(TransportContext.java:324)
> at sun.security.ssl.TransportContext.fatal(TransportContext.java:267)
> at sun.security.ssl.TransportContext.fatal(TransportContext.java:262)
> at 
> sun.security.ssl.CertificateMessage$T12CertificateConsumer.checkServerCerts(CertificateMessage.java:654)
> at 
> sun.security.ssl.CertificateMessage$T12CertificateConsumer.onCertificate(CertificateMessage.java:473)
> at 
> sun.security.ssl.CertificateMessage$T12CertificateConsumer.consume(CertificateMessage.java:369)
> at sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:377)
> at sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:444)
> at sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:422)
> at sun.security.ssl.TransportContext.dispatch(TransportContext.java:182)
> at sun.security.ssl.SSLTransport.decode(SSLTransport.java:152)
> at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1397)
> at 
> sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.jav

[jira] [Updated] (HUDI-5321) Fix Bulk Insert ColumnSortPartitioners

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5321:
-
Sprint: 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 
2022/12/12, 0.13.0 Final Sprint)

> Fix Bulk Insert ColumnSortPartitioners
> --
>
> Key: HUDI-5321
> URL: https://issues.apache.org/jira/browse/HUDI-5321
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently, all of the custom Bulk Insert ColumnSortPartitioner impls 
> incorrectly return "true" from the "arePartitionRecordsSorted" method, even 
> though records might not necessarily be sorted by the partition-path columns, 
> as is required by this method.
> In cases when such a Partitioner is used and the data is NOT sorted by a list 
> of columns that starts w/ the partition ones, this could lead to Parquet writers 
> being closed prematurely while writing files, creating a LOT of small files.
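>
> Conceptually, the method should return true only when the sort columns form a 
> prefix that starts with the partition-path columns — a minimal sketch with 
> hypothetical signatures:
> {code:java}
> import java.util.List;
>
> // Illustrative only: records within a write partition can be treated as
> // sorted by partition path only if the user-supplied sort columns begin
> // with the table's partition-path columns.
> static boolean arePartitionRecordsSorted(List<String> sortColumns, List<String> partitionPathColumns) {
>   if (sortColumns.size() < partitionPathColumns.size()) {
>     return false;
>   }
>   return sortColumns.subList(0, partitionPathColumns.size()).equals(partitionPathColumns);
> }
> {code}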



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5160) Spark df saveAsTable failed with CTAS

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5160:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Spark df saveAsTable failed with CTAS
> -
>
> Key: HUDI-5160
> URL: https://issues.apache.org/jira/browse/HUDI-5160
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: 董可伦
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> In version 0.9.0 this worked, but now it fails:
> {code:scala}
> import spark.implicits._
> val partitionValue = "2022-11-05"
> val df = Seq((1, "a1", 10, 1000, partitionValue)).toDF("id", "name", "value", 
> "ts", "dt")
> val tableName = "test_hudi_table"
> // Write a table by spark dataframe.
> df.write.format("hudi")
> .option(HoodieWriteConfig.TBL_NAME.key, tableName)
> .option(TABLE_TYPE.key, MOR_TABLE_TYPE_OPT_VAL)
> // .option(HoodieTableConfig.TYPE.key(), MOR_TABLE_TYPE_OPT_VAL)
> .option(RECORDKEY_FIELD.key, "id")
> .option(PRECOMBINE_FIELD.key, "ts")
> .option(PARTITIONPATH_FIELD.key, "dt")
> .option(KEYGENERATOR_CLASS_NAME.key, classOf[SimpleKeyGenerator].getName)
> .option(HoodieWriteConfig.INSERT_PARALLELISM_VALUE.key, "1")
> .option(HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key, "1")
> .partitionBy("dt")
> .mode(SaveMode.Overwrite)
> .saveAsTable(tableName){code}
>  
> {code:java}
> Can't find primaryKey `uuid` in root
>  |-- _hoodie_commit_time: string (nullable = true)
>  |-- _hoodie_commit_seqno: string (nullable = true)
>  |-- _hoodie_record_key: string (nullable = true)
>  |-- _hoodie_partition_path: string (nullable = true)
>  |-- _hoodie_file_name: string (nullable = true)
>  |-- id: integer (nullable = false)
>  |-- name: string (nullable = true)
>  |-- value: integer (nullable = false)
>  |-- ts: integer (nullable = false)
>  |-- dt: string (nullable = true)
> .
> java.lang.IllegalArgumentException: Can't find primaryKey `uuid` in root
>  |-- _hoodie_commit_time: string (nullable = true)
>  |-- _hoodie_commit_seqno: string (nullable = true)
>  |-- _hoodie_record_key: string (nullable = true)
>  |-- _hoodie_partition_path: string (nullable = true)
>  |-- _hoodie_file_name: string (nullable = true)
>  |-- id: integer (nullable = false)
>  |-- name: string (nullable = true)
>  |-- value: integer (nullable = false)
>  |-- ts: integer (nullable = false)
>  |-- dt: string (nullable = true)
> .
>     at 
> org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
>     at 
> org.apache.spark.sql.hudi.HoodieOptionConfig$$anonfun$validateTable$1.apply(HoodieOptionConfig.scala:201)
>     at 
> org.apache.spark.sql.hudi.HoodieOptionConfig$$anonfun$validateTable$1.apply(HoodieOptionConfig.scala:200)
>     at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>     at 
> org.apache.spark.sql.hudi.HoodieOptionConfig$.validateTable(HoodieOptionConfig.scala:200)
>     at 
> org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable.parseSchemaAndConfigs(HoodieCatalogTable.scala:256)
>     at 
> org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable.initHoodieTable(HoodieCatalogTable.scala:171)
>     at 
> org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:99){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2608) Support JSON schema in schema registry provider

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2608:
-
Sprint: 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  
(was: 2022/11/29, 2022/12/12, 0.13.0 Final Sprint)

> Support JSON schema in schema registry provider
> ---
>
> Key: HUDI-2608
> URL: https://issues.apache.org/jira/browse/HUDI-2608
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: deltastreamer
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: sev:normal, user-support-issues
> Fix For: 0.13.0
>
>
> To work with the JSON Kafka source.
>  
> Original issue
> https://github.com/apache/hudi/issues/3835



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5423) Flaky test: ColumnStatsTestCase(MERGE_ON_READ,true,true)

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5423:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Flaky test: ColumnStatsTestCase(MERGE_ON_READ,true,true)
> 
>
> Key: HUDI-5423
> URL: https://issues.apache.org/jira/browse/HUDI-5423
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Raymond Xu
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> {code}
> [ERROR] Tests run: 94, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 
> 1,729.267 s <<< FAILURE! - in JUnit Vintage
> [ERROR] [8] 
> ColumnStatsTestCase(MERGE_ON_READ,true,true)(testMetadataColumnStatsIndex(ColumnStatsTestCase))
>   Time elapsed: 23.246 s  <<< FAILURE!
> org.opentest4j.AssertionFailedError: 
> expected: 
> <{"c1_maxValue":101,"c1_minValue":101,"c1_nullCount":0,"c2_maxValue":" 
> 999sdc","c2_minValue":" 
> 999sdc","c2_nullCount":0,"c3_maxValue":10.329,"c3_minValue":10.329,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.179Z","c4_minValue":"2021-11-19T07:34:44.179Z","c4_nullCount":0,"c5_maxValue":99,"c5_minValue":99,"c5_nullCount":0,"c6_maxValue":"2020-03-28","c6_minValue":"2020-03-28","c6_nullCount":0,"c7_maxValue":"SA==","c7_minValue":"SA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":1}
> {"c1_maxValue":562,"c1_minValue":323,"c1_nullCount":0,"c2_maxValue":" 
> 984sdc","c2_minValue":" 
> 980sdc","c2_nullCount":0,"c3_maxValue":977.328,"c3_minValue":64.768,"c3_nullCount":1,"c4_maxValue":"2021-11-19T07:34:44.201Z","c4_minValue":"2021-11-19T07:34:44.181Z","c4_nullCount":0,"c5_maxValue":78,"c5_minValue":34,"c5_nullCount":0,"c6_maxValue":"2020-10-21","c6_minValue":"2020-01-15","c6_nullCount":0,"c7_maxValue":"SA==","c7_minValue":"qw==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":4}
> {"c1_maxValue":568,"c1_minValue":8,"c1_nullCount":0,"c2_maxValue":" 
> 8sdc","c2_minValue":" 
> 111sdc","c2_nullCount":0,"c3_maxValue":979.272,"c3_minValue":82.111,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.193Z","c4_minValue":"2021-11-19T07:34:44.159Z","c4_nullCount":0,"c5_maxValue":58,"c5_minValue":2,"c5_nullCount":0,"c6_maxValue":"2020-11-08","c6_minValue":"2020-01-01","c6_nullCount":0,"c7_maxValue":"9g==","c7_minValue":"Ag==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":15}
> {"c1_maxValue":619,"c1_minValue":619,"c1_nullCount":0,"c2_maxValue":" 
> 985sdc","c2_minValue":" 
> 985sdc","c2_nullCount":0,"c3_maxValue":230.320,"c3_minValue":230.320,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.180Z","c4_minValue":"2021-11-19T07:34:44.180Z","c4_nullCount":0,"c5_maxValue":33,"c5_minValue":33,"c5_nullCount":0,"c6_maxValue":"2020-02-13","c6_minValue":"2020-02-13","c6_nullCount":0,"c7_maxValue":"QA==","c7_minValue":"QA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":1}
> {"c1_maxValue":633,"c1_minValue":624,"c1_nullCount":0,"c2_maxValue":" 
> 987sdc","c2_minValue":" 
> 986sdc","c2_nullCount":0,"c3_maxValue":580.317,"c3_minValue":375.308,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.180Z","c4_minValue":"2021-11-19T07:34:44.180Z","c4_nullCount":0,"c5_maxValue":33,"c5_minValue":32,"c5_nullCount":0,"c6_maxValue":"2020-10-10","c6_minValue":"2020-01-01","c6_nullCount":0,"c7_maxValue":"PQ==","c7_minValue":"NA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":2}
> {"c1_maxValue":639,"c1_minValue":555,"c1_nullCount":0,"c2_maxValue":" 
> 989sdc","c2_minValue":" 
> 982sdc","c2_nullCount":0,"c3_maxValue":904.304,"c3_minValue":153.431,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.186Z","c4_minValue":"2021-11-19T07:34:44.179Z","c4_nullCount":0,"c5_maxValue":44,"c5_minValue":31,"c5_nullCount":0,"c6_maxValue":"2020-08-25","c6_minValue":"2020-03-12","c6_nullCount":0,"c7_maxValue":"MA==","c7_minValue":"rw==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":3}
> {"c1_maxValue":715,"c1_minValue":76,"c1_nullCount":0,"c2_maxValue":" 
> 76sdc","c2_minValue":" 
> 224sdc","c2_nullCount":0,"c3_maxValue":958.579,"c3_minValue":246.427,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.199Z","c4_minValue":"2021-11-19T07:34:44.166Z","c4_nullCount":0,"c5_maxValue":73,"c5_minValue":9,"c5_nullCount":0,"c6_maxValue":"2020-11-21","c6_minValue":"2020-01-16","c6_nullCount":0,"c7_maxValue":"+g==","c7_minValue":"LA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":12}
> {"c1_maxValue":768,"c1_minValue":59,"c1_nullCount":0,"c2_maxValue":" 
> 768sdc","c2_minValue":" 
> 118sdc","c2_nullCount":0,"c3_maxValue":959.131,"c3_minValue":64.768,"c3_nullCount":0,"c4_maxValue":"2021-11-

[jira] [Updated] (HUDI-5169) Re-attempt failed rollback (regular commits, clustering) and get it to completion

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5169:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Re-attempt failed rollback (regular commits, clustering) and get it to 
> completion
> -
>
> Key: HUDI-5169
> URL: https://issues.apache.org/jira/browse/HUDI-5169
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> If a rollback removed the commit to be rolled back but failed to complete, we 
> may never get to complete it, because the original commit that needs to be 
> rolled back is no longer seen in the timeline. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3517:
-
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final 
Sprint)

> Unicode in partition path causes it to be resolved wrongly
> --
>
> Key: HUDI-3517
> URL: https://issues.apache.org/jira/browse/HUDI-3517
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql, writer-core
>Affects Versions: 0.10.1
>Reporter: Ji Qi
>Assignee: Lokesh Jain
>Priority: Blocker
>  Labels: hudi-on-call
> Fix For: 0.13.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When there is unicode in the partition path, the upsert fails.
> h3. To reproduce
>  # Create this dataframe in spark-shell (note the dotted I)
> {code:none}
> scala> res0.show(truncate=false)
> +---+---+
> |_c0|_c1|
> +---+---+
> |1  |İ  |
> +---+---+
> {code}
>  # Write it to hudi (this write will create the hudi table and succeed)
> {code:none}
>  res0.write.format("hudi").option("hoodie.table.name", 
> "unicode_test").option("hoodie.datasource.write.precombine.field", 
> "_c0").option("hoodie.datasource.write.recordkey.field", 
> "_c0").option("hoodie.datasource.write.partitionpath.field", 
> "_c1").mode("append").save("file:///Users/ji.qi/Desktop/unicode_test")
> {code}
>  # Try to write {{res0}} again (this upsert will fail at index lookup stage)
> Environment
>  * Hudi version: 0.10.1
>  * Spark version: 3.1.2
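>
> One knob that may be relevant here (an assumption, not a confirmed fix): Hudi 
> can URL-encode partition-path values at write time, which avoids raw unicode 
> landing in the file-system path — a sketch using the Java DataFrameWriter API:
> {code:java}
> // Hypothetical workaround sketch; whether it resolves this exact issue is
> // not confirmed in the ticket. "df" is the Dataset<Row> from the repro above.
> df.write().format("hudi")
>     .option("hoodie.table.name", "unicode_test")
>     .option("hoodie.datasource.write.precombine.field", "_c0")
>     .option("hoodie.datasource.write.recordkey.field", "_c0")
>     .option("hoodie.datasource.write.partitionpath.field", "_c1")
>     // Encode partition-path values (e.g. "İ" -> "%C4%B0") on disk.
>     .option("hoodie.datasource.write.partitionpath.urlencode", "true")
>     .mode("append")
>     .save("file:///Users/ji.qi/Desktop/unicode_test");
> {code}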
> h3. Stacktrace
> {code:none}
> 22/02/25 18:23:14 INFO RemoteHoodieTableFileSystemView: Sending request : 
> (http://192.168.1.148:54043/v1/hoodie/view/datafile/latest/partition?partition=%C4%B0&basepath=file%3A%2FUsers%2Fji.qi%2FDesktop%2Funicode_test&fileid=31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0&lastinstantts=20220225182311228&timelinehash=65c5a6a5c6836dc4f7805550e81ca034b30ad85c38794f9f8ce68a9e914aab83)
> 22/02/25 18:23:14 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 403)
> org.apache.hudi.exception.HoodieIOException: Failed to read footer for 
> parquet 
> file:/Users/ji.qi/Desktop/unicode_test/Ä°/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
>   at 
> org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185)
>   at 
> org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201)
>   at 
> org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
>   at 
> org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
>   at 
> org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
>   at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149)
>   at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
>   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
>   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPo

[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5238:
-
Sprint: 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 
Final Sprint 2  (was: 2022/11/15, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint)

> Hudi throwing "PipeBroken" exception during Merging on GCS
> --
>
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Originally reported at [https://github.com/apache/hudi/issues/7234]
> ---
>  
> Root-cause:
> Basically, the reason it's failing is the following:
>  # GCS uses PipedInputStream/PipedOutputStream as the reading/writing ends of 
> the “pipe” it uses for unidirectional communication between threads
>  # PipedInputStream (for whatever reason) remembers the thread that actually 
> wrote into the pipe
>  # In BoundedInMemoryQueue we're bootstrapping new executors (read: threads) 
> for reading and _writing_ (it's only used in HoodieMergeHandle, and in 
> bulk-insert)
>  # When we're done writing in HoodieMergeHelper, we're shutting down *first* 
> the BIMQ, then the HoodieMergeHandle, and that's exactly why it fails (see the 
> sketch after this description)
>  
> Issue has been introduced at [https://github.com/apache/hudi/pull/4264/files]
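>
> A minimal repro of the hazard (plain JDK, not Hudi code): a PipedInputStream 
> keeps a reference to the thread that wrote into the pipe, so tearing that 
> thread down before the reader has drained the pipe surfaces as a broken pipe:
> {code:java}
> import java.io.IOException;
> import java.io.PipedInputStream;
> import java.io.PipedOutputStream;
>
> public class PipeShutdownOrder {
>   public static void main(String[] args) throws Exception {
>     PipedOutputStream out = new PipedOutputStream();
>     PipedInputStream in = new PipedInputStream(out);
>     Thread writer = new Thread(() -> {
>       try {
>         out.write(new byte[]{1, 2, 3}); // note: never closes "out"
>       } catch (IOException ignored) {
>       }
>     });
>     writer.start();
>     writer.join();                    // the writing "executor" dies first
>     in.read(); in.read(); in.read();  // buffered bytes are still readable
>     in.read();                        // throws IOException: write end dead / pipe broken
>   }
> }
> {code}
> The sketch shows only the JDK-level behaviour; the fix presumably amounts to 
> reordering the shutdown so the pipe is not written to or closed after its 
> writing thread has been torn down.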



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5323) Decouple virtual key with writing bloom filters to parquet files

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5323:
-
Sprint: 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 
2022/12/12, 0.13.0 Final Sprint)

> Decouple virtual key with writing bloom filters to parquet files
> 
>
> Key: HUDI-5323
> URL: https://issues.apache.org/jira/browse/HUDI-5323
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: index, writer-core
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When the virtual key feature is enabled by setting 
> hoodie.populate.meta.fields to false, the bloom filters are not written to 
> parquet base files in the write transactions.  Relevant logic in 
> HoodieFileWriterFactory class:
> {code:java}
> private static  
> HoodieFileWriter newParquetFileWriter(
> String instantTime, Path path, HoodieWriteConfig config, Schema schema, 
> HoodieTable hoodieTable,
> TaskContextSupplier taskContextSupplier, boolean populateMetaFields) 
> throws IOException {
>   return newParquetFileWriter(instantTime, path, config, schema, 
> hoodieTable.getHadoopConf(),
>   taskContextSupplier, populateMetaFields, populateMetaFields);
> }
> private static  
> HoodieFileWriter newParquetFileWriter(
> String instantTime, Path path, HoodieWriteConfig config, Schema schema, 
> Configuration conf,
> TaskContextSupplier taskContextSupplier, boolean populateMetaFields, 
> boolean enableBloomFilter) throws IOException {
>   Option filter = enableBloomFilter ? 
> Option.of(createBloomFilter(config)) : Option.empty();
>   HoodieAvroWriteSupport writeSupport = new HoodieAvroWriteSupport(new 
> AvroSchemaConverter(conf).convert(schema), schema, filter);
>   HoodieParquetConfig parquetConfig = new 
> HoodieParquetConfig<>(writeSupport, config.getParquetCompressionCodec(),
>   config.getParquetBlockSize(), config.getParquetPageSize(), 
> config.getParquetMaxFileSize(),
>   conf, config.getParquetCompressionRatio(), 
> config.parquetDictionaryEnabled());
>   return new HoodieAvroParquetWriter<>(path, parquetConfig, instantTime, 
> taskContextSupplier, populateMetaFields);
> } {code}
> Given that the bloom filters are absent, when using the Bloom Index on the same 
> table, the writer encounters an NPE (HUDI-5319).
> We should decouple the virtual key feature from bloom filters and always write 
> the bloom filters to the parquet files. 
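>
> Based on the factory code above, the decoupling is essentially a one-liner 
> (a sketch; the exact change may differ): stop deriving enableBloomFilter from 
> populateMetaFields and pass true unconditionally:
> {code:java}
> // Illustrative change only: always enable the bloom filter, regardless of
> // whether meta fields (and thus physical record keys) are populated.
> return newParquetFileWriter(instantTime, path, config, schema, hoodieTable.getHadoopConf(),
>     taskContextSupplier, populateMetaFields, /* enableBloomFilter */ true);
> {code}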



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5498) Update docs for reading Hudi tables on Databricks runtime

2023-01-09 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5498:

Story Points: 0.5  (was: 1)

> Update docs for reading Hudi tables on Databricks runtime
> -
>
> Key: HUDI-5498
> URL: https://issues.apache.org/jira/browse/HUDI-5498
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> We need to document how users can read Hudi tables on Databricks Spark 
> runtime. 
> Relevant fix: [https://github.com/apache/hudi/pull/7088]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3775) Allow for offline compaction of MOR tables via spark streaming

2023-01-09 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3775:
--
Story Points: 1  (was: 2)

> Allow for offline compaction of MOR tables via spark streaming
> --
>
> Key: HUDI-3775
> URL: https://issues.apache.org/jira/browse/HUDI-3775
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, spark
>Reporter: Rajesh
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: easyfix, pull-request-available
> Fix For: 0.13.0
>
> Attachments: impressions.avro, run_stuff.txt, scala_commands.txt
>
>
> Currently there is no way to avoid compaction taking up a lot of resources 
> when run inline or async for MOR tables via Spark Streaming. Delta Streamer 
> has ways to assign resources between ingestion and async compaction but Spark 
> Streaming does not have that option. 
> Introducing a flag to turn off automatic compaction and allowing users to run 
> compaction in a separate process will decouple both concerns.
> This will also allow users to size the cluster just for ingestion and 
> deal with compaction separately, without blocking.  We will need to look into 
> documenting best practices for running offline compaction.
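>
> As an illustration of the flag being proposed (standard Hudi config keys; the 
> wiring is a sketch): the streaming writer disables automatic compaction, and a 
> separate process runs it later:
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> // Streaming-writer side: turn inline and async compaction off, so log files
> // accumulate until an external compaction job picks them up.
> Map<String, String> hudiOpts = new HashMap<>();
> hudiOpts.put("hoodie.compact.inline", "false");
> hudiOpts.put("hoodie.datasource.compaction.async.enable", "false");
> {code}
> The separate process could then be an offline compaction job (for example the 
> hudi-utilities compactor), sized independently of the ingestion cluster.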



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5349) Clean up partially failed restore if any

2023-01-09 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-5349:
--
Story Points: 1

> Clean up partially failed restore if any
> 
>
> Key: HUDI-5349
> URL: https://issues.apache.org/jira/browse/HUDI-5349
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> If a table was attempted w/ a "restore" operation and it failed mid-way, the 
> partial restore could still be lying around. When re-attempted, a new instant 
> time will be allotted and the restore re-attempted from scratch, but this may 
> thwart compaction progression in the MDT. So we need to ensure that, for a 
> given savepoint, we always re-use the existing restore instant, if any. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5520) Fail MDT when list of log files grow > 1000

2023-01-09 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-5520:
--
Priority: Blocker  (was: Critical)

> Fail MDT when list of log files grow > 1000
> ---
>
> Key: HUDI-5520
> URL: https://issues.apache.org/jira/browse/HUDI-5520
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Blocker
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5442) Fix HiveHoodieTableFileIndex to use lazy listing

2023-01-09 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5442:

Story Points: 1  (was: 2)

> Fix HiveHoodieTableFileIndex to use lazy listing
> 
>
> Key: HUDI-5442
> URL: https://issues.apache.org/jira/browse/HUDI-5442
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, trino-presto
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.13.0
>
>
> Currently, HiveHoodieTableFileIndex hard-codes shouldListLazily to false, 
> using eager listing only.  This leads to scanning all table partitions in the 
> file index, regardless of the queryPaths provided (for the Trino Hive connector, 
> only one partition is passed in).
> {code:java}
> public HiveHoodieTableFileIndex(HoodieEngineContext engineContext,
> HoodieTableMetaClient metaClient,
> TypedProperties configProperties,
> HoodieTableQueryType queryType,
> List queryPaths,
> Option specifiedQueryInstant,
> boolean shouldIncludePendingCommits
> ) {
>   super(engineContext,
>   metaClient,
>   configProperties,
>   queryType,
>   queryPaths,
>   specifiedQueryInstant,
>   shouldIncludePendingCommits,
>   true,
>   new NoopCache(),
>   false);
> } {code}
> After flipping it to true for testing, the following exception is thrown.
> {code:java}
> io.trino.spi.TrinoException: Failed to parse partition column values from the 
> partition-path: likely non-encoded slashes being used in partition column's 
> values. You can try to work this around by switching listing mode to eager
>     at 
> io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:284)
>     at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
>     at io.trino.$gen.Trino_39220221217_092723_2.run(Unknown Source)
>     at 
> io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>     at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: org.apache.hudi.exception.HoodieException: Failed to parse 
> partition column values from the partition-path: likely non-encoded slashes 
> being used in partition column's values. You can try to work this around by 
> switching listing mode to eager
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.parsePartitionColumnValues(BaseHoodieTableFileIndex.java:317)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.lambda$listPartitionPaths$6(BaseHoodieTableFileIndex.java:288)
>     at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
>     at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
>     at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
>     at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
>     at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
>     at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPaths(BaseHoodieTableFileIndex.java:291)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:205)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllInputFileSlices(BaseHoodieTableFileIndex.java:216)
>     at 
> org.apache.hudi.hadoop.HiveHoodieTableFileIndex.listFileSlices(HiveHoodieTableFileIndex.java:71)
>     at 
> org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:263)
>     at 
> org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:158)
>     at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
>     at 
> org.apache.hudi.hadoop.HoodieParquetInputFormatBase.getSplits(HoodieParquetInputFormatBase.java:68)
>     at 
> io.trino.plugin.hive.BackgroundHiveSplitLoader.lambda$loadPartition$2(BackgroundHiveSplitLoader.java:493)
>     at 
> io.trino.plugin.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:25)
>     at io.trino.plugin.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:97)
>     at 
> io

[jira] [Updated] (HUDI-5408) Partially failed commits in MDT have to be rolled back in all cases

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5408:
-
Status: In Progress  (was: Open)

> Partially failed commits in MDT have to be rolled back in all cases
> ---
>
> Key: HUDI-5408
> URL: https://issues.apache.org/jira/browse/HUDI-5408
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When compaction fails after completing in the MDT but before completing in the DT, 
> and we later re-attempt to apply the same compaction instant to the MDT, we 
> might miss rolling back a partially failed commit in the MDT. 
> Code of interest in SparkHoodieBackedTableMetadataWriter:
> {code:java}
> if (!metadataMetaClient.getActiveTimeline().containsInstant(instantTime)) {
>   // if this is a new commit being applied to metadata for the first time
>   writeClient.startCommitWithTime(instantTime);
> } else {
>   Option alreadyCompletedInstant = 
> metadataMetaClient.getActiveTimeline().filterCompletedInstants().filter(entry 
> -> entry.getTimestamp().equals(instantTime)).lastInstant();
>   if (alreadyCompletedInstant.isPresent()) {
> // this code path refers to a re-attempted commit that got committed to 
> metadata table, but failed in datatable.
> // for eg, lets say compaction c1 on 1st attempt succeeded in metadata 
> table and failed before committing to datatable.
> // when retried again, data table will first rollback pending compaction. 
> these will be applied to metadata table, but all changes
> // are upserts to metadata table and so only a new delta commit will be 
> created.
> // once rollback is complete, compaction will be retried again, which 
> will eventually hit this code block where the respective commit is
> // already part of completed commit. So, we have to manually remove the 
> completed instant and proceed.
> // and it is for the same reason we enabled 
> withAllowMultiWriteOnSameInstant for metadata table.
> HoodieActiveTimeline.deleteInstantFile(metadataMetaClient.getFs(), 
> metadataMetaClient.getMetaPath(), alreadyCompletedInstant.get());
> metadataMetaClient.reloadActiveTimeline();
>   }
>   // If the alreadyCompletedInstant is empty, that means there is a requested 
> or inflight
>   // instant with the same instant time.  This happens for data table clean 
> action which
>   // reuses the same instant time without rollback first.  It is a no-op here 
> as the
>   // clean plan is the same, so we don't need to delete the requested and 
> inflight instant
>   // files in the active timeline.
> } {code}
> In the else block, if there happens to be a partially failed commit in the MDT, 
> we may miss rolling it back. 
> We might need to fix the flow. 
>  
> Important to consider: 
> even before attempting compaction, we should ensure there are no partially 
> failed commits in the MDT. If there are, we need to ensure we consider the 
> list of valid instants while executing the compaction. 
>  
> Impact:
> some invalid data blocks will be considered valid since we fail to do eager 
> rollbacks. 
>  
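>
> A sketch of the guard being suggested (hypothetical wiring; the real flow 
> lives in SparkHoodieBackedTableMetadataWriter): before (re-)applying a 
> compaction instant to the MDT, eagerly roll back any pending delta commits:
> {code:java}
> import java.util.List;
> import java.util.stream.Collectors;
> import org.apache.hudi.common.table.timeline.HoodieInstant;
>
> // Illustrative only: ensure no partially failed commits remain in the MDT
> // before the compaction instant is applied (in practice the current instant
> // itself would be excluded from this scan).
> List<HoodieInstant> pending = metadataMetaClient.getActiveTimeline()
>     .filterInflightsAndRequested()
>     .getInstants()
>     .collect(Collectors.toList());
> for (HoodieInstant instant : pending) {
>   writeClient.rollback(instant.getTimestamp()); // eager rollback
> }
> metadataMetaClient.reloadActiveTimeline();
> {code}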



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5408) Partially failed commits in MDT have to be rolled back in all cases

2023-01-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5408:
-
Status: Patch Available  (was: In Progress)

> Partially failed commits in MDT have to be rolled back in all cases
> ---
>
> Key: HUDI-5408
> URL: https://issues.apache.org/jira/browse/HUDI-5408
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When compaction fails after completing in the MDT but before completing in the DT, 
> and we later re-attempt to apply the same compaction instant to the MDT, we 
> might miss rolling back a partially failed commit in the MDT. 
> Code of interest in SparkHoodieBackedTableMetadataWriter:
> {code:java}
> if (!metadataMetaClient.getActiveTimeline().containsInstant(instantTime)) {
>   // if this is a new commit being applied to metadata for the first time
>   writeClient.startCommitWithTime(instantTime);
> } else {
>   Option alreadyCompletedInstant = 
> metadataMetaClient.getActiveTimeline().filterCompletedInstants().filter(entry 
> -> entry.getTimestamp().equals(instantTime)).lastInstant();
>   if (alreadyCompletedInstant.isPresent()) {
> // this code path refers to a re-attempted commit that got committed to 
> metadata table, but failed in datatable.
> // for eg, lets say compaction c1 on 1st attempt succeeded in metadata 
> table and failed before committing to datatable.
> // when retried again, data table will first rollback pending compaction. 
> these will be applied to metadata table, but all changes
> // are upserts to metadata table and so only a new delta commit will be 
> created.
> // once rollback is complete, compaction will be retried again, which 
> will eventually hit this code block where the respective commit is
> // already part of completed commit. So, we have to manually remove the 
> completed instant and proceed.
> // and it is for the same reason we enabled 
> withAllowMultiWriteOnSameInstant for metadata table.
> HoodieActiveTimeline.deleteInstantFile(metadataMetaClient.getFs(), 
> metadataMetaClient.getMetaPath(), alreadyCompletedInstant.get());
> metadataMetaClient.reloadActiveTimeline();
>   }
>   // If the alreadyCompletedInstant is empty, that means there is a requested 
> or inflight
>   // instant with the same instant time.  This happens for data table clean 
> action which
>   // reuses the same instant time without rollback first.  It is a no-op here 
> as the
>   // clean plan is the same, so we don't need to delete the requested and 
> inflight instant
>   // files in the active timeline.
> } {code}
> In the else block, if there happens to be a partially failed commit in the MDT, 
> we may miss rolling it back. 
> We might need to fix the flow. 
>  
> Important to consider: 
> even before attempting compaction, we should ensure there are no partially 
> failed commits in the MDT. If there are, we need to ensure we consider the 
> list of valid instants while executing the compaction. 
>  
> Impact:
> some invalid data blocks will be considered valid since we fail to do eager 
> rollbacks. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5408) Partially failed commits in MDT have to be rolled back in all cases

2023-01-09 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-5408:
--
Priority: Blocker  (was: Critical)

> Partially failed commits in MDT have to be rolled back in all cases
> ---
>
> Key: HUDI-5408
> URL: https://issues.apache.org/jira/browse/HUDI-5408
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When compaction fails after completing in the MDT but before completing in the DT, 
> and we later re-attempt to apply the same compaction instant to the MDT, we 
> might miss rolling back a partially failed commit in the MDT. 
> Code of interest in SparkHoodieBackedTableMetadataWriter:
> {code:java}
> if (!metadataMetaClient.getActiveTimeline().containsInstant(instantTime)) {
>   // if this is a new commit being applied to metadata for the first time
>   writeClient.startCommitWithTime(instantTime);
> } else {
>   Option alreadyCompletedInstant = 
> metadataMetaClient.getActiveTimeline().filterCompletedInstants().filter(entry 
> -> entry.getTimestamp().equals(instantTime)).lastInstant();
>   if (alreadyCompletedInstant.isPresent()) {
> // this code path refers to a re-attempted commit that got committed to 
> metadata table, but failed in datatable.
> // for eg, lets say compaction c1 on 1st attempt succeeded in metadata 
> table and failed before committing to datatable.
> // when retried again, data table will first rollback pending compaction. 
> these will be applied to metadata table, but all changes
> // are upserts to metadata table and so only a new delta commit will be 
> created.
> // once rollback is complete, compaction will be retried again, which 
> will eventually hit this code block where the respective commit is
> // already part of completed commit. So, we have to manually remove the 
> completed instant and proceed.
> // and it is for the same reason we enabled 
> withAllowMultiWriteOnSameInstant for metadata table.
> HoodieActiveTimeline.deleteInstantFile(metadataMetaClient.getFs(), 
> metadataMetaClient.getMetaPath(), alreadyCompletedInstant.get());
> metadataMetaClient.reloadActiveTimeline();
>   }
>   // If the alreadyCompletedInstant is empty, that means there is a requested 
> or inflight
>   // instant with the same instant time.  This happens for data table clean 
> action which
>   // reuses the same instant time without rollback first.  It is a no-op here 
> as the
>   // clean plan is the same, so we don't need to delete the requested and 
> inflight instant
>   // files in the active timeline.
> } {code}
> In the else block, if there happens to be a partially failed commit in the MDT, 
> we may miss rolling it back. 
> We might need to fix the flow. 
>  
> Important to consider: 
> even before attempting compaction, we should ensure there are no partially 
> failed commits in the MDT. If there are, we need to ensure we consider the 
> list of valid instants while executing the compaction. 
>  
> Impact:
> some invalid data blocks will be considered valid since we fail to do eager 
> rollbacks. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5408) Partially failed commits in MDT have to be rolled back in all cases

2023-01-09 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-5408:
--
Story Points: 1  (was: 2)

> Partially failed commits in MDT have to be rolled back in all cases
> ---
>
> Key: HUDI-5408
> URL: https://issues.apache.org/jira/browse/HUDI-5408
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When compaction fails after completing in the MDT but before completing in the DT, 
> and we later re-attempt to apply the same compaction instant to the MDT, we 
> might miss rolling back a partially failed commit in the MDT. 
> Code of interest in SparkHoodieBackedTableMetadataWriter:
> {code:java}
> if (!metadataMetaClient.getActiveTimeline().containsInstant(instantTime)) {
>   // if this is a new commit being applied to metadata for the first time
>   writeClient.startCommitWithTime(instantTime);
> } else {
>   Option alreadyCompletedInstant = 
> metadataMetaClient.getActiveTimeline().filterCompletedInstants().filter(entry 
> -> entry.getTimestamp().equals(instantTime)).lastInstant();
>   if (alreadyCompletedInstant.isPresent()) {
> // this code path refers to a re-attempted commit that got committed to 
> metadata table, but failed in datatable.
> // for eg, lets say compaction c1 on 1st attempt succeeded in metadata 
> table and failed before committing to datatable.
> // when retried again, data table will first rollback pending compaction. 
> these will be applied to metadata table, but all changes
> // are upserts to metadata table and so only a new delta commit will be 
> created.
> // once rollback is complete, compaction will be retried again, which 
> will eventually hit this code block where the respective commit is
> // already part of completed commit. So, we have to manually remove the 
> completed instant and proceed.
> // and it is for the same reason we enabled 
> withAllowMultiWriteOnSameInstant for metadata table.
> HoodieActiveTimeline.deleteInstantFile(metadataMetaClient.getFs(), 
> metadataMetaClient.getMetaPath(), alreadyCompletedInstant.get());
> metadataMetaClient.reloadActiveTimeline();
>   }
>   // If the alreadyCompletedInstant is empty, that means there is a requested 
> or inflight
>   // instant with the same instant time.  This happens for data table clean 
> action which
>   // reuses the same instant time without rollback first.  It is a no-op here 
> as the
>   // clean plan is the same, so we don't need to delete the requested and 
> inflight instant
>   // files in the active timeline.
> } {code}
> In the else block, if there happens to be a partially failed commit in the MDT, 
> we may miss rolling it back. 
> We might need to fix the flow. 
>  
> Important to consider: 
> even before attempting compaction, we should ensure there are no partially 
> failed commits in the MDT. If there are, we need to ensure we consider the 
> list of valid instants while executing the compaction. 
>  
> Impact:
> some invalid data blocks will be considered valid since we fail to do eager 
> rollbacks. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

