Re: [PR] [HUDI-7034] Refresh index fix - remove cached file slices within part… [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10151:
URL: https://github.com/apache/hudi/pull/10151#issuecomment-1823920564

   
   ## CI report:
   
   * 190b9df539423cb5da8f01b400426d9e97f7bab4 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21098)
 
   * a41d81f6116784b2f006ed4e58ac7a755410f848 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1823920316

   
   ## CI report:
   
   * 35fed0de0587b411f9470e1c69db43501df5a725 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21121)
 
   * 6024de8ab05dd38e9cdb58afc10b70991542c392 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21122)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7135] Spark reads hudi table error when flink creates the table without pre… [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10157:
URL: https://github.com/apache/hudi/pull/10157#issuecomment-1823920594

   
   ## CI report:
   
   * 3b24d4130099aab67c76de81f77701c730f2e78a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21105)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7135] Spark reads hudi table error when flink creates the table without pre… [hudi]

2023-11-22 Thread via GitHub


empcl commented on PR #10157:
URL: https://github.com/apache/hudi/pull/10157#issuecomment-1823916790

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1823913218

   
   ## CI report:
   
   * 35fed0de0587b411f9470e1c69db43501df5a725 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21121)
 
   * 6024de8ab05dd38e9cdb58afc10b70991542c392 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7086] Scaling gcs event source [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10073:
URL: https://github.com/apache/hudi/pull/10073#issuecomment-1823907351

   
   ## CI report:
   
   * 868ba59ecf1a08d7b73a7121429103c2134b291f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21119)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1823907225

   
   ## CI report:
   
   * 35fed0de0587b411f9470e1c69db43501df5a725 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21121)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]

2023-11-22 Thread via GitHub


danny0405 commented on code in PR #10002:
URL: https://github.com/apache/hudi/pull/10002#discussion_r1402973135


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/RocksDbBasedFileSystemView.java:
##
@@ -553,6 +553,10 @@ protected void removeReplacedFileIdsAtInstants(Set<String> instants) {
     );
   }
 
+  protected boolean hasReplacedFilesInPartition(String partitionPath) {
+    throw new UnsupportedOperationException("isReplacedFileExistWithinSpecifiedPartition() is not supported for RocksDbBasedFileSystemView!");

Review Comment:
   We must support it correctly. Actually there is no need to override it.
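   
   For illustration, a hedged reading of the suggestion: drop the throwing override and inherit the generic logic. A minimal Java sketch of what the inherited base-class implementation might look like (the helper names below are assumptions, not verified against AbstractTableFileSystemView):
   
   ```java
   // Hypothetical sketch: rely on a base-class implementation instead of
   // overriding hasReplacedFilesInPartition to throw. The helper names
   // (fetchAllStoredFileGroups, isFileGroupReplaced) are assumed here.
   protected boolean hasReplacedFilesInPartition(String partitionPath) {
     return fetchAllStoredFileGroups(partitionPath)
         .anyMatch(fg -> isFileGroupReplaced(fg.getFileGroupId()));
   }
   ```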



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]

2023-11-22 Thread via GitHub


danny0405 commented on code in PR #10002:
URL: https://github.com/apache/hudi/pull/10002#discussion_r1402973135


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/RocksDbBasedFileSystemView.java:
##
@@ -553,6 +553,10 @@ protected void removeReplacedFileIdsAtInstants(Set<String> instants) {
     );
   }
 
+  protected boolean hasReplacedFilesInPartition(String partitionPath) {
+    throw new UnsupportedOperationException("isReplacedFileExistWithinSpecifiedPartition() is not supported for RocksDbBasedFileSystemView!");

Review Comment:
   We must support it correctly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]

2023-11-22 Thread via GitHub


danny0405 commented on code in PR #10002:
URL: https://github.com/apache/hudi/pull/10002#discussion_r1402972184


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/RemoteHoodieTableFileSystemView.java:
##
@@ -202,6 +207,13 @@ private Map<String, String> getParamsWithPartitionPath(String partitionPath) {
     return paramsMap;
   }
 
+  private Map<String, String> getParamsWithPartitionPaths(List<String> partitionPaths) {

Review Comment:
   This method is useless now, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]

2023-11-22 Thread via GitHub


danny0405 commented on code in PR #10002:
URL: https://github.com/apache/hudi/pull/10002#discussion_r1402970319


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##
@@ -598,35 +603,49 @@ private FileSlice filterUncommittedLogs(FileSlice fileSlice) {
   }
 
   protected HoodieFileGroup addBootstrapBaseFileIfPresent(HoodieFileGroup fileGroup) {
+    return addBootstrapBaseFileIfPresent(fileGroup, this::getBootstrapBaseFile);
+  }
+
+  protected HoodieFileGroup addBootstrapBaseFileIfPresent(HoodieFileGroup fileGroup, Function<HoodieFileGroupId, Option<BootstrapBaseFileMapping>> bootstrapBaseFileMappingFunc) {
     boolean hasBootstrapBaseFile = fileGroup.getAllFileSlices()
         .anyMatch(fs -> fs.getBaseInstantTime().equals(METADATA_BOOTSTRAP_INSTANT_TS));
     if (hasBootstrapBaseFile) {
       HoodieFileGroup newFileGroup = new HoodieFileGroup(fileGroup);
       newFileGroup.getAllFileSlices().filter(fs -> fs.getBaseInstantTime().equals(METADATA_BOOTSTRAP_INSTANT_TS))
           .forEach(fs -> fs.setBaseFile(
-              addBootstrapBaseFileIfPresent(fs.getFileGroupId(), fs.getBaseFile().get())));
+              addBootstrapBaseFileIfPresent(fs.getFileGroupId(), fs.getBaseFile().get(), bootstrapBaseFileMappingFunc)));
       return newFileGroup;
     }
     return fileGroup;
   }
 
   protected FileSlice addBootstrapBaseFileIfPresent(FileSlice fileSlice) {
+    return addBootstrapBaseFileIfPresent(fileSlice, this::getBootstrapBaseFile);
+  }
+
+  protected FileSlice addBootstrapBaseFileIfPresent(FileSlice fileSlice, Function<HoodieFileGroupId, Option<BootstrapBaseFileMapping>> bootstrapBaseFileMappingFunc) {
     if (fileSlice.getBaseInstantTime().equals(METADATA_BOOTSTRAP_INSTANT_TS)) {
       FileSlice copy = new FileSlice(fileSlice);
       copy.getBaseFile().ifPresent(dataFile -> {
         Option<BootstrapBaseFileMapping> edf = getBootstrapBaseFile(copy.getFileGroupId());

Review Comment:
   Oops, this line is useless now; can you fix it?
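   
   Presumably the fix being asked for is to route the lookup through the injected function; a hedged one-line sketch, assuming the surrounding code quoted above:
   
   ```java
   // Hedged sketch of the suggested fix: use the passed-in mapping function
   // instead of calling getBootstrapBaseFile(...) directly, so that callers
   // supplying a custom bootstrapBaseFileMappingFunc actually take effect.
   Option<BootstrapBaseFileMapping> edf = bootstrapBaseFileMappingFunc.apply(copy.getFileGroupId());
   ```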



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7034] Refresh index fix - remove cached file slices within part… [hudi]

2023-11-22 Thread via GitHub


danny0405 commented on PR #10151:
URL: https://github.com/apache/hudi/pull/10151#issuecomment-1823886771

   Still got some compile issues:
   
   ```scala
   Error:  
/home/runner/work/hudi/hudi/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieFileIndex.scala:249:
 error: overloaded method constructor HoodieBaseFile with alternatives:
   Error:(x$1: String)org.apache.hudi.common.model.HoodieBaseFile 
   Error:(x$1: 
org.apache.hadoop.fs.FileStatus)org.apache.hudi.common.model.HoodieBaseFile 

   Error:(x$1: 
org.apache.hudi.common.model.HoodieBaseFile)org.apache.hudi.common.model.HoodieBaseFile
   Error:   cannot be applied to 
(org.apache.spark.sql.execution.datasources.FileStatusWithMetadata)
   Error:files.flatMap(_.files).map(new 
HoodieBaseFile(_)).map(_.getCommitTime).distinct
   Error:
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Flink SQL client cow table query error "org/apache/parquet/column/ColumnDescriptor" (but mor table query normal) [hudi]

2023-11-22 Thread via GitHub


xiaolan-bit commented on issue #6297:
URL: https://github.com/apache/hudi/issues/6297#issuecomment-1823881483

   Is there a problem with the parquet jar? My Flink version is 1.17.1 but the parquet version is 1.13.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Flink SQL client cow table query error "org/apache/parquet/column/ColumnDescriptor" (but mor table query normal) [hudi]

2023-11-22 Thread via GitHub


xiaolan-bit commented on issue #6297:
URL: https://github.com/apache/hudi/issues/6297#issuecomment-1823880643

   When I use `select *`, an error appears:

   java.lang.LinkageError: org/apache/parquet/column/ColumnDescriptor
	at org.apache.flink.formats.parquet.vector.reader.AbstractColumnReader.<init>(AbstractColumnReader.java:108)
	at org.apache.flink.formats.parquet.vector.reader.BytesColumnReader.<init>(BytesColumnReader.java:35)
	at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createColumnReader(ParquetSplitReaderUtil.java:364)
	at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createColumnReader(ParquetSplitReaderUtil.java:329)
	at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.readNextRowGroup(ParquetColumnarRowSplitReader.java:334)
	at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.nextBatch(ParquetColumnarRowSplitReader.java:310)
	at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.ensureBatch(ParquetColumnarRowSplitReader.java:292)
	at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.reachedEnd(ParquetColumnarRowSplitReader.java:271)
	at org.apache.hudi.table.format.ParquetSplitRecordIterator.hasNext(ParquetSplitRecordIterator.java:42)
	at org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.reachedEnd(CopyOnWriteInputFormat.java:283)
	at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:89)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:333)
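   
   For context, a hedged guess rather than a confirmed diagnosis: a java.lang.LinkageError on a parquet class usually means two copies of the parquet classes were loaded by different classloaders, for example one bundled inside the hudi-flink bundle jar and another parquet jar elsewhere under flink/lib. A common mitigation to try is removing the duplicate parquet jar, or switching Flink to parent-first classloading in flink-conf.yaml:
   
   ```yaml
   # Hedged suggestion, not a confirmed fix for this issue: prefer classes
   # from the Flink distribution (lib/) over job/bundle classes.
   classloader.resolve-order: parent-first
   ```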


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Flink SQL client cow table query error "org/apache/parquet/column/ColumnDescriptor" (but mor table query normal) [hudi]

2023-11-22 Thread via GitHub


xiaolan-bit commented on issue #6297:
URL: https://github.com/apache/hudi/issues/6297#issuecomment-1823878299

   How do I solve this issue? Do I need to add or replace a jar?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1823878156

   
   ## CI report:
   
   * fc27baa8c2df9135bc6e4b0d14e50a127ecb434f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21100)
 
   * 35fed0de0587b411f9470e1c69db43501df5a725 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] The INSERT records are marked as UPDATE [hudi]

2023-11-22 Thread via GitHub


zdl1 commented on issue #10156:
URL: https://github.com/apache/hudi/issues/10156#issuecomment-1823873879

   > there is no way to figure out whether a key has been written to an 
existing bucket before, except the first file slice, so all the records are 
updates.
   
   Thanks for the explanation, it really makes sense. I am wondering whether there is a way to get the real number of current records after some delta_commits. I previously used numberInserts, numberUpdates and numberDeletes to compute the count, but that approach doesn't seem to work.
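   
   One hedged way to get the actual live record count is a snapshot read, which resolves updates and deletes to the latest file slices instead of summing commit-metadata counters. A minimal Spark sketch, assuming an existing SparkSession `spark` and the table's base path:
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   
   public class SnapshotCount {
     // Counts the records visible in the latest committed snapshot of a Hudi table.
     public static long liveRecordCount(SparkSession spark, String basePath) {
       Dataset<Row> df = spark.read().format("hudi").load(basePath);
       return df.count();
     }
   }
   ```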


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10158:
URL: https://github.com/apache/hudi/pull/10158#issuecomment-1823873637

   
   ## CI report:
   
   * 032ad417971148eec41a5d41066b37d238ecf70a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21114)
 
   * 08794fc20eeb2736265520d170d0ee64794a842e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21120)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10158:
URL: https://github.com/apache/hudi/pull/10158#issuecomment-1823868662

   
   ## CI report:
   
   * 032ad417971148eec41a5d41066b37d238ecf70a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21114)
 
   * 08794fc20eeb2736265520d170d0ee64794a842e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7135] Spark reads hudi table error when flink creates the table without pre… [hudi]

2023-11-22 Thread via GitHub


zhangyue19921010 commented on code in PR #10157:
URL: https://github.com/apache/hudi/pull/10157#discussion_r1402949484


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieHiveCatalog.java:
##
@@ -510,6 +511,9 @@ private void initTableIfNotExists(ObjectPath tablePath, 
CatalogTable catalogTabl
 }
 
 flinkConf.setString(FlinkOptions.TABLE_NAME, tablePath.getObjectName());
+
+StreamerUtil.checkPreCombineKey(flinkConf, ((ResolvedCatalogTable) 
catalogTable).getResolvedSchema());

Review Comment:
   https://github.com/apache/hudi/assets/69956021/cf1e72d6-07ce-4c86-b1a2-3e5a07b556f5
   
   Looks like a UT failure related to this change; please take a look.
   



##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieHiveCatalog.java:
##
@@ -510,6 +511,9 @@ private void initTableIfNotExists(ObjectPath tablePath, 
CatalogTable catalogTabl
 }
 
 flinkConf.setString(FlinkOptions.TABLE_NAME, tablePath.getObjectName());
+
+StreamerUtil.checkPreCombineKey(flinkConf, ((ResolvedCatalogTable) 
catalogTable).getResolvedSchema());

Review Comment:
   Also, it would be better to modify or create related UTs that check the precombine field in hoodie.properties.
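   
   A hedged sketch of such a check (the `tablePath` variable and the expected field value "ts" are assumptions):
   
   ```java
   // Hedged sketch: assert that the precombine field was persisted into
   // hoodie.properties after the catalog created the table.
   HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
       .setConf(new org.apache.hadoop.conf.Configuration())
       .setBasePath(tablePath)
       .build();
   assertEquals("ts", metaClient.getTableConfig().getPreCombineField());
   ```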



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7086] Scaling gcs event source [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10073:
URL: https://github.com/apache/hudi/pull/10073#issuecomment-1823839975

   
   ## CI report:
   
   * 91b8b5ff8242d5fa0f01fc78ba55f70d458e58c9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)
 
   * 868ba59ecf1a08d7b73a7121429103c2134b291f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21119)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync [hudi]

2023-11-22 Thread via GitHub


nsivabalan commented on code in PR #10158:
URL: https://github.com/apache/hudi/pull/10158#discussion_r1402927461


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java:
##
@@ -801,24 +765,25 @@ private HoodieWriteConfig prepareHoodieConfigForRowWriter(Schema writerSchema) {
    *
    * @param instantTime instant time to use for ingest.
    * @param inputBatch  input batch that contains the records, checkpoint, and schema provider
-   * @param inputIsEmpty true if input batch is empty.
    * @param metrics Metrics
    * @param overallTimerContext Timer Context
    * @return Option Compaction instant if one is scheduled
    */
-  private Pair<Option<String>, JavaRDD<WriteStatus>> writeToSinkAndDoMetaSync(String instantTime, InputBatch inputBatch, boolean inputIsEmpty,
+  private Pair<Option<String>, JavaRDD<WriteStatus>> writeToSinkAndDoMetaSync(String instantTime, InputBatch inputBatch,
                                                                               HoodieIngestionMetrics metrics,
                                                                               Timer.Context overallTimerContext) {
     Option<String> scheduledCompactionInstant = Option.empty();
     // write to hudi and fetch result
-    Pair<WriteClientWriteResult, Boolean> writeClientWriteResultIsEmptyPair = writeToSink(inputBatch, instantTime, inputIsEmpty);
-    JavaRDD<WriteStatus> writeStatusRDD = writeClientWriteResultIsEmptyPair.getKey().getWriteStatusRDD();
-    Map<String, List<String>> partitionToReplacedFileIds = writeClientWriteResultIsEmptyPair.getKey().getPartitionToReplacedFileIds();
-    boolean isEmpty = writeClientWriteResultIsEmptyPair.getRight();
+    WriteClientWriteResult writeClientWriteResult = writeToSink(inputBatch, instantTime);
+    JavaRDD<WriteStatus> writeStatusRDD = writeClientWriteResult.getWriteStatusRDD();
+    Map<String, List<String>> partitionToReplacedFileIds = writeClientWriteResult.getPartitionToReplacedFileIds();
 
     // process write status
     long totalErrorRecords = writeStatusRDD.mapToDouble(WriteStatus::getTotalErrorRecords).sum().longValue();
     long totalRecords = writeStatusRDD.mapToDouble(WriteStatus::getTotalRecords).sum().longValue();
+    long totalSuccessfulRecords = totalRecords - totalErrorRecords;
+    LOG.info(String.format("instantTime=%s, totalRecords=%d, totalErrorRecords=%d, totalSuccessfulRecords=%d",

Review Comment:
   Can we have an explicit log statement for the case where no records are ingested but we still have to trigger a commit for checkpointing purposes? I guess prior to this patch we would print "No new data, perform empty commit." Maybe something similar.
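   
   A hedged sketch of what that could look like (the message text is borrowed from the pre-patch behavior; the exact placement is an assumption):
   
   ```java
   // Hedged sketch: restore an explicit message for the empty-ingest case.
   if (totalRecords == 0) {
     LOG.info("No new data, perform empty commit.");
   }
   ```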
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7086] Scaling gcs event source [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10073:
URL: https://github.com/apache/hudi/pull/10073#issuecomment-1823835454

   
   ## CI report:
   
   * 91b8b5ff8242d5fa0f01fc78ba55f70d458e58c9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)
 
   * 868ba59ecf1a08d7b73a7121429103c2134b291f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7120] Performance improvements in deltastreamer executor code path (#10135)

2023-11-22 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new b77eff2522a [HUDI-7120] Performance improvements in deltastreamer 
executor code path (#10135)
b77eff2522a is described below

commit b77eff2522a975b0c456332d20eaea6eed882774
Author: Lokesh Jain 
AuthorDate: Thu Nov 23 10:47:40 2023 +0530

[HUDI-7120] Performance improvements in deltastreamer executor code path 
(#10135)
---
 .../hudi/io/HoodieKeyLocationFetchHandle.java  |   4 +-
 .../org/apache/hudi/AvroConversionUtils.scala  |   9 +
 .../java/org/apache/hudi/avro/AvroSchemaUtils.java |  22 +-
 .../java/org/apache/hudi/avro/HoodieAvroUtils.java |  58 +++--
 .../java/org/apache/hudi/common/fs/FSUtils.java|   9 +-
 .../org/apache/hudi/TestAvroConversionUtils.scala  | 248 +++--
 6 files changed, 186 insertions(+), 164 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieKeyLocationFetchHandle.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieKeyLocationFetchHandle.java
index 135e4866cc5..ab41a94c2a9 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieKeyLocationFetchHandle.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieKeyLocationFetchHandle.java
@@ -62,9 +62,11 @@ public class HoodieKeyLocationFetchHandle<T, I, K, O> extends HoodieReadHandle<T, I, K, O> {
   public Stream<Pair<HoodieKey, HoodieRecordLocation>> locations() {
     HoodieBaseFile baseFile = partitionPathBaseFilePair.getRight();
+    String commitTime = baseFile.getCommitTime();
+    String fileId = baseFile.getFileId();
     return fetchRecordKeysWithPositions(baseFile).stream()
         .map(entry -> Pair.of(entry.getLeft(),
-            new HoodieRecordLocation(baseFile.getCommitTime(), baseFile.getFileId(), entry.getRight())));
+            new HoodieRecordLocation(commitTime, fileId, entry.getRight())));
   }
 
   public Stream<Pair<HoodieKey, HoodieRecordGlobalLocation>> globalLocations() {
diff --git 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
index 818bf760047..d84679eaf92 100644
--- 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
+++ 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
@@ -18,6 +18,7 @@
 
 package org.apache.hudi
 
+import org.apache.avro.Schema.Type
 import org.apache.avro.generic.GenericRecord
 import org.apache.avro.{JsonProperties, Schema}
 import org.apache.hudi.HoodieSparkUtils.sparkAdapter
@@ -242,4 +243,12 @@ object AvroConversionUtils {
 val nameParts = qualifiedName.split('.')
 (nameParts.last, nameParts.init.mkString("."))
   }
+
+  private def handleUnion(schema: Schema): Schema = {
+if (schema.getType == Type.UNION) {
+  val index = if (schema.getTypes.get(0).getType == Schema.Type.NULL) 1 
else 0
+  return schema.getTypes.get(index)
+}
+schema
+  }
 }
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java 
b/hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java
index fcfc8a4f0b9..3c5486c47c7 100644
--- a/hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java
@@ -249,6 +249,11 @@ public class AvroSchemaUtils {
 }
 
     List<Schema> innerTypes = schema.getTypes();
+    if (innerTypes.size() == 2 && isNullable(schema)) {
+      // this is a basic nullable field so handle it more efficiently
+      return resolveNullableSchema(schema);
+    }
+
     Schema nonNullType =
         innerTypes.stream()
             .filter(it -> it.getType() != Schema.Type.NULL && Objects.equals(it.getFullName(), fieldSchemaFullName))
@@ -286,18 +291,19 @@ public class AvroSchemaUtils {
 }
 
     List<Schema> innerTypes = schema.getTypes();
-    Schema nonNullType =
-        innerTypes.stream()
-            .filter(it -> it.getType() != Schema.Type.NULL)
-            .findFirst()
-            .orElse(null);
 
-    if (innerTypes.size() != 2 || nonNullType == null) {
+    if (innerTypes.size() != 2) {
       throw new AvroRuntimeException(
           String.format("Unsupported Avro UNION type %s: Only UNION of a null type and a non-null type is supported", schema));
     }
-
-    return nonNullType;
+    Schema firstInnerType = innerTypes.get(0);
+    Schema secondInnerType = innerTypes.get(1);
+    if ((firstInnerType.getType() != Schema.Type.NULL && secondInnerType.getType() != Schema.Type.NULL)
+        || (firstInnerType.getType() == Schema.Type.NULL && secondInnerType.getType() == Schema.Type.NULL)) {
+      throw new AvroRuntimeException(
+          String.format("Unsupported Avro UNION type %s: Only UNION of a null type and a non-null type is supported",
type and a non-null type is supported", 

Re: [PR] [HUDI-7120] Performance improvements in deltastreamer executor code path [hudi]

2023-11-22 Thread via GitHub


nsivabalan merged PR #10135:
URL: https://github.com/apache/hudi/pull/10135


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [MINOR] Making misc fixes to deltastreamer sources(S3 and GCS) (#10095)

2023-11-22 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 405be173664 [MINOR] Making misc fixes to deltastreamer sources(S3 and 
GCS) (#10095)
405be173664 is described below

commit 405be173664b724ca941194136a5b5dcff4bb598
Author: Sivabalan Narayanan 
AuthorDate: Wed Nov 22 21:00:33 2023 -0800

[MINOR] Making misc fixes to deltastreamer sources(S3 and GCS) (#10095)

* Making misc fixes to deltastreamer sources

* Fixing test failures

* adding inference to CloudSourceconfig... cloud.data.datafile.format

* Fix the tests for s3 events source

* Fix the tests for s3 events source

-

Co-authored-by: rmahindra123 
---
 .../main/java/org/apache/hudi/common/util/StringUtils.java| 10 ++
 .../java/org/apache/hudi/common/util/TestStringUtils.java |  7 +++
 .../org/apache/hudi/utilities/config/CloudSourceConfig.java   |  2 +-
 .../apache/hudi/utilities/schema/SchemaRegistryProvider.java  | 11 +--
 .../hudi/utilities/sources/S3EventsHoodieIncrSource.java  | 11 ++-
 5 files changed, 37 insertions(+), 4 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/util/StringUtils.java 
b/hudi-common/src/main/java/org/apache/hudi/common/util/StringUtils.java
index d7d79796aec..5b95bc60312 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/util/StringUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/util/StringUtils.java
@@ -173,4 +173,14 @@ public class StringUtils {
 }
 return input.substring(0, i);
   }
+
+  public static String truncate(String str, int headLength, int tailLength) {
+    if (isNullOrEmpty(str) || str.length() <= headLength + tailLength) {
+      return str;
+    }
+    String head = str.substring(0, headLength);
+    String tail = str.substring(str.length() - tailLength);
+
+    return head + "..." + tail;
+  }
 }
diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java 
b/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java
index 3bdf6d48b39..54985056bf0 100644
--- a/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java
+++ b/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java
@@ -114,4 +114,11 @@ public class TestStringUtils {
 }
 return sb.toString();
   }
+
+  @Test
+  public void testTruncate() {
+    assertNull(StringUtils.truncate(null, 10, 10));
+    assertEquals("http://use...ons/latest", StringUtils.truncate("http://username:passw...@myregistry.com:5000/versions/latest", 10, 10));
+    assertEquals("http://abc.com", StringUtils.truncate("http://abc.com", 10, 10));
+  }
 }
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/config/CloudSourceConfig.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/config/CloudSourceConfig.java
index e7b44cf9121..007d36fc704 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/config/CloudSourceConfig.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/config/CloudSourceConfig.java
@@ -108,7 +108,7 @@ public class CloudSourceConfig extends HoodieConfig {
 
   public static final ConfigProperty<String> DATAFILE_FORMAT = ConfigProperty
       .key(STREAMER_CONFIG_PREFIX + "source.cloud.data.datafile.format")
-      .defaultValue("parquet")
+      .defaultValue(HoodieIncrSourceConfig.SOURCE_FILE_FORMAT.defaultValue())
       .withAlternatives(DELTA_STREAMER_CONFIG_PREFIX + "source.cloud.data.datafile.format")
       .markAdvanced()
       .withDocumentation("Format of the data file. By default, this will be the same as hoodie.streamer.source.hoodieincr.file.format");
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/SchemaRegistryProvider.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/SchemaRegistryProvider.java
index 780fbb9dc0a..110c8cc2fb1 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/SchemaRegistryProvider.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/SchemaRegistryProvider.java
@@ -195,7 +195,10 @@ public class SchemaRegistryProvider extends SchemaProvider 
{
     try {
       return parseSchemaFromRegistry(registryUrl);
     } catch (Exception e) {
-      throw new HoodieSchemaFetchException("Error reading source schema from registry :" + registryUrl, e);
+      throw new HoodieSchemaFetchException(String.format(
+          "Error reading source schema from registry. Please check %s is configured correctly. Truncated URL: %s",
+          Config.SRC_SCHEMA_REGISTRY_URL_PROP,
+          StringUtils.truncate(registryUrl, 10, 10)), e);
     }
   }
 
@@ -207,7 +210,11 @@ public class SchemaRegistryProvider extends SchemaProvider 
{
 try {
 

Re: [PR] [MINOR] Making misc fixes to deltastreamer sources(S3 and GCS) [hudi]

2023-11-22 Thread via GitHub


codope merged PR #10095:
URL: https://github.com/apache/hudi/pull/10095


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]

2023-11-22 Thread via GitHub


danny0405 commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1823820723

   Thanks for the contribution, I have reviewed and created a patch:
   
   
[7041.patch.zip](https://github.com/apache/hudi/files/13446123/7041.patch.zip)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated (72ff9a7f0c9 -> 3d212853724)

2023-11-22 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 72ff9a7f0c9 [HUDI-7052] Fix partition key validation for custom key 
generators. (#10014)
 add 3d212853724 [HUDI-7112] Reuse existing timeline server and performance 
improvements (#10122)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/client/BaseHoodieClient.java   |   2 +-
 .../embedded/EmbeddedTimelineServerHelper.java |  38 +
 .../client/embedded/EmbeddedTimelineService.java   | 172 +--
 .../org/apache/hudi/config/HoodieWriteConfig.java  |   4 +-
 .../marker/TimelineServerBasedWriteMarkers.java|  13 +-
 .../org/apache/hudi/util/HttpRequestClient.java|  12 +-
 .../embedded/TestEmbeddedTimelineService.java  | 189 +
 .../client/TestHoodieJavaWriteClientInsert.java|   6 +-
 .../hudi/client/TestHoodieClientMultiWriter.java   |  35 +++-
 .../hudi/client/TestSparkRDDWriteClient.java   |   6 +-
 .../TestRemoteFileSystemViewWithMetadataTable.java |  42 +++--
 hudi-common/pom.xml|   4 +
 .../hudi/common/table/timeline/dto/DTOUtils.java   |   4 +-
 .../view/RemoteHoodieTableFileSystemView.java  |  70 
 .../org/apache/hudi/sink/TestWriteCopyOnWrite.java |  89 ++
 .../hudi/sink/TestWriteMergeOnReadWithCompact.java |   8 +
 .../org/apache/hudi/sink/utils/TestWriteBase.java  |   6 +-
 .../org/apache/hudi/HoodieSparkSqlWriter.scala |   1 +
 .../hudi/timeline/service/RequestHandler.java  |   5 +-
 .../hudi/timeline/service/TimelineService.java |   8 +-
 .../timeline/service/handlers/BaseFileHandler.java |  11 +-
 .../service/handlers/marker/MarkerDirState.java|   3 +-
 .../apache/hudi/utilities/streamer/StreamSync.java |   2 +-
 .../deltastreamer/TestHoodieDeltaStreamer.java |   1 -
 pom.xml|   8 +
 25 files changed, 566 insertions(+), 173 deletions(-)
 create mode 100644 
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/client/embedded/TestEmbeddedTimelineService.java



Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]

2023-11-22 Thread via GitHub


nsivabalan merged PR #10122:
URL: https://github.com/apache/hudi/pull/10122


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]

2023-11-22 Thread via GitHub


nsivabalan commented on PR #10122:
URL: https://github.com/apache/hudi/pull/10122#issuecomment-1823818490

   https://github.com/apache/hudi/assets/513218/43a50fef-afef-4a80-b54a-75d5fe1260d3
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7052] Fix partition key validation for custom key generators. (#10014)

2023-11-22 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 72ff9a7f0c9 [HUDI-7052] Fix partition key validation for custom key 
generators. (#10014)
72ff9a7f0c9 is described below

commit 72ff9a7f0c9a7da12810669ca0111761ee7adcfe
Author: Rajesh Mahindra <76502047+rmahindra...@users.noreply.github.com>
AuthorDate: Wed Nov 22 20:49:15 2023 -0800

[HUDI-7052] Fix partition key validation for custom key generators. (#10014)


-

Co-authored-by: rmahindra123 
---
 .../AutoRecordGenWrapperAvroKeyGenerator.java  | 27 +---
 .../hudi/keygen/AutoRecordKeyGeneratorWrapper.java | 32 +++
 .../keygen/AutoRecordGenWrapperKeyGenerator.java   | 48 ++
 .../org/apache/hudi/util/SparkKeyGenUtils.scala| 31 --
 .../org/apache/hudi/HoodieSparkSqlWriter.scala |  4 +-
 .../scala/org/apache/hudi/HoodieWriterUtils.scala  |  5 ++-
 .../org/apache/hudi/TestHoodieSparkSqlWriter.scala |  2 +-
 .../apache/hudi/functional/TestCOWDataSource.scala |  3 +-
 .../deltastreamer/TestHoodieDeltaStreamer.java |  6 +--
 9 files changed, 112 insertions(+), 46 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/AutoRecordGenWrapperAvroKeyGenerator.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/AutoRecordGenWrapperAvroKeyGenerator.java
index a8ae48e1d67..8431180a2fe 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/AutoRecordGenWrapperAvroKeyGenerator.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/AutoRecordGenWrapperAvroKeyGenerator.java
@@ -43,24 +43,24 @@ import java.util.List;
  * PartitionId refers to spark's partition Id.
  * RowId refers to the row index within the spark partition.
  */
-public class AutoRecordGenWrapperAvroKeyGenerator extends BaseKeyGenerator {
+public class AutoRecordGenWrapperAvroKeyGenerator extends BaseKeyGenerator implements AutoRecordKeyGeneratorWrapper {
 
   private final BaseKeyGenerator keyGenerator;
-  private final int partitionId;
-  private final String instantTime;
+  private Integer partitionId;
+  private String instantTime;
   private int rowId;
 
   public AutoRecordGenWrapperAvroKeyGenerator(TypedProperties config, BaseKeyGenerator keyGenerator) {
     super(config);
     this.keyGenerator = keyGenerator;
     this.rowId = 0;
-    this.partitionId = config.getInteger(KeyGenUtils.RECORD_KEY_GEN_PARTITION_ID_CONFIG);
-    this.instantTime = config.getString(KeyGenUtils.RECORD_KEY_GEN_INSTANT_TIME_CONFIG);
+    partitionId = null;
+    instantTime = null;
   }
 
   @Override
   public String getRecordKey(GenericRecord record) {
-    return HoodieRecord.generateSequenceId(instantTime, partitionId, rowId++);
+    return generateSequenceId(rowId++);
   }
 
   @Override
@@ -80,4 +80,19 @@ public class AutoRecordGenWrapperAvroKeyGenerator extends BaseKeyGenerator {
   public boolean isConsistentLogicalTimestampEnabled() {
     return keyGenerator.isConsistentLogicalTimestampEnabled();
   }
+
+  @Override
+  public BaseKeyGenerator getPartitionKeyGenerator() {
+    return keyGenerator;
+  }
+
+  private String generateSequenceId(long recordIndex) {
+    if (partitionId == null) {
+      this.partitionId = config.getInteger(KeyGenUtils.RECORD_KEY_GEN_PARTITION_ID_CONFIG);
+    }
+    if (instantTime == null) {
+      this.instantTime = config.getString(KeyGenUtils.RECORD_KEY_GEN_INSTANT_TIME_CONFIG);
+    }
+    return HoodieRecord.generateSequenceId(instantTime, partitionId, recordIndex);
+  }
 }
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/AutoRecordKeyGeneratorWrapper.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/AutoRecordKeyGeneratorWrapper.java
new file mode 100644
index 000..e136bc89cbb
--- /dev/null
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/AutoRecordKeyGeneratorWrapper.java
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and 

Re: [PR] [HUDI-7052] Fix partition key validation for custom key generators. [hudi]

2023-11-22 Thread via GitHub


nsivabalan merged PR #10014:
URL: https://github.com/apache/hudi/pull/10014


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7052] Fix partition key validation for custom key generators. [hudi]

2023-11-22 Thread via GitHub


nsivabalan commented on PR #10014:
URL: https://github.com/apache/hudi/pull/10014#issuecomment-1823817928

   https://github.com/apache/hudi/assets/513218/f0efc544-a78a-4ee3-bed7-f403aea335fb
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10158:
URL: https://github.com/apache/hudi/pull/10158#issuecomment-1823807872

   
   ## CI report:
   
   * c8c49d513c8b91b2ff8462f6db25203ba563d39a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21102)
 
   * 032ad417971148eec41a5d41066b37d238ecf70a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21114)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10158:
URL: https://github.com/apache/hudi/pull/10158#issuecomment-1823804491

   
   ## CI report:
   
   * c8c49d513c8b91b2ff8462f6db25203ba563d39a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21102)
 
   * 032ad417971148eec41a5d41066b37d238ecf70a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Spark job stuck after completion, due to some non daemon threads still running [hudi]

2023-11-22 Thread via GitHub


zyclove commented on issue #9826:
URL: https://github.com/apache/hudi/issues/9826#issuecomment-1823781426

   Hi, this issue occurs frequently; has it been resolved? https://issues.apache.org/jira/browse/HUDI-6980 is still not closed.
   When will version 0.14.1 be released? There is an urgent need to upgrade, to pick up this and other fixes.
   
   
   @ad1happy2go @pravin1406 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7086] Scaling gcs event source [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10073:
URL: https://github.com/apache/hudi/pull/10073#issuecomment-1823776282

   
   ## CI report:
   
   * 91b8b5ff8242d5fa0f01fc78ba55f70d458e58c9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] Cannot encode decimal with precision 15 as max precision 14 [hudi]

2023-11-22 Thread via GitHub


njalan closed issue #10160:  Cannot encode decimal with precision 15 as max 
precision 14
URL: https://github.com/apache/hudi/issues/10160


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7110] Add call procedure for show column stats information [hudi]

2023-11-22 Thread via GitHub


majian1998 commented on code in PR #10120:
URL: https://github.com/apache/hudi/pull/10120#discussion_r1402861158


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowMetadataTableColumnStatsProcedure.scala:
##
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model._
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.FileSlice
+import org.apache.hudi.common.table.timeline.{HoodieDefaultTimeline, 
HoodieInstant}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.metadata.HoodieTableMetadata
+import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, 
StructType}
+
+import java.util
+import java.util.function.{Function, Supplier}
+import scala.collection.{JavaConversions, mutable}
+import scala.jdk.CollectionConverters.{asScalaBufferConverter, 
asScalaIteratorConverter}
+
+
+class ShowMetadataTableColumnStatsProcedure extends BaseProcedure with 
ProcedureBuilder with Logging {
+  private val PARAMETERS = Array[ProcedureParameter](
+ProcedureParameter.required(0, "table", DataTypes.StringType),
+ProcedureParameter.optional(1, "partition", DataTypes.StringType),
+ProcedureParameter.optional(2, "targetColumns", DataTypes.StringType)
+  )
+
+  private val OUTPUT_TYPE = new StructType(Array[StructField](
+StructField("file_name", DataTypes.StringType, nullable = true, 
Metadata.empty),
+StructField("column_name", DataTypes.StringType, nullable = true, 
Metadata.empty),
+StructField("min_value", DataTypes.StringType, nullable = true, 
Metadata.empty),
+StructField("max_value", DataTypes.StringType, nullable = true, 
Metadata.empty),
+StructField("null_num", DataTypes.LongType, nullable = true, 
Metadata.empty)
+  ))
+
+  def parameters: Array[ProcedureParameter] = PARAMETERS
+
+  def outputType: StructType = OUTPUT_TYPE
+
+  override def call(args: ProcedureArgs): Seq[Row] = {
+super.checkArgs(PARAMETERS, args)
+
+val table = getArgValueOrDefault(args, PARAMETERS(0))
+val partitions = getArgValueOrDefault(args, 
PARAMETERS(1)).getOrElse("").toString
+val partitionsSeq = partitions.split(",").filter(_.nonEmpty).toSeq
+
+val targetColumns = getArgValueOrDefault(args, 
PARAMETERS(2)).getOrElse("").toString
+val targetColumnsSeq = targetColumns.split(",").toSeq
+val basePath = getBasePath(table)
+val metadataConfig = HoodieMetadataConfig.newBuilder
+  .enable(true)
+  .build
+val metaClient = 
HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build
+val schemaUtil = new TableSchemaResolver(metaClient)
+val schema = 
AvroConversionUtils.convertAvroSchemaToStructType(schemaUtil.getTableAvroSchema)
+val columnStatsIndex = new ColumnStatsIndexSupport(spark, schema, 
metadataConfig, metaClient)
+val colStatsRecords: HoodieData[HoodieMetadataColumnStats] = 
columnStatsIndex.loadColumnStatsIndexRecords(targetColumnsSeq, 
shouldReadInMemory = false)
+val fsView = buildFileSystemView(table)
+val allFileSlices: Set[FileSlice] = {
+  if (partitionsSeq.isEmpty) {
+val engineCtx = new HoodieSparkEngineContext(jsc)
+val metaTable = HoodieTableMetadata.create(engineCtx, metadataConfig, 
basePath)
+metaTable.getAllPartitionPaths
+  .asScala
+  .flatMap(path => fsView.getLatestFileSlices(path).iterator().asScala)
+  .toSet
+  } else {
+partitionsSeq
+  

Re: [PR] [HUDI-7110] Add call procedure for show column stats information [hudi]

2023-11-22 Thread via GitHub


stream2000 commented on code in PR #10120:
URL: https://github.com/apache/hudi/pull/10120#discussion_r1402852595


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowMetadataTableColumnStatsProcedure.scala:
##
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model._
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.FileSlice
+import org.apache.hudi.common.table.timeline.{HoodieDefaultTimeline, HoodieInstant}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.metadata.HoodieTableMetadata
+import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
+
+import java.util
+import java.util.function.{Function, Supplier}
+import scala.collection.{JavaConversions, mutable}
+import scala.jdk.CollectionConverters.{asScalaBufferConverter, asScalaIteratorConverter}
+
+
+class ShowMetadataTableColumnStatsProcedure extends BaseProcedure with ProcedureBuilder with Logging {
+  private val PARAMETERS = Array[ProcedureParameter](
+    ProcedureParameter.required(0, "table", DataTypes.StringType),
+    ProcedureParameter.optional(1, "partition", DataTypes.StringType),
+    ProcedureParameter.optional(2, "targetColumns", DataTypes.StringType)
+  )
+
+  private val OUTPUT_TYPE = new StructType(Array[StructField](
+    StructField("file_name", DataTypes.StringType, nullable = true, Metadata.empty),
+    StructField("column_name", DataTypes.StringType, nullable = true, Metadata.empty),
+    StructField("min_value", DataTypes.StringType, nullable = true, Metadata.empty),
+    StructField("max_value", DataTypes.StringType, nullable = true, Metadata.empty),
+    StructField("null_num", DataTypes.LongType, nullable = true, Metadata.empty)
+  ))
+
+  def parameters: Array[ProcedureParameter] = PARAMETERS
+
+  def outputType: StructType = OUTPUT_TYPE
+
+  override def call(args: ProcedureArgs): Seq[Row] = {
+    super.checkArgs(PARAMETERS, args)
+
+    val table = getArgValueOrDefault(args, PARAMETERS(0))
+    val partitions = getArgValueOrDefault(args, PARAMETERS(1)).getOrElse("").toString
+    val partitionsSeq = partitions.split(",").filter(_.nonEmpty).toSeq
+
+    val targetColumns = getArgValueOrDefault(args, PARAMETERS(2)).getOrElse("").toString
+    val targetColumnsSeq = targetColumns.split(",").toSeq
+    val basePath = getBasePath(table)
+    val metadataConfig = HoodieMetadataConfig.newBuilder
+      .enable(true)
+      .build
+    val metaClient = HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build
+    val schemaUtil = new TableSchemaResolver(metaClient)
+    val schema = AvroConversionUtils.convertAvroSchemaToStructType(schemaUtil.getTableAvroSchema)
+    val columnStatsIndex = new ColumnStatsIndexSupport(spark, schema, metadataConfig, metaClient)

Review Comment:
   We should use `org.apache.hudi.metadata.BaseTableMetadata#getColumnStats` to load column stats instead of calling columnStatsIndex directly. 
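
   A minimal sketch of what that could look like (untested; the `HoodieTableMetadata.create` call mirrors the one already in this PR, and the `getColumnStats` signature is assumed from current master — the derivation of the (partitionPath, fileName) pairs from the file-system view is left as a parameter here to keep the sketch self-contained):

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Map;

   import org.apache.hudi.avro.model.HoodieMetadataColumnStats;
   import org.apache.hudi.common.config.HoodieMetadataConfig;
   import org.apache.hudi.common.engine.HoodieEngineContext;
   import org.apache.hudi.common.util.collection.Pair;
   import org.apache.hudi.metadata.HoodieTableMetadata;

   public class ColumnStatsViaMetadataTable {

     // Untested sketch: load column stats through the metadata table API
     // (BaseTableMetadata#getColumnStats) instead of ColumnStatsIndexSupport.
     static List<HoodieMetadataColumnStats> loadStats(
         HoodieEngineContext engineContext,
         HoodieMetadataConfig metadataConfig,
         String basePath,
         List<Pair<String, String>> partitionFileNamePairs, // (partitionPath, fileName)
         List<String> targetColumns) {
       HoodieTableMetadata metadata =
           HoodieTableMetadata.create(engineContext, metadataConfig, basePath);
       List<HoodieMetadataColumnStats> result = new ArrayList<>();
       for (String column : targetColumns) {
         // One lookup per target column; keys of the returned map are the
         // (partitionPath, fileName) pairs passed in.
         Map<Pair<String, String>, HoodieMetadataColumnStats> stats =
             metadata.getColumnStats(partitionFileNamePairs, column);
         result.addAll(stats.values());
       }
       return result;
     }
   }
   ```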



##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowMetadataTableColumnStatsProcedure.scala:
##
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, 

[jira] [Closed] (HUDI-7110) Add call procedure for show column stats information

2023-11-22 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7110.

Resolution: Fixed

Fixed via master branch: 8d6d04387753662a5bb41f35874c6bbdd7021b36

> Add call procedure for show column stats information
> 
>
> Key: HUDI-7110
> URL: https://issues.apache.org/jira/browse/HUDI-7110
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ma Jian
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> This feature introduces a call procedure that allows users to specify the 
> table name and column names to retrieve column stats information from the 
> metadata table. This functionality facilitates the observation of data 
> distribution status and assists in data skipping.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7110) Add call procedure for show column stats information

2023-11-22 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7110:
-
Fix Version/s: 1.0.0

> Add call procedure for show column stats information
> 
>
> Key: HUDI-7110
> URL: https://issues.apache.org/jira/browse/HUDI-7110
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ma Jian
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> This feature introduces a call procedure that allows users to specify the 
> table name and column names to retrieve column stats information from the 
> metadata table. This functionality facilitates the observation of data 
> distribution status and assists in data skipping.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7110] Add call procedure for show column stats information [hudi]

2023-11-22 Thread via GitHub


danny0405 merged PR #10120:
URL: https://github.com/apache/hudi/pull/10120


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7110] Add call procedure for show column stats information (#10120)

2023-11-22 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 8d6d0438775 [HUDI-7110] Add call procedure for show column stats 
information (#10120)
8d6d0438775 is described below

commit 8d6d04387753662a5bb41f35874c6bbdd7021b36
Author: majian <47964462+majian1...@users.noreply.github.com>
AuthorDate: Thu Nov 23 10:08:17 2023 +0800

[HUDI-7110] Add call procedure for show column stats information (#10120)
---
 .../org/apache/hudi/ColumnStatsIndexSupport.scala  |   2 +-
 .../hudi/command/procedures/HoodieProcedures.scala |   1 +
 .../ShowMetadataTableColumnStatsProcedure.scala| 169 +
 .../sql/hudi/procedure/TestMetadataProcedure.scala |  66 
 4 files changed, 237 insertions(+), 1 deletion(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala
index dd76aee2f18..9cdb15092b0 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala
@@ -309,7 +309,7 @@ class ColumnStatsIndexSupport(spark: SparkSession,
 colStatsDF.select(targetColumnStatsIndexColumns.map(col): _*)
   }
 
-  private def loadColumnStatsIndexRecords(targetColumns: Seq[String], 
shouldReadInMemory: Boolean): HoodieData[HoodieMetadataColumnStats] = {
+  def loadColumnStatsIndexRecords(targetColumns: Seq[String], 
shouldReadInMemory: Boolean): HoodieData[HoodieMetadataColumnStats] = {
 // Read Metadata Table's Column Stats Index records into [[HoodieData]] 
container by
 //- Fetching the records from CSI by key-prefixes (encoded column 
names)
 //- Extracting [[HoodieMetadataColumnStats]] records
diff --git 
a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
index ad63ddbb29e..1a960ecb8fd 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
@@ -66,6 +66,7 @@ object HoodieProcedures {
   ,(ShowBootstrapPartitionsProcedure.NAME, 
ShowBootstrapPartitionsProcedure.builder)
   ,(UpgradeTableProcedure.NAME, UpgradeTableProcedure.builder)
   ,(DowngradeTableProcedure.NAME, DowngradeTableProcedure.builder)
+  ,(ShowMetadataTableColumnStatsProcedure.NAME, 
ShowMetadataTableColumnStatsProcedure.builder)
   ,(ShowMetadataTableFilesProcedure.NAME, 
ShowMetadataTableFilesProcedure.builder)
   ,(ShowMetadataTablePartitionsProcedure.NAME, 
ShowMetadataTablePartitionsProcedure.builder)
   ,(CreateMetadataTableProcedure.NAME, 
CreateMetadataTableProcedure.builder)
diff --git 
a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowMetadataTableColumnStatsProcedure.scala
 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowMetadataTableColumnStatsProcedure.scala
new file mode 100644
index 000..60aa0f054b9
--- /dev/null
+++ 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowMetadataTableColumnStatsProcedure.scala
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model._
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.fs.FSUtils
+import 

Re: [I] Cannot encode decimal with precision 15 as max precision 14 [hudi]

2023-11-22 Thread via GitHub


njalan commented on issue #10160:
URL: https://github.com/apache/hudi/issues/10160#issuecomment-1823733255

   @ad1happy2go It is already merged in 0.13.1.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Making misc fixes to deltastreamer sources(S3 and GCS) [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10095:
URL: https://github.com/apache/hudi/pull/10095#issuecomment-1823714576

   
   ## CI report:
   
   * c1b5bd41ac1f4be476fb69f84f7197a27733eb23 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21110)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [MINOR] Remove unused import (#10159)

2023-11-22 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new aabaa9947fc [MINOR] Remove unused import (#10159)
aabaa9947fc is described below

commit aabaa9947fc0e6a72ed221f0889cad27423f8127
Author: huangxiaoping <1754789...@qq.com>
AuthorDate: Thu Nov 23 09:06:45 2023 +0800

[MINOR] Remove unused import (#10159)
---
 .../org/apache/hudi/HoodiePartitionCDCFileGroupMapping.scala   |  5 -
 .../src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala  | 10 --
 .../scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala |  1 -
 3 files changed, 4 insertions(+), 12 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodiePartitionCDCFileGroupMapping.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodiePartitionCDCFileGroupMapping.scala
index bd052c086ff..6ff9bd036e8 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodiePartitionCDCFileGroupMapping.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodiePartitionCDCFileGroupMapping.scala
@@ -22,11 +22,6 @@ package org.apache.hudi
 import org.apache.hudi.common.model.HoodieFileGroupId
 import org.apache.hudi.common.table.cdc.HoodieCDCFileSplit
 import org.apache.spark.sql.catalyst.InternalRow
-import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
-import org.apache.spark.sql.types.{DataType, Decimal}
-import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String}
-
-import java.util
 
 class HoodiePartitionCDCFileGroupMapping(partitionValues: InternalRow,
  fileGroups: Map[HoodieFileGroupId, 
List[HoodieCDCFileSplit]]
diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index cbde026adeb..01a73cd0816 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -27,11 +27,10 @@ import 
org.apache.hudi.DataSourceOptionsHelper.fetchMissingWriteConfigsFromTable
 import 
org.apache.hudi.DataSourceUtils.tryOverrideParquetWriteLegacyFormatProperty
 import org.apache.hudi.DataSourceWriteOptions._
 import org.apache.hudi.HoodieConversionUtils.{toProperties, toScalaOption}
-import org.apache.hudi.HoodieSparkSqlWriter.{CANONICALIZE_SCHEMA, 
SQL_MERGE_INTO_WRITES, StreamingWriteParams}
+import org.apache.hudi.HoodieSparkSqlWriter.StreamingWriteParams
 import org.apache.hudi.HoodieWriterUtils._
-import org.apache.hudi.avro.AvroSchemaUtils.{isCompatibleProjectionOf, 
isSchemaCompatible, isValidEvolutionOf, resolveNullableSchema}
+import org.apache.hudi.avro.AvroSchemaUtils.resolveNullableSchema
 import org.apache.hudi.avro.HoodieAvroUtils
-import org.apache.hudi.avro.HoodieAvroUtils.removeMetadataFields
 import org.apache.hudi.client.common.HoodieSparkEngineContext
 import org.apache.hudi.client.{HoodieWriteResult, SparkRDDWriteClient}
 import org.apache.hudi.commit.{DatasetBulkInsertCommitActionExecutor, 
DatasetBulkInsertOverwriteCommitActionExecutor, 
DatasetBulkInsertOverwriteTableCommitActionExecutor}
@@ -49,12 +48,11 @@ import org.apache.hudi.common.util.{CommitUtils, 
StringUtils, Option => HOption}
 import org.apache.hudi.config.HoodieBootstrapConfig.{BASE_PATH, 
INDEX_CLASS_NAME}
 import 
org.apache.hudi.config.HoodieWriteConfig.SPARK_SQL_MERGE_INTO_PREPPED_KEY
 import org.apache.hudi.config.{HoodieCompactionConfig, HoodieInternalConfig, 
HoodieWriteConfig}
-import org.apache.hudi.exception.{HoodieException, 
HoodieWriteConflictException, SchemaCompatibilityException}
+import org.apache.hudi.exception.{HoodieException, 
HoodieWriteConflictException}
 import org.apache.hudi.hive.{HiveSyncConfigHolder, HiveSyncTool}
 import org.apache.hudi.internal.schema.InternalSchema
 import org.apache.hudi.internal.schema.convert.AvroInternalSchemaConverter
-import 
org.apache.hudi.internal.schema.utils.AvroSchemaEvolutionUtils.reconcileSchemaRequirements
-import org.apache.hudi.internal.schema.utils.{AvroSchemaEvolutionUtils, 
SerDeHelper}
+import org.apache.hudi.internal.schema.utils.SerDeHelper
 import org.apache.hudi.keygen.constant.KeyGeneratorType
 import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
 import 
org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.getKeyGeneratorClassName
diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala
index 

Re: [PR] [MINOR] Remove unused import [hudi]

2023-11-22 Thread via GitHub


leesf merged PR #10159:
URL: https://github.com/apache/hudi/pull/10159


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Async Clustering: Seeking Help on Specific Partitioning and Regex Pattern [hudi]

2023-11-22 Thread via GitHub


soumilshah1995 opened a new issue, #10165:
URL: https://github.com/apache/hudi/issues/10165

   Subject : Async Clustering: Seeking Help on Specific Partitioning and Regex 
Pattern
   
   
   
   I'm currently exploring async clustering in Apache Hudi, and this is also 
intended for a community video. I've successfully executed async clustering, 
but I have a question regarding how to cluster specific partitions or 
partitions based on a regex pattern.
   After reviewing the documentation at 
https://hudi.apache.org/docs/next/clustering/, I came across the following 
setting
   
   hoodie.clustering.plan.strategy.partition.selected N/A (Required) 
Comma-separated list of
   I attempted to use this, but when I run the "show clustering" command, I 
observe the following result:
   
   ```
    +-------------------+------------------+-----------+---------------------+
    | timestamp         | input_group_size | state     | involved_partitions |
    +-------------------+------------------+-----------+---------------------+
    | 20231122190057844 | 1                | COMPLETED | *                   |
    +-------------------+------------------+-----------+---------------------+
   ```
   
   
   As you can see, the involved partition shows '*', and I'm perplexed as to 
why it applied to all partitions rather than specific ones.
   Here's my Spark submit command:
   
   
    ```
    spark-submit \
      --class org.apache.hudi.utilities.HoodieClusteringJob \
      --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0 \
      --properties-file spark-config.properties \
      --master 'local[*]' \
      --executor-memory 1g \
      jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
      --mode scheduleAndExecute \
      --base-path file:///Users/soumilnitinshah/Downloads/hudidb/silver/ \
      --table-name orders \
      --hoodie-conf hoodie.clustering.async.enabled=true \
      --hoodie-conf hoodie.clustering.async.max.commits=2 \
      --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824 \
      --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=629145600 \
      --hoodie-conf hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy \
      --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=order_date \
      --hoodie-conf hoodie.clustering.plan.strategy.partition.selected=2023-10-23 \
      --hoodie-conf hoodie.write.concurrency.mode=optimistic_concurrency_control \
      --hoodie-conf hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
    ```
   Additionally, I'm curious about how to cluster partitions based on a regex 
pattern using the setting:
   hoodie.clustering.plan.strategy.partition.regex.pattern
   
   Is it correct to set the value to "2023-10-[0-9]" to match all partitions 
based on this pattern? I appreciate any insights or guidance on this matter.
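   
   For reference, a minimal sketch of both selection knobs set programmatically; the key names come from `HoodieClusteringConfig`, but whether they take effect for a given plan strategy in 0.14.0 is exactly what I'm unsure about (untested; set one or the other, not both):
   
   ```java
   import org.apache.hudi.common.config.TypedProperties;

   public class ClusteringPartitionSelection {

     // Untested sketch: partition-selection configs from HoodieClusteringConfig.
     static TypedProperties selectionProps() {
       TypedProperties props = new TypedProperties();
       // Explicit comma-separated partition list.
       props.setProperty("hoodie.clustering.plan.strategy.partition.selected", "2023-10-23");
       // Or a regex over partition paths (hypothetical pattern).
       props.setProperty("hoodie.clustering.plan.strategy.partition.regex.pattern", "2023-10-[0-9]*");
       return props;
     }
   }
   ```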


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7086] Scaling gcs event source [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10073:
URL: https://github.com/apache/hudi/pull/10073#issuecomment-1823692854

   
   ## CI report:
   
   * 48df6bbec2473dbbbedb1b723896acb17056e80f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21076)
 
   * 91b8b5ff8242d5fa0f01fc78ba55f70d458e58c9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7086] Scaling gcs event source [hudi]

2023-11-22 Thread via GitHub


rmahindra123 commented on PR #10073:
URL: https://github.com/apache/hudi/pull/10073#issuecomment-1823692075

   Approved 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7086] Scaling gcs event source [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10073:
URL: https://github.com/apache/hudi/pull/10073#issuecomment-1823688411

   
   ## CI report:
   
   * 48df6bbec2473dbbbedb1b723896acb17056e80f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21076)
 
   * 91b8b5ff8242d5fa0f01fc78ba55f70d458e58c9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [MINOR] update disaster recovery docs [hudi]

2023-11-22 Thread via GitHub


sagarlakshmipathy opened a new pull request, #10164:
URL: https://github.com/apache/hudi/pull/10164

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   > added a for loop to avoid copy-pasting code
   > added a note to make sure users replace the commit and savepoint timestamps
   > made markdown edits
   > fixed indentation
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   > documentation change
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   > low risk - documentation change
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   > documentation change
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable - local `npm run build` and 
`npm run serve` passed
   - [NA] CI passed - CI runs after raising PR
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Asf site update disaster recovery doc [hudi]

2023-11-22 Thread via GitHub


sagarlakshmipathy closed pull request #10163: Asf site update disaster recovery 
doc
URL: https://github.com/apache/hudi/pull/10163


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7052] Fix partition key validation for custom key generators. [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10014:
URL: https://github.com/apache/hudi/pull/10014#issuecomment-1823617478

   
   ## CI report:
   
   * 5e60b3d12b40a04006d3697fa99538e9e494b96c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21108)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10122:
URL: https://github.com/apache/hudi/pull/10122#issuecomment-1823590565

   
   ## CI report:
   
   * cae921ac9d016d28b87139b5c0fd24debadf1592 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21109)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Making misc fixes to deltastreamer sources(S3 and GCS) [hudi]

2023-11-22 Thread via GitHub


nsivabalan commented on code in PR #10095:
URL: https://github.com/apache/hudi/pull/10095#discussion_r1402767000


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java:
##
@@ -70,6 +72,7 @@
 public class S3EventsHoodieIncrSource extends HoodieIncrSource {
 
   private static final Logger LOG = 
LoggerFactory.getLogger(S3EventsHoodieIncrSource.class);
+  private static final String EMPTY_STRING = "";

Review Comment:
   StringUtils already contains EMPTY_STRING



##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java:
##
@@ -133,7 +136,13 @@ public S3EventsHoodieIncrSource(
 this.srcPath = getStringWithAltKeys(props, HOODIE_SRC_BASE_PATH);
 this.numInstantsPerFetch = getIntWithAltKeys(props, 
NUM_INSTANTS_PER_FETCH);
 this.checkIfFileExists = getBooleanWithAltKeys(props, ENABLE_EXISTS_CHECK);
-this.fileFormat = getStringWithAltKeys(props, DATAFILE_FORMAT, true);
+
+// This is to ensure backward compatibility where we were using the
+// config SOURCE_FILE_FORMAT for file format in previous versions.
+this.fileFormat = Strings.isNullOrEmpty(getStringWithAltKeys(props, 
DATAFILE_FORMAT, EMPTY_STRING))
+? getStringWithAltKeys(props, SOURCE_FILE_FORMAT, true)
+: getStringWithAltKeys(props, DATAFILE_FORMAT, EMPTY_STRING);

Review Comment:
   For the last one:
   ```
   getStringWithAltKeys(props, DATAFILE_FORMAT, EMPTY_STRING);
   ```
   you can drop the last arg:
   ```
   getStringWithAltKeys(props, DATAFILE_FORMAT);
   ```
   
   The default value will not be picked up then.
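   
   Putting both suggestions together, the lookup could collapse to something like this (a sketch, untested; it assumes the `ConfigUtils.getStringWithAltKeys` and `StringUtils` helpers in `org.apache.hudi.common.util`, and the two `ConfigProperty` keys are passed in to keep the sketch self-contained):
   
   ```java
   import static org.apache.hudi.common.util.ConfigUtils.getStringWithAltKeys;

   import org.apache.hudi.common.config.ConfigProperty;
   import org.apache.hudi.common.config.TypedProperties;
   import org.apache.hudi.common.util.StringUtils;

   public class FileFormatResolution {

     // Untested sketch: prefer the new config key; fall back to the legacy one.
     // Without the boolean/default arg, no default value is applied to the first
     // lookup, and StringUtils.isNullOrEmpty replaces the local EMPTY_STRING field.
     static String resolveFileFormat(TypedProperties props,
                                     ConfigProperty<String> dataFileFormat,
                                     ConfigProperty<String> legacySourceFileFormat) {
       String fileFormat = getStringWithAltKeys(props, dataFileFormat);
       return StringUtils.isNullOrEmpty(fileFormat)
           ? getStringWithAltKeys(props, legacySourceFileFormat, true)
           : fileFormat;
     }
   }
   ```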
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] Asf site update disaster recovery doc [hudi]

2023-11-22 Thread via GitHub


sagarlakshmipathy opened a new pull request, #10163:
URL: https://github.com/apache/hudi/pull/10163

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   > added for loops in 2 places to avoid copy-pasting effort
   > fixed indentation in 3 places
   > added a note to make sure users replace the commit and savepoint timestamps
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   > doc update - no impact
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   > doc update - no impact
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   > doc update was the goal
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [local testing] Adequate tests were added if applicable
   - [NA] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Making misc fixes to deltastreamer sources(S3 and GCS) [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10095:
URL: https://github.com/apache/hudi/pull/10095#issuecomment-1823585243

   
   ## CI report:
   
   * a6476f06265d7600755e5597af173fea6db2954f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21093)
 
   * c1b5bd41ac1f4be476fb69f84f7197a27733eb23 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21110)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Making misc fixes to deltastreamer sources(S3 and GCS) [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10095:
URL: https://github.com/apache/hudi/pull/10095#issuecomment-1823579673

   
   ## CI report:
   
   * a6476f06265d7600755e5597af173fea6db2954f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21093)
 
   * c1b5bd41ac1f4be476fb69f84f7197a27733eb23 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6734] Add back HUDI-5409: Avoid file index and use fs view cache in COW input format [hudi]

2023-11-22 Thread via GitHub


nsivabalan commented on code in PR #9567:
URL: https://github.com/apache/hudi/pull/9567#discussion_r1402761264


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieCopyOnWriteTableInputFormat.java:
##
@@ -241,31 +246,86 @@ private List 
listStatusForSnapshotMode(JobConf job,
   boolean shouldIncludePendingCommits =
   HoodieHiveUtils.shouldIncludePendingCommits(job, 
tableMetaClient.getTableConfig().getTableName());
 
-  HiveHoodieTableFileIndex fileIndex =
-  new HiveHoodieTableFileIndex(
-  engineContext,
-  tableMetaClient,
-  props,
-  HoodieTableQueryType.SNAPSHOT,
-  partitionPaths,
-  queryCommitInstant,
-  shouldIncludePendingCommits);
-
-  Map> partitionedFileSlices = 
fileIndex.listFileSlices();
-
-  targetFiles.addAll(
-  partitionedFileSlices.values()
-  .stream()
-  .flatMap(Collection::stream)
-  .filter(fileSlice -> checkIfValidFileSlice(fileSlice))
-  .map(fileSlice -> createFileStatusUnchecked(fileSlice, 
fileIndex, tableMetaClient))
-  .collect(Collectors.toList())
-  );
+  if (HoodieTableMetadataUtil.isFilesPartitionAvailable(tableMetaClient) 
|| conf.getBoolean(ENABLE.key(), ENABLE.defaultValue())) {

Review Comment:
   The default value for metadata is false for the reader and true for the writer, so we should use the reader-side default here. 
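   
   A sketch of the suggested fix, assuming the reader-side default constant in `HoodieMetadataConfig` is reachable here (untested):
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hudi.common.config.HoodieMetadataConfig;

   public class ReaderSideMetadataDefault {

     // Untested sketch: resolve the metadata-enable flag with the reader-side
     // default (false) instead of the writer-side default (true).
     static boolean metadataEnabledForReader(Configuration conf) {
       return conf.getBoolean(
           HoodieMetadataConfig.ENABLE.key(),
           HoodieMetadataConfig.DEFAULT_METADATA_ENABLE_FOR_READERS);
     }
   }
   ```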



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10122:
URL: https://github.com/apache/hudi/pull/10122#issuecomment-1823515572

   
   ## CI report:
   
   * 597f6d7bd7134d635ad5a675bd398ba03faafef8 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21096)
 
   * cae921ac9d016d28b87139b5c0fd24debadf1592 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21109)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10122:
URL: https://github.com/apache/hudi/pull/10122#issuecomment-1823471080

   
   ## CI report:
   
   * 597f6d7bd7134d635ad5a675bd398ba03faafef8 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21096)
 
   * cae921ac9d016d28b87139b5c0fd24debadf1592 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7136] in the dfs catalog scenario, solve the problem of Primary key definit… [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10162:
URL: https://github.com/apache/hudi/pull/10162#issuecomment-1823447272

   
   ## CI report:
   
   * 64589da09eb106b1fc771ca77b64d30c81ae5970 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21106)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7052] Fix partition key validation for custom key generators. [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10014:
URL: https://github.com/apache/hudi/pull/10014#issuecomment-1823446823

   
   ## CI report:
   
   * 80725367a7e21160545ffa27ec1275a32e47e7c4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21092)
 
   * 5e60b3d12b40a04006d3697fa99538e9e494b96c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21108)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7052] Fix partition key validation for custom key generators. [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10014:
URL: https://github.com/apache/hudi/pull/10014#issuecomment-1823391577

   
   ## CI report:
   
   * 80725367a7e21160545ffa27ec1275a32e47e7c4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21092)
 
   * 5e60b3d12b40a04006d3697fa99538e9e494b96c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Schema evolution error: promoted data type from integer to double [hudi]

2023-11-22 Thread via GitHub


kenny291 opened a new issue, #3558:
URL: https://github.com/apache/hudi/issues/3558

   **Description**
   
   Hi all,
   I tested schema evolution by changing a column's data type from int to double, but it did not work with Hudi.
   (hudi doc: https://github.com/apache/hudi/blob/asf-site/website/docs/schema_evolution.md).
   I also tried changing the data type from float to double; it failed with the same error.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. init spark context
   ```
    ./spark-shell \
      --packages org.apache.spark:spark-avro_2.12:3.1.2,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0 \
      --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
      --conf 'spark.hadoop.fs.s3a.access.key=xx' \
      --conf 'spark.hadoop.fs.s3a.secret.key=xx' \
      --conf 'spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem' \
      --conf 'spark.hadoop.fs.s3a.endpoint=s3.amazonaws.com' \
      --conf 'spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider' \
      --conf 'spark.hadoop.fs.s3a.fast.upload=true' \
      --conf 'spark.hadoop.fs.s3a.multiobjectdelete.enable=false' \
      --conf 'spark.sql.parquet.filterPushdown=true' \
      --conf 'spark.sql.parquet.mergeSchema=false' \
      --conf 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2' \
      --conf 'spark.speculation=false' \
      --conf 'hive.metastore.schema.verification=false' \
      --conf 'hive.metastore.schema.verification.record.version=false' \
      --conf spark.sql.hive.convertMetastoreParquet=false
   ```
   
   2. create base hudi table
   ```
   import org.apache.hudi.QuickstartUtils._
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   import org.apache.spark.sql.types._
   import org.apache.spark.sql.Row
   
   val tableName = "hudi_trips_cow"
   val basePath = "s3a://data-lake/hudi_test/hudi_trips_cow_schema_change"
    val schema = StructType(Array(
      StructField("rowId", StringType, true),
      StructField("partitionId", StringType, true),
      StructField("preComb", LongType, true),
      StructField("name", StringType, true),
      StructField("versionId", StringType, true),
      StructField("intToLong", IntegerType, true), // ok
      StructField("intToDouble", IntegerType, true),
      StructField("longToFloat", LongType, true), // ok
      // StructField("longToDouble", IntegerType, true),
      StructField("floatToDouble", FloatType, true)
    )) // 9 cols
   
    val data1 = Seq(Row("row_1", "part_0", 0L, "bob", "v_0", 0, 1, 1L, 1.1f),
                    Row("row_2", "part_0", 0L, "john", "v_0", 0, 1, 2L, 1.2f),
                    Row("row_3", "part_3", 0L, "tom", "v_0", 0, 1, 3L, 1.3f))
   
   var dfFromData1 = spark.createDataFrame(data1, schema)
   dfFromData1.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
  option(RECORDKEY_FIELD_OPT_KEY, "rowId").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionId").
  option("hoodie.index.type","SIMPLE").
  option("hoodie.datasource.write.hive_style_partitioning", true).
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)
   ```
   
    3. Change the `intToDouble` column's data type from int to double and append new data to the old table.
   ```
   // Int to double
    val newSchema = StructType(Array(
      StructField("rowId", StringType, true),
      StructField("partitionId", StringType, true),
      StructField("preComb", LongType, true),
      StructField("name", StringType, true),
      StructField("versionId", StringType, true),
      StructField("intToLong", IntegerType, true),
      StructField("intToDouble", DoubleType, true),
      StructField("longToFloat", LongType, true),
      // StructField("longToDouble", IntegerType, true),
      StructField("floatToDouble", FloatType, true)
    )) // 9 cols
   
    val data2 = Seq(Row("row_2", "part_0", 5L, "john", "v_3", 3, 1D, 2L, 1.8f),
                    Row("row_5", "part_0", 5L, "maroon", "v_2", 2, 1D, 2L, 1.8f),
                    Row("row_9", "part_9", 5L, "michael", "v_2", 2, 1D, 2L, 1.8f))
   
   var dfFromData2 = spark.createDataFrame(data2, newSchema)
   
   dfFromData2.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
  option(RECORDKEY_FIELD_OPT_KEY, "rowId").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionId").
  option("hoodie.datasource.write.hive_style_partitioning", true).
  option("hoodie.index.type","SIMPLE").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
   ```
   
   4. Read hudi table 

Re: [PR] [HUDI-7135] Spark reads hudi table error when flink creates the table without pre… [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10157:
URL: https://github.com/apache/hudi/pull/10157#issuecomment-1823352506

   
   ## CI report:
   
   * 3b24d4130099aab67c76de81f77701c730f2e78a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21105)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7123] Improve CI scripts (#10136)

2023-11-22 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new f88a73f09e7 [HUDI-7123] Improve CI scripts (#10136)
f88a73f09e7 is described below

commit f88a73f09e753ffe0a8029b490a8a430b05b8eaf
Author: Y Ethan Guo 
AuthorDate: Wed Nov 22 10:48:48 2023 -0800

[HUDI-7123] Improve CI scripts (#10136)

Improves the CI scripts in the following aspects:
- Removes `hudi-common` tests from `test-spark` job in GH CI as they are 
already covered by Azure CI
- Removes unnecessary bundle validation jobs and adds new bundle validation 
images (`flink1153hive313spark323`, `flink1162hive313spark331`)
- Updates `validate-release-candidate-bundles` jobs
- Moves functional tests of `hudi-spark-datasource/hudi-spark` from job 4 
(3 hours) to job 2 (1 hour) in Azure CI to rebalance the finish time.
---
 .github/workflows/bot.yml  | 30 +-
 azure-pipelines-20230430.yml   |  6 +++--
 .../base/build_flink1153hive313spark323.sh | 26 +++
 .../base/build_flink1162hive313spark331.sh | 26 +++
 packaging/bundle-validation/ci_run.sh  | 20 ++-
 5 files changed, 88 insertions(+), 20 deletions(-)

diff --git a/.github/workflows/bot.yml b/.github/workflows/bot.yml
index cff377ed13f..67c7ac16eaa 100644
--- a/.github/workflows/bot.yml
+++ b/.github/workflows/bot.yml
@@ -98,7 +98,7 @@ jobs:
   SCALA_PROFILE: ${{ matrix.scalaProfile }}
   SPARK_PROFILE: ${{ matrix.sparkProfile }}
 run:
-  mvn clean install -T 2 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-DskipTests=true $MVN_ARGS -am -pl 
"hudi-examples/hudi-examples-spark,hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES"
+  mvn clean install -T 2 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-DskipTests=true $MVN_ARGS -am -pl 
"hudi-examples/hudi-examples-spark,$SPARK_COMMON_MODULES,$SPARK_MODULES"
   - name: Quickstart Test
 env:
   SCALA_PROFILE: ${{ matrix.scalaProfile }}
@@ -112,7 +112,7 @@ jobs:
   SPARK_MODULES: ${{ matrix.sparkModules }}
 if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 
as it's covered by Azure CI
 run:
-  mvn test -Punit-tests -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl 
"hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
+  mvn test -Punit-tests -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl 
"$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
   - name: FT - Spark
 env:
   SCALA_PROFILE: ${{ matrix.scalaProfile }}
@@ -299,19 +299,13 @@ jobs:
   - flinkProfile: 'flink1.18'
 sparkProfile: 'spark3.4'
 sparkRuntime: 'spark3.4.0'
-  - flinkProfile: 'flink1.18'
-sparkProfile: 'spark3.3'
-sparkRuntime: 'spark3.3.2'
   - flinkProfile: 'flink1.17'
 sparkProfile: 'spark3.3'
 sparkRuntime: 'spark3.3.2'
   - flinkProfile: 'flink1.16'
-sparkProfile: 'spark3.3'
-sparkRuntime: 'spark3.3.2'
-  - flinkProfile: 'flink1.15'
 sparkProfile: 'spark3.3'
 sparkRuntime: 'spark3.3.1'
-  - flinkProfile: 'flink1.14'
+  - flinkProfile: 'flink1.15'
 sparkProfile: 'spark3.2'
 sparkRuntime: 'spark3.2.3'
   - flinkProfile: 'flink1.14'
@@ -380,18 +374,30 @@ jobs:
 strategy:
   matrix:
 include:
-  - flinkProfile: 'flink1.16'
+  - flinkProfile: 'flink1.18'
 sparkProfile: 'spark3'
+sparkRuntime: 'spark3.5.0'
+  - flinkProfile: 'flink1.18'
+sparkProfile: 'spark3.5'
+sparkRuntime: 'spark3.5.0'
+  - flinkProfile: 'flink1.18'
+sparkProfile: 'spark3.4'
+sparkRuntime: 'spark3.4.0'
+  - flinkProfile: 'flink1.17'
+sparkProfile: 'spark3.3'
 sparkRuntime: 'spark3.3.2'
-  - flinkProfile: 'flink1.15'
+  - flinkProfile: 'flink1.16'
 sparkProfile: 'spark3.3'
 sparkRuntime: 'spark3.3.1'
-  - flinkProfile: 'flink1.14'
+  - flinkProfile: 'flink1.15'
 sparkProfile: 'spark3.2'
 sparkRuntime: 'spark3.2.3'
   - flinkProfile: 'flink1.14'
 sparkProfile: 'spark3.1'
 sparkRuntime: 'spark3.1.3'
+  - flinkProfile: 'flink1.14'
+sparkProfile: 'spark3.0'
+sparkRuntime: 'spark3.0.2'
   - flinkProfile: 'flink1.14'
 sparkProfile: 'spark'
 sparkRuntime: 'spark2.4.8'
diff --git a/azure-pipelines-20230430.yml b/azure-pipelines-20230430.yml
index 21c6d932ef9..c2a5f9d5a44 100644
--- a/azure-pipelines-20230430.yml
+++ 

Re: [PR] [HUDI-7123] Improve CI scripts [hudi]

2023-11-22 Thread via GitHub


yihua merged PR #10136:
URL: https://github.com/apache/hudi/pull/10136


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7136] in the dfs catalog scenario, solve the problem of Primary key definit… [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10162:
URL: https://github.com/apache/hudi/pull/10162#issuecomment-1823196032

   
   ## CI report:
   
   * 64589da09eb106b1fc771ca77b64d30c81ae5970 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21106)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7135] Spark reads hudi table error when flink creates the table without pre… [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10157:
URL: https://github.com/apache/hudi/pull/10157#issuecomment-1823195959

   
   ## CI report:
   
   * 1ecd7d0aaf9a406be3d134a0202911a7b32f05bd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21101)
 
   * 3b24d4130099aab67c76de81f77701c730f2e78a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21105)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7135) Spark reads hudi table error when flink creates the table without preCombine fields by catalog or factory

2023-11-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7135:
-
Labels: pull-request-available  (was: )

> Spark reads hudi table error when flink creates the table without preCombine 
> fields by catalog or factory
> -
>
> Key: HUDI-7135
> URL: https://issues.apache.org/jira/browse/HUDI-7135
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: 陈磊
>Priority: Major
>  Labels: pull-request-available
>
> Create a table through dfs catalog, hms catalog, or sink ddl, and then query 
> the data of the table through spark, and an exception occurs:
> java.util.NoSuchElementException: key not found: ts
> demo:
>  1. create a table through hms catalog:
> {panel:title=hms catalog create table}
> CREATE CATALOG hudi_catalog WITH(
> 'type' = 'hudi',
> 'mode' = 'hms'
> );
> CREATE TABLE hudi_catalog.`default`.ct1
> (
>   f1 string,
>   f2 string
> ) WITH (
>   'connector' = 'hudi',
>   'path' = 'file:///Users/x/x/others/data/hudi-warehouse/ct1',
>   'table.type' = 'COPY_ON_WRITE',
>   'write.operation' = 'insert'
> );
> {panel}
> 2. spark query
> {panel:title=spark query}
> select * from ct1
> {panel}
> 3. exception
> {panel:title=exception}
> java.util.NoSuchElementException: key not found: ts
> {panel}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7135] Spark reads hudi table error when flink creates the table without pre… [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10157:
URL: https://github.com/apache/hudi/pull/10157#issuecomment-1823184561

   
   ## CI report:
   
   * 1ecd7d0aaf9a406be3d134a0202911a7b32f05bd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21101)
 
   * 3b24d4130099aab67c76de81f77701c730f2e78a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7136] in the dfs catalog scenario, solve the problem of Primary key definit… [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10162:
URL: https://github.com/apache/hudi/pull/10162#issuecomment-1823184657

   
   ## CI report:
   
   * 64589da09eb106b1fc771ca77b64d30c81ae5970 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Remove unused import [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10159:
URL: https://github.com/apache/hudi/pull/10159#issuecomment-1823173019

   
   ## CI report:
   
   * 72e6a610b88f3d269477fd967b970c48fbc6f387 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21103)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7136) in the dfs catalog scenario, solve the problem of Primary key definition is missing

2023-11-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7136:
-
Labels: pull-request-available  (was: )

> in the dfs catalog scenario, solve the problem of Primary key definition is 
> missing
> ---
>
> Key: HUDI-7136
> URL: https://issues.apache.org/jira/browse/HUDI-7136
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: 陈磊
>Priority: Major
>  Labels: pull-request-available
>
> in the dfs catalog scenario, solve the problem of Primary key definition is 
> missing
> demo:
> {code:java}
> // sql
> CREATE CATALOG hudi_catalog WITH(
> 'type' = 'hudi',
> 'catalog.path' = 'file:///Users/x/x/others/data/hudi-warehouse/',
> 'mode' = 'dfs'
> );
> CREATE TABLE IF NOT EXISTS hudi_catalog.tmp.ctn4
> (
> f1 string,
> f2 string,
> primary key (f1) not enforced
> ) WITH (
> 'connector' = 'hudi',
> 'path' = 'file:///Users/x/x/others/data/hudi-warehouse/ctn4',
> 'table.type' = 'MERGE_ON_READ',
> 'write.operation' = 'upsert',
> 'hoodie.datasource.write.recordkey.field' = 'f1'
> ) ;
>  {code}
> exception:
> {code:java}
> Primary key definition is missing {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
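
For context, a minimal sketch of the behavior presumably intended here: accept 
the record key from either the SQL PRIMARY KEY constraint or the 
'hoodie.datasource.write.recordkey.field' option, instead of failing when the 
catalog drops the constraint. This is a hypothetical illustration, not the 
actual HUDI-7136 patch.

{code:java}
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch, not the actual HUDI-7136 patch.
public class RecordKeyResolver {

  static String resolveRecordKey(Optional<String> sqlPrimaryKey, Map<String, String> options) {
    return sqlPrimaryKey
        .or(() -> Optional.ofNullable(options.get("hoodie.datasource.write.recordkey.field")))
        .orElseThrow(() -> new IllegalArgumentException("Primary key definition is missing"));
  }

  public static void main(String[] args) {
    // The ctn4 demo declares both the SQL primary key and the write option,
    // so either path alone should be enough to resolve "f1".
    System.out.println(resolveRecordKey(Optional.of("f1"), Map.of()));
    System.out.println(resolveRecordKey(Optional.empty(),
        Map.of("hoodie.datasource.write.recordkey.field", "f1")));
  }
}
{code}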


[PR] [HUDI-7136] in the dfs catalog scenario, solve the problem of Primary key definit… [hudi]

2023-11-22 Thread via GitHub


empcl opened a new pull request, #10162:
URL: https://github.com/apache/hudi/pull/10162

   …ion is missing
   
   ### Change Logs
   
   in the dfs catalog scenario, solve the problem of Primary key definition is 
missing
   ### Impact
   
   no
   
   ### Risk level (write none, low medium or high below)
   
   no
   
   ### Documentation Update
   
   no
   
   ### Contributor's checklist
   
   no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7136) in the dfs catalog scenario, solve the problem of Primary key definition is missing

2023-11-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HUDI-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

陈磊 updated HUDI-7136:
-
Description: 
in the dfs catalog scenario, solve the problem of Primary key definition is 
missing

demo:
{code:java}
// sql
CREATE CATALOG hudi_catalog WITH(
'type' = 'hudi',
'catalog.path' = 
'file:///Users/chenlei677/chenlei677/others/data/hudi-warehouse/',
'mode' = 'dfs'
);

CREATE TABLE IF NOT EXISTS hudi_catalog.tmp.ctn4
(
f1 string,
f2 string,
primary key (f1) not enforced
) WITH (
'connector' = 'hudi',
'path' = 
'file:///Users/chenlei677/chenlei677/others/data/hudi-warehouse/ctn4',
'table.type' = 'MERGE_ON_READ',
'write.operation' = 'upsert',
'hoodie.datasource.write.recordkey.field' = 'f1'
) ;
 {code}
exception:

  was:
in the dfs catalog scenario, solve the problem of Primary key definition is 
missing

demo:


> in the dfs catalog scenario, solve the problem of Primary key definition is 
> missing
> ---
>
> Key: HUDI-7136
> URL: https://issues.apache.org/jira/browse/HUDI-7136
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: 陈磊
>Priority: Major
>
> in the dfs catalog scenario, solve the problem of Primary key definition is 
> missing
> demo:
> {code:java}
> // sql
> CREATE CATALOG hudi_catalog WITH(
> 'type' = 'hudi',
> 'catalog.path' = 
> 'file:///Users/chenlei677/chenlei677/others/data/hudi-warehouse/',
> 'mode' = 'dfs'
> );
> CREATE TABLE IF NOT EXISTS hudi_catalog.tmp.ctn4
> (
> f1 string,
> f2 string,
> primary key (f1) not enforced
> ) WITH (
> 'connector' = 'hudi',
> 'path' = 
> 'file:///Users/chenlei677/chenlei677/others/data/hudi-warehouse/ctn4',
> 'table.type' = 'MERGE_ON_READ',
> 'write.operation' = 'upsert',
> 'hoodie.datasource.write.recordkey.field' = 'f1'
> ) ;
>  {code}
> exception:



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7136) in the dfs catalog scenario, solve the problem of Primary key definition is missing

2023-11-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HUDI-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

陈磊 updated HUDI-7136:
-
Description: 
in the dfs catalog scenario, solve the problem of Primary key definition is 
missing

demo:
{code:java}
// sql
CREATE CATALOG hudi_catalog WITH(
'type' = 'hudi',
'catalog.path' = 'file:///Users/x/x/others/data/hudi-warehouse/',
'mode' = 'dfs'
);

CREATE TABLE IF NOT EXISTS hudi_catalog.tmp.ctn4
(
f1 string,
f2 string,
primary key (f1) not enforced
) WITH (
'connector' = 'hudi',
'path' = 'file:///Users/x/x/others/data/hudi-warehouse/ctn4',
'table.type' = 'MERGE_ON_READ',
'write.operation' = 'upsert',
'hoodie.datasource.write.recordkey.field' = 'f1'
) ;
 {code}
exception:
{code:java}
Primary key definition is missing {code}

  was:
in the dfs catalog scenario, solve the problem of Primary key definition is 
missing

demo:
{code:java}
// sql
CREATE CATALOG hudi_catalog WITH(
'type' = 'hudi',
'catalog.path' = 
'file:///Users/chenlei677/chenlei677/others/data/hudi-warehouse/',
'mode' = 'dfs'
);

CREATE TABLE IF NOT EXISTS hudi_catalog.tmp.ctn4
(
f1 string,
f2 string,
primary key (f1) not enforced
) WITH (
'connector' = 'hudi',
'path' = 
'file:///Users/chenlei677/chenlei677/others/data/hudi-warehouse/ctn4',
'table.type' = 'MERGE_ON_READ',
'write.operation' = 'upsert',
'hoodie.datasource.write.recordkey.field' = 'f1'
) ;
 {code}
exception:


> in the dfs catalog scenario, solve the problem of Primary key definition is 
> missing
> ---
>
> Key: HUDI-7136
> URL: https://issues.apache.org/jira/browse/HUDI-7136
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: 陈磊
>Priority: Major
>
> in the dfs catalog scenario, solve the problem of Primary key definition is 
> missing
> demo:
> {code:java}
> // sql
> CREATE CATALOG hudi_catalog WITH(
> 'type' = 'hudi',
> 'catalog.path' = 'file:///Users/x/x/others/data/hudi-warehouse/',
> 'mode' = 'dfs'
> );
> CREATE TABLE IF NOT EXISTS hudi_catalog.tmp.ctn4
> (
> f1 string,
> f2 string,
> primary key (f1) not enforced
> ) WITH (
> 'connector' = 'hudi',
> 'path' = 'file:///Users/x/x/others/data/hudi-warehouse/ctn4',
> 'table.type' = 'MERGE_ON_READ',
> 'write.operation' = 'upsert',
> 'hoodie.datasource.write.recordkey.field' = 'f1'
> ) ;
>  {code}
> exception:
> {code:java}
> Primary key definition is missing {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7136) in the dfs catalog scenario, solve the problem of Primary key definition is missing

2023-11-22 Thread Jira
陈磊 created HUDI-7136:


 Summary: in the dfs catalog scenario, solve the problem of Primary 
key definition is missing
 Key: HUDI-7136
 URL: https://issues.apache.org/jira/browse/HUDI-7136
 Project: Apache Hudi
  Issue Type: Bug
Reporter: 陈磊


in the dfs catalog scenario, solve the problem of Primary key definition is 
missing

demo:



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [I] Cannot encode decimal with precision 15 as max precision 14 [hudi]

2023-11-22 Thread via GitHub


ad1happy2go commented on issue #10160:
URL: https://github.com/apache/hudi/issues/10160#issuecomment-1823026376

   @njalan I remember a similar issue before as well. That issue got fixed in 
this PR: https://github.com/apache/hudi/pull/8063
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7135) Spark reads hudi table error when flink creates the table without preCombine fields by catalog or factory

2023-11-22 Thread Jira
陈磊 created HUDI-7135:


 Summary: Spark reads hudi table error when flink creates the table 
without preCombine fields by catalog or factory
 Key: HUDI-7135
 URL: https://issues.apache.org/jira/browse/HUDI-7135
 Project: Apache Hudi
  Issue Type: Bug
Reporter: 陈磊


Create a table through the dfs catalog, hms catalog, or sink DDL, and then 
query the table's data through Spark; an exception occurs:
java.util.NoSuchElementException: key not found: ts

demo:
 1. create a table through hms catalog:

{panel:title=hms catalog create table}
CREATE CATALOG hudi_catalog WITH(
'type' = 'hudi',
'mode' = 'hms'
);

CREATE TABLE hudi_catalog.`default`.ct1
(
f1 string,
f2 string
) WITH (
'connector' = 'hudi',
'path' = 'file:///Users/x/x/others/data/hudi-warehouse/ct1',
'table.type' = 'COPY_ON_WRITE',
'write.operation' = 'insert'
);
{panel}

2. spark query

{panel:title=spark query}
select * from ct1
{panel}

3. exception

{panel:title=exception}
java.util.NoSuchElementException: key not found: ts
{panel}






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7034] Refresh index fix - remove cached file slices within part… [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10151:
URL: https://github.com/apache/hudi/pull/10151#issuecomment-1822983761

   
   ## CI report:
   
   * 190b9df539423cb5da8f01b400426d9e97f7bab4 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21098)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7004] Add support of snapshotLoadQuerySplitter in s3/gcs sources (#10152)

2023-11-22 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 38c87b7ebe1 [HUDI-7004] Add support of snapshotLoadQuerySplitter in 
s3/gcs sources (#10152)
38c87b7ebe1 is described below

commit 38c87b7ebe148e8870db83be433376ad89b9c048
Author: harshal 
AuthorDate: Wed Nov 22 20:53:42 2023 +0530

[HUDI-7004] Add support of snapshotLoadQuerySplitter in s3/gcs sources 
(#10152)
---
 .../apache/hudi/common/config/TypedProperties.java |  5 ++
 .../sources/GcsEventsHoodieIncrSource.java |  7 +-
 .../hudi/utilities/sources/HoodieIncrSource.java   |  6 +-
 .../sources/S3EventsHoodieIncrSource.java  |  9 ++-
 .../sources/SnapshotLoadQuerySplitter.java |  9 +++
 .../utilities/sources/helpers/QueryRunner.java | 35 +
 .../sources/TestGcsEventsHoodieIncrSource.java | 85 --
 .../sources/TestS3EventsHoodieIncrSource.java  | 78 ++--
 8 files changed, 198 insertions(+), 36 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/config/TypedProperties.java b/hudi-common/src/main/java/org/apache/hudi/common/config/TypedProperties.java
index 3db8210cade..86b7f4cc457 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/config/TypedProperties.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/config/TypedProperties.java
@@ -18,6 +18,7 @@
 
 package org.apache.hudi.common.config;
 
+import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.StringUtils;
 
 import java.io.Serializable;
@@ -78,6 +79,10 @@ public class TypedProperties extends Properties implements Serializable {
     return containsKey(property) ? getProperty(property) : defaultValue;
   }
 
+  public Option<String> getNonEmptyStringOpt(String property, String defaultValue) {
+    return Option.ofNullable(StringUtils.emptyToNull(getString(property, defaultValue)));
+  }
+
   public List<String> getStringList(String property, String delimiter, List<String> defaultVal) {
     if (!containsKey(property)) {
       return defaultVal;
diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/GcsEventsHoodieIncrSource.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/GcsEventsHoodieIncrSource.java
index d09bad71916..a06130d3972 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/GcsEventsHoodieIncrSource.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/GcsEventsHoodieIncrSource.java
@@ -114,6 +114,7 @@ public class GcsEventsHoodieIncrSource extends HoodieIncrSource {
   private final CloudDataFetcher gcsObjectDataFetcher;
   private final QueryRunner queryRunner;
   private final Option<SchemaProvider> schemaProvider;
+  private final Option<SnapshotLoadQuerySplitter> snapshotLoadQuerySplitter;
 
 
   public static final String GCS_OBJECT_KEY = "name";
@@ -145,6 +146,7 @@ public class GcsEventsHoodieIncrSource extends HoodieIncrSource {
     this.gcsObjectDataFetcher = gcsObjectDataFetcher;
     this.queryRunner = queryRunner;
     this.schemaProvider = Option.ofNullable(schemaProvider);
+    this.snapshotLoadQuerySplitter = SnapshotLoadQuerySplitter.getInstance(props);
 
 LOG.info("srcPath: " + srcPath);
 LOG.info("missingCheckpointStrategy: " + missingCheckpointStrategy);
@@ -171,8 +173,9 @@ public class GcsEventsHoodieIncrSource extends HoodieIncrSource {
       return Pair.of(Option.empty(), queryInfo.getStartInstant());
     }
 
-    Dataset<Row> cloudObjectMetadataDF = queryRunner.run(queryInfo);
-    Dataset<Row> filteredSourceData = gcsObjectMetadataFetcher.applyFilter(cloudObjectMetadataDF);
+    Pair<QueryInfo, Dataset<Row>> queryInfoDatasetPair = queryRunner.run(queryInfo, snapshotLoadQuerySplitter);
+    Dataset<Row> filteredSourceData = gcsObjectMetadataFetcher.applyFilter(queryInfoDatasetPair.getRight());
+    queryInfo = queryInfoDatasetPair.getLeft();
     LOG.info("Adjusting end checkpoint:" + queryInfo.getEndInstant() + " based on sourceLimit :" + sourceLimit);
     Pair<CloudObjectIncrCheckpoint, Option<Dataset<Row>>> checkPointAndDataset =
         IncrSourceHelper.filterAndGenerateCheckpointBasedOnSourceLimit(
diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java
index 1d302fa106b..f87e5c231bf 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java
@@ -25,7 +25,6 @@ import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.table.timeline.TimelineUtils.HollowCommitHandling;
 import org.apache.hudi.common.util.CollectionUtils;
 import org.apache.hudi.common.util.Option;
-import org.apache.hudi.common.util.ReflectionUtils;
 import org.apache.hudi.common.util.collection.Pair;
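
For reference, a minimal sketch of how the getNonEmptyStringOpt helper added 
in this commit behaves, inferred from the TypedProperties hunk above; the 
property key below is purely illustrative.

{code:java}
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.common.util.Option;

public class GetNonEmptyStringOptDemo {
  public static void main(String[] args) {
    TypedProperties props = new TypedProperties();

    // An unset key and a key set to "" both collapse to Option.empty(),
    // so callers can treat "missing" and "blank" the same way.
    Option<String> unset = props.getNonEmptyStringOpt("some.illustrative.key", null);
    props.setProperty("some.illustrative.key", "");
    Option<String> blank = props.getNonEmptyStringOpt("some.illustrative.key", null);

    System.out.println(unset.isPresent());  // false
    System.out.println(blank.isPresent());  // false
  }
}
{code}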
 

Re: [PR] [HUDI-7004] Add support of snapshotLoadQuerySplitter in s3/gcs sources [hudi]

2023-11-22 Thread via GitHub


nsivabalan merged PR #10152:
URL: https://github.com/apache/hudi/pull/10152


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated (cda9dbca206 -> d0edfb55ca2)

2023-11-22 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from cda9dbca206 [HUDI-7129] Fix bug when upgrade from table version three 
using UpgradeOrDowngradeProcedure (#10147)
 add d0edfb55ca2 [HUDI-6961] Fixing DefaultHoodieRecordPayload to honor 
deletion based on meta field as well as custome delete marker (#10150)

No new revisions were added by this update.

Summary of changes:
 .../common/model/DefaultHoodieRecordPayload.java   | 29 --
 .../model/TestDefaultHoodieRecordPayload.java  |  9 ++-
 2 files changed, 35 insertions(+), 3 deletions(-)



Re: [PR] [HUDI-6961] Fixing DefaultHoodieRecordPayload to honor deletion based on meta field as well as custome delete marker [hudi]

2023-11-22 Thread via GitHub


nsivabalan merged PR #10150:
URL: https://github.com/apache/hudi/pull/10150


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Spark reads hudi table error when flink creates the table without pre… [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10157:
URL: https://github.com/apache/hudi/pull/10157#issuecomment-1822968869

   
   ## CI report:
   
   * 1ecd7d0aaf9a406be3d134a0202911a7b32f05bd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21101)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7110] Add call procedure for show column stats information [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10120:
URL: https://github.com/apache/hudi/pull/10120#issuecomment-1822968498

   
   ## CI report:
   
   * a7f986bd546e2c38c241ee743734dbec491b0351 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21099)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] in the dfs catalog scenario, solve the problem of Primary key definit… [hudi]

2023-11-22 Thread via GitHub


empcl closed pull request #10161: in the dfs catalog scenario, solve the 
problem of Primary key definit…
URL: https://github.com/apache/hudi/pull/10161


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] in the dfs catalog scenario, solve the problem of Primary key definit… [hudi]

2023-11-22 Thread via GitHub


empcl opened a new pull request, #10161:
URL: https://github.com/apache/hudi/pull/10161

   …ion is missing
   
   ### Change Logs
   
   in the dfs catalog scenario, solve the problem of Primary key definition is 
missing
   
   ### Impact
   
   no
   
   ### Risk level (write none, low medium or high below)
   
   no
   
   ### Documentation Update
   
   no
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10158:
URL: https://github.com/apache/hudi/pull/10158#issuecomment-1822878847

   
   ## CI report:
   
   * c8c49d513c8b91b2ff8462f6db25203ba563d39a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21102)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10122:
URL: https://github.com/apache/hudi/pull/10122#issuecomment-1822849025

   
   ## CI report:
   
   * 597f6d7bd7134d635ad5a675bd398ba03faafef8 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21096)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Cannot encode decimal with precision 15 as max precision 14 [hudi]

2023-11-22 Thread via GitHub


njalan opened a new issue, #10160:
URL: https://github.com/apache/hudi/issues/10160

   Got the below error message when trying to load data from PostgreSQL into 
Hudi; the same load works fine on Hudi 0.9.
   
   
   Caused by: org.apache.hudi.exception.HoodieException: 
org.apache.avro.AvroTypeException: Cannot encode decimal with precision 15 as 
max precision 14
   at 
org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:73)
   at 
org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:154)
   ... 31 more
   Caused by: org.apache.avro.AvroTypeException: Cannot encode decimal with 
precision 15 as max precision 14
   at 
org.apache.avro.Conversions$DecimalConversion.validate(Conversions.java:140)
   at 
org.apache.avro.Conversions$DecimalConversion.toFixed(Conversions.java:104)
   at 
org.apache.hudi.avro.HoodieAvroUtils.rewritePrimaryTypeWithDiffSchemaType(HoodieAvroUtils.java:994)
   at 
org.apache.hudi.avro.HoodieAvroUtils.rewritePrimaryType(HoodieAvroUtils.java:921)
   at 
org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:866)
   at 
org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:864)
   at 
org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:822)
   at 
org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:786)
   at 
org.apache.hudi.common.model.HoodieAvroIndexedRecord.rewriteRecordWithNewSchema(HoodieAvroIndexedRecord.java:123)
   at 
org.apache.hudi.common.model.HoodieRecord.rewriteRecordWithNewSchema(HoodieRecord.java:369)
   at 
org.apache.hudi.table.action.commit.HoodieMergeHelper.lambda$runMerge$1(HoodieMergeHelper.java:143)
   at 
org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:66)
 
   
   I think the source table's column of type **numeric(14, 4)** causes the 
issue.
   
   Environment Description
   
   Hudi version : 0.13.1
   
   Spark version : 3.0.1
   
   Hive version : 3.1
   
   Hadoop version : 3.2.2
   
   Storage (HDFS/S3/GCS..) :
   
   Running on Docker? : no


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
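
For reference, a standalone sketch (not from this issue) that reproduces how 
Avro's DecimalConversion enforces the declared precision; it assumes only 
plain Apache Avro on the classpath.

{code:java}
import java.math.BigDecimal;

import org.apache.avro.Conversions;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class DecimalPrecisionDemo {
  public static void main(String[] args) {
    // A fixed schema declared as decimal(14, 4), mirroring the numeric(14, 4)
    // source column mentioned in the issue.
    Schema schema = LogicalTypes.decimal(14, 4)
        .addToSchema(Schema.createFixed("dec", null, null, 8));

    Conversions.DecimalConversion conversion = new Conversions.DecimalConversion();

    // 12345678901.2345 has precision 15 (11 integer digits + 4 fraction digits),
    // so Avro rejects it against the declared max precision of 14.
    BigDecimal value = new BigDecimal("12345678901.2345");
    conversion.toFixed(value, schema, schema.getLogicalType());
    // throws org.apache.avro.AvroTypeException:
    //   Cannot encode decimal with precision 15 as max precision 14
  }
}
{code}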



Re: [PR] [MINOR] Remove unused import [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10159:
URL: https://github.com/apache/hudi/pull/10159#issuecomment-1822776587

   
   ## CI report:
   
   * 72e6a610b88f3d269477fd967b970c48fbc6f387 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21103)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10158:
URL: https://github.com/apache/hudi/pull/10158#issuecomment-1822776507

   
   ## CI report:
   
   * c8c49d513c8b91b2ff8462f6db25203ba563d39a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21102)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Remove unused import [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10159:
URL: https://github.com/apache/hudi/pull/10159#issuecomment-1822763645

   
   ## CI report:
   
   * 72e6a610b88f3d269477fd967b970c48fbc6f387 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync [hudi]

2023-11-22 Thread via GitHub


hudi-bot commented on PR #10158:
URL: https://github.com/apache/hudi/pull/10158#issuecomment-1822763563

   
   ## CI report:
   
   * c8c49d513c8b91b2ff8462f6db25203ba563d39a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


