Re: [PR] [HUDI-7034] Refresh index fix - remove cached file slices within part… [hudi]
hudi-bot commented on PR #10151:
URL: https://github.com/apache/hudi/pull/10151#issuecomment-1823920564

## CI report:

* 190b9df539423cb5da8f01b400426d9e97f7bab4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21098)
* a41d81f6116784b2f006ed4e58ac7a755410f848 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]
hudi-bot commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1823920316

## CI report:

* 35fed0de0587b411f9470e1c69db43501df5a725 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21121)
* 6024de8ab05dd38e9cdb58afc10b70991542c392 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21122)
Re: [PR] [HUDI-7135] Spark reads hudi table error when flink creates the table without pre… [hudi]
hudi-bot commented on PR #10157:
URL: https://github.com/apache/hudi/pull/10157#issuecomment-1823920594

## CI report:

* 3b24d4130099aab67c76de81f77701c730f2e78a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21105)
Re: [PR] [HUDI-7135] Spark reads hudi table error when flink creates the table without pre… [hudi]
empcl commented on PR #10157:
URL: https://github.com/apache/hudi/pull/10157#issuecomment-1823916790

@hudi-bot run azure
Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]
hudi-bot commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1823913218

## CI report:

* 35fed0de0587b411f9470e1c69db43501df5a725 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21121)
* 6024de8ab05dd38e9cdb58afc10b70991542c392 UNKNOWN
Re: [PR] [HUDI-7086] Scaling gcs event source [hudi]
hudi-bot commented on PR #10073:
URL: https://github.com/apache/hudi/pull/10073#issuecomment-1823907351

## CI report:

* 868ba59ecf1a08d7b73a7121429103c2134b291f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21119)
Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]
hudi-bot commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1823907225

## CI report:

* 35fed0de0587b411f9470e1c69db43501df5a725 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21121)
Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]
danny0405 commented on code in PR #10002:
URL: https://github.com/apache/hudi/pull/10002#discussion_r1402973135

## hudi-common/src/main/java/org/apache/hudi/common/table/view/RocksDbBasedFileSystemView.java:

```java
@@ -553,6 +553,10 @@ protected void removeReplacedFileIdsAtInstants(Set instants) {
     );
   }

+  protected boolean hasReplacedFilesInPartition(String partitionPath) {
+    throw new UnsupportedOperationException("isReplacedFileExistWithinSpecifiedPartition() is not supported for RocksDbBasedFileSystemView!");
```

Review Comment: We must support it correctly. Actually there is no need to override it.
Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]
danny0405 commented on code in PR #10002:
URL: https://github.com/apache/hudi/pull/10002#discussion_r1402972184

## hudi-common/src/main/java/org/apache/hudi/common/table/view/RemoteHoodieTableFileSystemView.java:

```java
@@ -202,6 +207,13 @@ private Map getParamsWithPartitionPath(String partitionPath) {
     return paramsMap;
   }

+  private Map getParamsWithPartitionPaths(List partitionPaths) {
```

Review Comment: This method is useless now, right?
Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]
danny0405 commented on code in PR #10002:
URL: https://github.com/apache/hudi/pull/10002#discussion_r1402970319

## hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:

```java
@@ -598,35 +603,49 @@ private FileSlice filterUncommittedLogs(FileSlice fileSlice) {
   }

   protected HoodieFileGroup addBootstrapBaseFileIfPresent(HoodieFileGroup fileGroup) {
+    return addBootstrapBaseFileIfPresent(fileGroup, this::getBootstrapBaseFile);
+  }
+
+  protected HoodieFileGroup addBootstrapBaseFileIfPresent(HoodieFileGroup fileGroup, Function> bootstrapBaseFileMappingFunc) {
     boolean hasBootstrapBaseFile = fileGroup.getAllFileSlices()
         .anyMatch(fs -> fs.getBaseInstantTime().equals(METADATA_BOOTSTRAP_INSTANT_TS));
     if (hasBootstrapBaseFile) {
       HoodieFileGroup newFileGroup = new HoodieFileGroup(fileGroup);
       newFileGroup.getAllFileSlices().filter(fs -> fs.getBaseInstantTime().equals(METADATA_BOOTSTRAP_INSTANT_TS))
           .forEach(fs -> fs.setBaseFile(
-              addBootstrapBaseFileIfPresent(fs.getFileGroupId(), fs.getBaseFile().get())));
+              addBootstrapBaseFileIfPresent(fs.getFileGroupId(), fs.getBaseFile().get(), bootstrapBaseFileMappingFunc)));
       return newFileGroup;
     }
     return fileGroup;
   }

   protected FileSlice addBootstrapBaseFileIfPresent(FileSlice fileSlice) {
+    return addBootstrapBaseFileIfPresent(fileSlice, this::getBootstrapBaseFile);
+  }
+
+  protected FileSlice addBootstrapBaseFileIfPresent(FileSlice fileSlice, Function> bootstrapBaseFileMappingFunc) {
     if (fileSlice.getBaseInstantTime().equals(METADATA_BOOTSTRAP_INSTANT_TS)) {
       FileSlice copy = new FileSlice(fileSlice);
       copy.getBaseFile().ifPresent(dataFile -> {
         Option edf = getBootstrapBaseFile(copy.getFileGroupId());
```

Review Comment: Oops, this line is useless now, can you fix it?
Re: [PR] [HUDI-7034] Refresh index fix - remove cached file slices within part… [hudi]
danny0405 commented on PR #10151:
URL: https://github.com/apache/hudi/pull/10151#issuecomment-1823886771

Still got some compile issues:

```scala
Error: /home/runner/work/hudi/hudi/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieFileIndex.scala:249: error: overloaded method constructor HoodieBaseFile with alternatives:
Error:   (x$1: String)org.apache.hudi.common.model.HoodieBaseFile
Error:   (x$1: org.apache.hadoop.fs.FileStatus)org.apache.hudi.common.model.HoodieBaseFile
Error:   (x$1: org.apache.hudi.common.model.HoodieBaseFile)org.apache.hudi.common.model.HoodieBaseFile
Error: cannot be applied to (org.apache.spark.sql.execution.datasources.FileStatusWithMetadata)
Error:   files.flatMap(_.files).map(new HoodieBaseFile(_)).map(_.getCommitTime).distinct
```
Re: [I] [SUPPORT] Flink SQL client cow table query error "org/apache/parquet/column/ColumnDescriptor" (but mor table query normal) [hudi]
xiaolan-bit commented on issue #6297:
URL: https://github.com/apache/hudi/issues/6297#issuecomment-1823881483

Could the parquet jar be the problem? My Flink version is 1.17.1, but the parquet version is 1.13.0.
Re: [I] [SUPPORT] Flink SQL client cow table query error "org/apache/parquet/column/ColumnDescriptor" (but mor table query normal) [hudi]
xiaolan-bit commented on issue #6297:
URL: https://github.com/apache/hudi/issues/6297#issuecomment-1823880643

When I use `select *`, an error appears:

```
java.lang.LinkageError: org/apache/parquet/column/ColumnDescriptor
    at org.apache.flink.formats.parquet.vector.reader.AbstractColumnReader.<init>(AbstractColumnReader.java:108)
    at org.apache.flink.formats.parquet.vector.reader.BytesColumnReader.<init>(BytesColumnReader.java:35)
    at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createColumnReader(ParquetSplitReaderUtil.java:364)
    at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createColumnReader(ParquetSplitReaderUtil.java:329)
    at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.readNextRowGroup(ParquetColumnarRowSplitReader.java:334)
    at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.nextBatch(ParquetColumnarRowSplitReader.java:310)
    at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.ensureBatch(ParquetColumnarRowSplitReader.java:292)
    at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.reachedEnd(ParquetColumnarRowSplitReader.java:271)
    at org.apache.hudi.table.format.ParquetSplitRecordIterator.hasNext(ParquetSplitRecordIterator.java:42)
    at org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.reachedEnd(CopyOnWriteInputFormat.java:283)
    at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:89)
    at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
    at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
    at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:333)
```
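A `LinkageError` on `org/apache/parquet/column/ColumnDescriptor` usually means the class is visible from two conflicting jars on the classpath (for example, a standalone parquet jar next to a Hudi/Flink bundle that ships its own copy). As a diagnostic aid, here is a minimal sketch (not part of the thread, and assuming nothing about the actual deployment) that scans a lib directory and reports every jar bundling a given class file; more than one hit is the usual culprit:

```python
import glob
import os
import zipfile


def jars_containing(class_file: str, lib_dir: str) -> list:
    """Return names of all jars in lib_dir that contain class_file.

    Multiple hits for the same class typically explain a LinkageError
    like the one in the stack trace above.
    """
    hits = []
    for jar in sorted(glob.glob(os.path.join(lib_dir, "*.jar"))):
        try:
            with zipfile.ZipFile(jar) as zf:
                if class_file in zf.namelist():
                    hits.append(os.path.basename(jar))
        except zipfile.BadZipFile:
            continue  # skip unreadable entries instead of aborting the scan
    return hits


# Example (hypothetical path):
# jars_containing("org/apache/parquet/column/ColumnDescriptor.class", "/opt/flink/lib")
```

If two jars are reported, removing or relocating one of them (so only a single parquet copy remains visible) is the usual fix for this class of error.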
Re: [I] [SUPPORT] Flink SQL client cow table query error "org/apache/parquet/column/ColumnDescriptor" (but mor table query normal) [hudi]
xiaolan-bit commented on issue #6297:
URL: https://github.com/apache/hudi/issues/6297#issuecomment-1823878299

How can I solve this problem? Should I add or replace some jar?
Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]
hudi-bot commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1823878156

## CI report:

* fc27baa8c2df9135bc6e4b0d14e50a127ecb434f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21100)
* 35fed0de0587b411f9470e1c69db43501df5a725 UNKNOWN
Re: [I] [SUPPORT] The INSERT records are marked as UPDATE [hudi]
zdl1 commented on issue #10156:
URL: https://github.com/apache/hudi/issues/10156#issuecomment-1823873879

> there is no way to figure out whether a key has been written to an existing bucket before, except the first file slice, so all the records are updates.

Thanks for the explanation, it really makes sense. I am wondering whether there is a method to get the real number of current records after some delta_commits. I used numberInserts, numberUpdates and numberDeletes to count the records before, but it seems that approach doesn't work.
Re: [PR] [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync [hudi]
hudi-bot commented on PR #10158:
URL: https://github.com/apache/hudi/pull/10158#issuecomment-1823873637

## CI report:

* 032ad417971148eec41a5d41066b37d238ecf70a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21114)
* 08794fc20eeb2736265520d170d0ee64794a842e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21120)
Re: [PR] [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync [hudi]
hudi-bot commented on PR #10158:
URL: https://github.com/apache/hudi/pull/10158#issuecomment-1823868662

## CI report:

* 032ad417971148eec41a5d41066b37d238ecf70a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21114)
* 08794fc20eeb2736265520d170d0ee64794a842e UNKNOWN
Re: [PR] [HUDI-7135] Spark reads hudi table error when flink creates the table without pre… [hudi]
zhangyue19921010 commented on code in PR #10157:
URL: https://github.com/apache/hudi/pull/10157#discussion_r1402949484

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieHiveCatalog.java:

```java
@@ -510,6 +511,9 @@ private void initTableIfNotExists(ObjectPath tablePath, CatalogTable catalogTabl
     }

     flinkConf.setString(FlinkOptions.TABLE_NAME, tablePath.getObjectName());
+
+    StreamerUtil.checkPreCombineKey(flinkConf, ((ResolvedCatalogTable) catalogTable).getResolvedSchema());
```

Review Comment: [screenshot: https://github.com/apache/hudi/assets/69956021/cf1e72d6-07ce-4c86-b1a2-3e5a07b556f5] Looks like a UT failure is related to this change. Please take a look.

Review Comment (same hunk): Also, it is better to modify/create related UTs to check the precombine field in hoodie.properties.
Re: [PR] [HUDI-7086] Scaling gcs event source [hudi]
hudi-bot commented on PR #10073:
URL: https://github.com/apache/hudi/pull/10073#issuecomment-1823839975

## CI report:

* 91b8b5ff8242d5fa0f01fc78ba55f70d458e58c9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)
* 868ba59ecf1a08d7b73a7121429103c2134b291f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21119)
Re: [PR] [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync [hudi]
nsivabalan commented on code in PR #10158:
URL: https://github.com/apache/hudi/pull/10158#discussion_r1402927461

## hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java:

```java
@@ -801,24 +765,25 @@ private HoodieWriteConfig prepareHoodieConfigForRowWriter(Schema writerSchema) {
    *
    * @param instantTime instant time to use for ingest.
    * @param inputBatch input batch that contains the records, checkpoint, and schema provider
-   * @param inputIsEmpty true if input batch is empty.
    * @param metrics Metrics
    * @param overallTimerContext Timer Context
    * @return Option Compaction instant if one is scheduled
    */
-  private Pair, JavaRDD> writeToSinkAndDoMetaSync(String instantTime, InputBatch inputBatch, boolean inputIsEmpty,
+  private Pair, JavaRDD> writeToSinkAndDoMetaSync(String instantTime, InputBatch inputBatch,
                                                   HoodieIngestionMetrics metrics,
                                                   Timer.Context overallTimerContext) {
     Option scheduledCompactionInstant = Option.empty();
     // write to hudi and fetch result
-    Pair writeClientWriteResultIsEmptyPair = writeToSink(inputBatch, instantTime, inputIsEmpty);
-    JavaRDD writeStatusRDD = writeClientWriteResultIsEmptyPair.getKey().getWriteStatusRDD();
-    Map> partitionToReplacedFileIds = writeClientWriteResultIsEmptyPair.getKey().getPartitionToReplacedFileIds();
-    boolean isEmpty = writeClientWriteResultIsEmptyPair.getRight();
+    WriteClientWriteResult writeClientWriteResult = writeToSink(inputBatch, instantTime);
+    JavaRDD writeStatusRDD = writeClientWriteResult.getWriteStatusRDD();
+    Map> partitionToReplacedFileIds = writeClientWriteResult.getPartitionToReplacedFileIds();

     // process write status
     long totalErrorRecords = writeStatusRDD.mapToDouble(WriteStatus::getTotalErrorRecords).sum().longValue();
     long totalRecords = writeStatusRDD.mapToDouble(WriteStatus::getTotalRecords).sum().longValue();
+    long totalSuccessfulRecords = totalRecords - totalErrorRecords;
+    LOG.info(String.format("instantTime=%s, totalRecords=%d, totalErrorRecords=%d, totalSuccessfulRecords=%d",
```

Review Comment: Can we have an explicit log statement when there are no records ingested but we had to trigger the commit just for checkpoint purposes? I guess prior to this patch, we would print "No new data, perform empty commit." Maybe something similar.
Re: [PR] [HUDI-7086] Scaling gcs event source [hudi]
hudi-bot commented on PR #10073:
URL: https://github.com/apache/hudi/pull/10073#issuecomment-1823835454

## CI report:

* 91b8b5ff8242d5fa0f01fc78ba55f70d458e58c9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)
* 868ba59ecf1a08d7b73a7121429103c2134b291f UNKNOWN
(hudi) branch master updated: [HUDI-7120] Performance improvements in deltastreamer executor code path (#10135)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:

     new b77eff2522a  [HUDI-7120] Performance improvements in deltastreamer executor code path (#10135)

b77eff2522a is described below

commit b77eff2522a975b0c456332d20eaea6eed882774
Author: Lokesh Jain
AuthorDate: Thu Nov 23 10:47:40 2023 +0530

    [HUDI-7120] Performance improvements in deltastreamer executor code path (#10135)
---
 .../hudi/io/HoodieKeyLocationFetchHandle.java      |   4 +-
 .../org/apache/hudi/AvroConversionUtils.scala      |   9 +
 .../java/org/apache/hudi/avro/AvroSchemaUtils.java |  22 +-
 .../java/org/apache/hudi/avro/HoodieAvroUtils.java |  58 +++--
 .../java/org/apache/hudi/common/fs/FSUtils.java    |   9 +-
 .../org/apache/hudi/TestAvroConversionUtils.scala  | 248 +++--
 6 files changed, 186 insertions(+), 164 deletions(-)

```diff
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieKeyLocationFetchHandle.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieKeyLocationFetchHandle.java
index 135e4866cc5..ab41a94c2a9 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieKeyLocationFetchHandle.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieKeyLocationFetchHandle.java
@@ -62,9 +62,11 @@ public class HoodieKeyLocationFetchHandle extends HoodieReadHandle
   public Stream locations() {
     HoodieBaseFile baseFile = partitionPathBaseFilePair.getRight();
+    String commitTime = baseFile.getCommitTime();
+    String fileId = baseFile.getFileId();
     return fetchRecordKeysWithPositions(baseFile).stream()
         .map(entry -> Pair.of(entry.getLeft(),
-            new HoodieRecordLocation(baseFile.getCommitTime(), baseFile.getFileId(), entry.getRight())));
+            new HoodieRecordLocation(commitTime, fileId, entry.getRight())));
   }

   public Stream globalLocations() {
```

```diff
diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
index 818bf760047..d84679eaf92 100644
--- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
+++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
@@ -18,6 +18,7 @@
 package org.apache.hudi

+import org.apache.avro.Schema.Type
 import org.apache.avro.generic.GenericRecord
 import org.apache.avro.{JsonProperties, Schema}
 import org.apache.hudi.HoodieSparkUtils.sparkAdapter
@@ -242,4 +243,12 @@ object AvroConversionUtils {
     val nameParts = qualifiedName.split('.')
     (nameParts.last, nameParts.init.mkString("."))
   }
+
+  private def handleUnion(schema: Schema): Schema = {
+    if (schema.getType == Type.UNION) {
+      val index = if (schema.getTypes.get(0).getType == Schema.Type.NULL) 1 else 0
+      return schema.getTypes.get(index)
+    }
+    schema
+  }
 }
```

```diff
diff --git a/hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java b/hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java
index fcfc8a4f0b9..3c5486c47c7 100644
--- a/hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java
@@ -249,6 +249,11 @@ public class AvroSchemaUtils {
     }

     List innerTypes = schema.getTypes();
+    if (innerTypes.size() == 2 && isNullable(schema)) {
+      // this is a basic nullable field so handle it more efficiently
+      return resolveNullableSchema(schema);
+    }
+
     Schema nonNullType =
         innerTypes.stream()
             .filter(it -> it.getType() != Schema.Type.NULL && Objects.equals(it.getFullName(), fieldSchemaFullName))
@@ -286,18 +291,19 @@ public class AvroSchemaUtils {
     }

     List innerTypes = schema.getTypes();
-    Schema nonNullType =
-        innerTypes.stream()
-            .filter(it -> it.getType() != Schema.Type.NULL)
-            .findFirst()
-            .orElse(null);
-    if (innerTypes.size() != 2 || nonNullType == null) {
+    if (innerTypes.size() != 2) {
       throw new AvroRuntimeException(
           String.format("Unsupported Avro UNION type %s: Only UNION of a null type and a non-null type is supported", schema));
     }
-
-    return nonNullType;
+    Schema firstInnerType = innerTypes.get(0);
+    Schema secondInnerType = innerTypes.get(1);
+    if ((firstInnerType.getType() != Schema.Type.NULL && secondInnerType.getType() != Schema.Type.NULL)
+        || (firstInnerType.getType() == Schema.Type.NULL && secondInnerType.getType() == Schema.Type.NULL)) {
+      throw new AvroRuntimeException(
+          String.format("Unsupported Avro UNION type %s: Only UNION of a null type and a non-null type is supported",
```

(The message is truncated at this point in the digest.)
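The reworked `resolveNullableSchema` branch in the commit enforces that a supported UNION has exactly two branches, exactly one of which is null, and returns the non-null branch. A minimal Python sketch of that rule (not the Hudi code itself; schemas are modeled as plain type-name strings instead of Avro `Schema` objects):

```python
def resolve_nullable_schema(union_types):
    """Return the non-null branch of a two-branch nullable union.

    Mirrors the validation in the patch: anything other than exactly one
    "null" branch plus one non-null branch is rejected.
    """
    message = (
        "Unsupported UNION %s: only a null type plus a non-null type is supported"
        % (union_types,)
    )
    if len(union_types) != 2:
        raise ValueError(message)
    first, second = union_types
    if (first == "null") == (second == "null"):
        # both branches null, or neither null: not a simple nullable field
        raise ValueError(message)
    return second if first == "null" else first
```

For example, `["null", "string"]` and `["string", "null"]` both resolve to `"string"`, while `["int", "string"]` is rejected, matching the error message in the Java code above.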
Re: [PR] [HUDI-7120] Performance improvements in deltastreamer executor code path [hudi]
nsivabalan merged PR #10135:
URL: https://github.com/apache/hudi/pull/10135
(hudi) branch master updated: [MINOR] Making misc fixes to deltastreamer sources(S3 and GCS) (#10095)
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 405be173664  [MINOR] Making misc fixes to deltastreamer sources(S3 and GCS) (#10095)
405be173664 is described below

commit 405be173664b724ca941194136a5b5dcff4bb598
Author: Sivabalan Narayanan
AuthorDate: Wed Nov 22 21:00:33 2023 -0800

    [MINOR] Making misc fixes to deltastreamer sources(S3 and GCS) (#10095)

    * Making misc fixes to deltastreamer sources

    * Fixing test failures

    * adding inference to CloudSourceConfig... cloud.data.datafile.format

    * Fix the tests for s3 events source

    Co-authored-by: rmahindra123
---
 .../main/java/org/apache/hudi/common/util/StringUtils.java   | 10 ++
 .../java/org/apache/hudi/common/util/TestStringUtils.java    |  7 +++
 .../org/apache/hudi/utilities/config/CloudSourceConfig.java  |  2 +-
 .../apache/hudi/utilities/schema/SchemaRegistryProvider.java | 11 +-
 .../hudi/utilities/sources/S3EventsHoodieIncrSource.java     | 11 ++-
 5 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/util/StringUtils.java b/hudi-common/src/main/java/org/apache/hudi/common/util/StringUtils.java
index d7d79796aec..5b95bc60312 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/util/StringUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/util/StringUtils.java
@@ -173,4 +173,14 @@ public class StringUtils {
     }
     return input.substring(0, i);
   }
+
+  public static String truncate(String str, int headLength, int tailLength) {
+    if (isNullOrEmpty(str) || str.length() <= headLength + tailLength) {
+      return str;
+    }
+    String head = str.substring(0, headLength);
+    String tail = str.substring(str.length() - tailLength);
+
+    return head + "..." + tail;
+  }
 }

diff --git a/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java b/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java
index 3bdf6d48b39..54985056bf0 100644
--- a/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java
+++ b/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java
@@ -114,4 +114,11 @@ public class TestStringUtils {
     }
     return sb.toString();
   }
+
+  @Test
+  public void testTruncate() {
+    assertNull(StringUtils.truncate(null, 10, 10));
+    assertEquals("http://use...ons/latest", StringUtils.truncate("http://username:passw...@myregistry.com:5000/versions/latest", 10, 10));
+    assertEquals("http://abc.com", StringUtils.truncate("http://abc.com", 10, 10));
+  }
 }

diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/config/CloudSourceConfig.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/config/CloudSourceConfig.java
index e7b44cf9121..007d36fc704 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/config/CloudSourceConfig.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/config/CloudSourceConfig.java
@@ -108,7 +108,7 @@ public class CloudSourceConfig extends HoodieConfig {
   public static final ConfigProperty DATAFILE_FORMAT = ConfigProperty
       .key(STREAMER_CONFIG_PREFIX + "source.cloud.data.datafile.format")
-      .defaultValue("parquet")
+      .defaultValue(HoodieIncrSourceConfig.SOURCE_FILE_FORMAT.defaultValue())
       .withAlternatives(DELTA_STREAMER_CONFIG_PREFIX + "source.cloud.data.datafile.format")
       .markAdvanced()
       .withDocumentation("Format of the data file. By default, this will be the same as hoodie.streamer.source.hoodieincr.file.format");

diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/SchemaRegistryProvider.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/SchemaRegistryProvider.java
index 780fbb9dc0a..110c8cc2fb1 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/SchemaRegistryProvider.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/SchemaRegistryProvider.java
@@ -195,7 +195,10 @@ public class SchemaRegistryProvider extends SchemaProvider {
     try {
       return parseSchemaFromRegistry(registryUrl);
     } catch (Exception e) {
-      throw new HoodieSchemaFetchException("Error reading source schema from registry :" + registryUrl, e);
+      throw new HoodieSchemaFetchException(String.format(
+          "Error reading source schema from registry. Please check %s is configured correctly. Truncated URL: %s",
+          Config.SRC_SCHEMA_REGISTRY_URL_PROP,
+          StringUtils.truncate(registryUrl, 10, 10)), e);
     }
   }

@@ -207,7 +210,11 @@ public class SchemaRegistryProvider extends SchemaProvider {
     try {
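The `StringUtils.truncate(str, headLength, tailLength)` helper introduced by this commit is small enough to sketch in isolation. Below is a hedged re-implementation (the class name `TruncateSketch` and the example URL are ours, not Hudi's) showing why the schema-registry error path uses it: a URL that may carry credentials is shortened to its first and last few characters before being put in an exception message.

```java
// Standalone sketch of the truncate(str, headLength, tailLength) helper added
// by this commit. TruncateSketch is an illustrative name, not a Hudi class.
public class TruncateSketch {

    // Keep the first headLength and last tailLength characters, joined by
    // "...", whenever the input is long enough to need shortening.
    public static String truncate(String str, int headLength, int tailLength) {
        if (str == null || str.isEmpty() || str.length() <= headLength + tailLength) {
            return str;
        }
        String head = str.substring(0, headLength);
        String tail = str.substring(str.length() - tailLength);
        return head + "..." + tail;
    }

    public static void main(String[] args) {
        // A registry URL carrying credentials (example URL is made up) is
        // shortened so secrets never appear in full in an exception message.
        System.out.println(truncate("http://user:secret@registry.example.com:8081/subjects/latest", 10, 10));
        // Inputs no longer than head + tail are returned unchanged.
        System.out.println(truncate("http://abc.com", 10, 10));
    }
}
```

This mirrors the change in `SchemaRegistryProvider`, which now logs `StringUtils.truncate(registryUrl, 10, 10)` instead of the raw registry URL.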
Re: [PR] [MINOR] Making misc fixes to deltastreamer sources(S3 and GCS) [hudi]
codope merged PR #10095: URL: https://github.com/apache/hudi/pull/10095 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]
danny0405 commented on PR #10002: URL: https://github.com/apache/hudi/pull/10002#issuecomment-1823820723

Thanks for the contribution, I have reviewed and created a patch: [7041.patch.zip](https://github.com/apache/hudi/files/13446123/7041.patch.zip)
(hudi) branch master updated (72ff9a7f0c9 -> 3d212853724)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

    from 72ff9a7f0c9  [HUDI-7052] Fix partition key validation for custom key generators. (#10014)
     add 3d212853724  [HUDI-7112] Reuse existing timeline server and performance improvements (#10122)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/client/BaseHoodieClient.java    |   2 +-
 .../embedded/EmbeddedTimelineServerHelper.java      |  38 +
 .../client/embedded/EmbeddedTimelineService.java    | 172 +--
 .../org/apache/hudi/config/HoodieWriteConfig.java   |   4 +-
 .../marker/TimelineServerBasedWriteMarkers.java     |  13 +-
 .../org/apache/hudi/util/HttpRequestClient.java     |  12 +-
 .../embedded/TestEmbeddedTimelineService.java       | 189 +
 .../client/TestHoodieJavaWriteClientInsert.java     |   6 +-
 .../hudi/client/TestHoodieClientMultiWriter.java    |  35 +++-
 .../hudi/client/TestSparkRDDWriteClient.java        |   6 +-
 .../TestRemoteFileSystemViewWithMetadataTable.java  |  42 +++--
 hudi-common/pom.xml                                 |   4 +
 .../hudi/common/table/timeline/dto/DTOUtils.java    |   4 +-
 .../view/RemoteHoodieTableFileSystemView.java       |  70
 .../org/apache/hudi/sink/TestWriteCopyOnWrite.java  |  89 ++
 .../hudi/sink/TestWriteMergeOnReadWithCompact.java  |   8 +
 .../org/apache/hudi/sink/utils/TestWriteBase.java   |   6 +-
 .../org/apache/hudi/HoodieSparkSqlWriter.scala      |   1 +
 .../hudi/timeline/service/RequestHandler.java       |   5 +-
 .../hudi/timeline/service/TimelineService.java      |   8 +-
 .../timeline/service/handlers/BaseFileHandler.java  |  11 +-
 .../service/handlers/marker/MarkerDirState.java     |   3 +-
 .../apache/hudi/utilities/streamer/StreamSync.java  |   2 +-
 .../deltastreamer/TestHoodieDeltaStreamer.java      |   1 -
 pom.xml                                             |   8 +
 25 files changed, 566 insertions(+), 173 deletions(-)
 create mode 100644 hudi-client/hudi-client-common/src/test/java/org/apache/hudi/client/embedded/TestEmbeddedTimelineService.java
Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]
nsivabalan merged PR #10122: URL: https://github.com/apache/hudi/pull/10122
Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]
nsivabalan commented on PR #10122: URL: https://github.com/apache/hudi/pull/10122#issuecomment-1823818490

(Attached screenshot: https://github.com/apache/hudi/assets/513218/43a50fef-afef-4a80-b54a-75d5fe1260d3)
(hudi) branch master updated: [HUDI-7052] Fix partition key validation for custom key generators. (#10014)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 72ff9a7f0c9  [HUDI-7052] Fix partition key validation for custom key generators. (#10014)
72ff9a7f0c9 is described below

commit 72ff9a7f0c9a7da12810669ca0111761ee7adcfe
Author: Rajesh Mahindra <76502047+rmahindra...@users.noreply.github.com>
AuthorDate: Wed Nov 22 20:49:15 2023 -0800

    [HUDI-7052] Fix partition key validation for custom key generators. (#10014)

    Co-authored-by: rmahindra123
---
 .../AutoRecordGenWrapperAvroKeyGenerator.java      | 27 +---
 .../hudi/keygen/AutoRecordKeyGeneratorWrapper.java | 32 +++
 .../keygen/AutoRecordGenWrapperKeyGenerator.java   | 48 ++
 .../org/apache/hudi/util/SparkKeyGenUtils.scala    | 31 --
 .../org/apache/hudi/HoodieSparkSqlWriter.scala     |  4 +-
 .../scala/org/apache/hudi/HoodieWriterUtils.scala  |  5 ++-
 .../org/apache/hudi/TestHoodieSparkSqlWriter.scala |  2 +-
 .../apache/hudi/functional/TestCOWDataSource.scala |  3 +-
 .../deltastreamer/TestHoodieDeltaStreamer.java     |  6 +--
 9 files changed, 112 insertions(+), 46 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/AutoRecordGenWrapperAvroKeyGenerator.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/AutoRecordGenWrapperAvroKeyGenerator.java
index a8ae48e1d67..8431180a2fe 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/AutoRecordGenWrapperAvroKeyGenerator.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/AutoRecordGenWrapperAvroKeyGenerator.java
@@ -43,24 +43,24 @@ import java.util.List;
  * PartitionId refers to spark's partition Id.
  * RowId refers to the row index within the spark partition.
  */
-public class AutoRecordGenWrapperAvroKeyGenerator extends BaseKeyGenerator {
+public class AutoRecordGenWrapperAvroKeyGenerator extends BaseKeyGenerator implements AutoRecordKeyGeneratorWrapper {

   private final BaseKeyGenerator keyGenerator;
-  private final int partitionId;
-  private final String instantTime;
+  private Integer partitionId;
+  private String instantTime;
   private int rowId;

   public AutoRecordGenWrapperAvroKeyGenerator(TypedProperties config, BaseKeyGenerator keyGenerator) {
     super(config);
     this.keyGenerator = keyGenerator;
     this.rowId = 0;
-    this.partitionId = config.getInteger(KeyGenUtils.RECORD_KEY_GEN_PARTITION_ID_CONFIG);
-    this.instantTime = config.getString(KeyGenUtils.RECORD_KEY_GEN_INSTANT_TIME_CONFIG);
+    partitionId = null;
+    instantTime = null;
   }

   @Override
   public String getRecordKey(GenericRecord record) {
-    return HoodieRecord.generateSequenceId(instantTime, partitionId, rowId++);
+    return generateSequenceId(rowId++);
   }

   @Override
@@ -80,4 +80,19 @@ public class AutoRecordGenWrapperAvroKeyGenerator extends BaseKeyGenerator {
   public boolean isConsistentLogicalTimestampEnabled() {
     return keyGenerator.isConsistentLogicalTimestampEnabled();
   }
+
+  @Override
+  public BaseKeyGenerator getPartitionKeyGenerator() {
+    return keyGenerator;
+  }
+
+  private String generateSequenceId(long recordIndex) {
+    if (partitionId == null) {
+      this.partitionId = config.getInteger(KeyGenUtils.RECORD_KEY_GEN_PARTITION_ID_CONFIG);
+    }
+    if (instantTime == null) {
+      this.instantTime = config.getString(KeyGenUtils.RECORD_KEY_GEN_INSTANT_TIME_CONFIG);
+    }
+    return HoodieRecord.generateSequenceId(instantTime, partitionId, recordIndex);
+  }
 }

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/AutoRecordKeyGeneratorWrapper.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/AutoRecordKeyGeneratorWrapper.java
new file mode 100644
index 000..e136bc89cbb
--- /dev/null
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/AutoRecordKeyGeneratorWrapper.java
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and
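The heart of the diff above is deferring the reads of `RECORD_KEY_GEN_PARTITION_ID_CONFIG` and `RECORD_KEY_GEN_INSTANT_TIME_CONFIG` from the constructor to the first key generation, so the wrapper can be constructed before those properties are populated. A minimal sketch of that lazy-initialization pattern follows; the class name `SequenceIdGen`, the `Map` stand-in for `TypedProperties`, and the property names are illustrative assumptions, not Hudi APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of lazy config resolution: nothing is read at construction time,
// so the object can be built before "partitionId"/"instantTime" are set.
public class SequenceIdGen {
    private final Map<String, String> config; // stand-in for TypedProperties
    private Integer partitionId;              // resolved on first use
    private String instantTime;               // resolved on first use
    private int rowId = 0;

    public SequenceIdGen(Map<String, String> config) {
        this.config = config;                 // no config reads here
    }

    public String nextRecordKey() {
        if (partitionId == null) {
            partitionId = Integer.valueOf(config.get("partitionId"));
        }
        if (instantTime == null) {
            instantTime = config.get("instantTime");
        }
        // Roughly mirrors HoodieRecord.generateSequenceId(instantTime, partitionId, rowId++)
        return instantTime + "_" + partitionId + "_" + (rowId++);
    }

    public static void main(String[] args) {
        Map<String, String> cfg = new HashMap<>();
        SequenceIdGen gen = new SequenceIdGen(cfg); // safe: nothing read yet
        cfg.put("partitionId", "3");                // populated after construction
        cfg.put("instantTime", "20231122210033");
        System.out.println(gen.nextRecordKey());    // 20231122210033_3_0
        System.out.println(gen.nextRecordKey());    // 20231122210033_3_1
    }
}
```

With the eager version, constructing the generator before the two properties were set would fail; the lazy version only fails if a key is actually requested before they exist.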
Re: [PR] [HUDI-7052] Fix partition key validation for custom key generators. [hudi]
nsivabalan merged PR #10014: URL: https://github.com/apache/hudi/pull/10014
Re: [PR] [HUDI-7052] Fix partition key validation for custom key generators. [hudi]
nsivabalan commented on PR #10014: URL: https://github.com/apache/hudi/pull/10014#issuecomment-1823817928

(Attached screenshot: https://github.com/apache/hudi/assets/513218/f0efc544-a78a-4ee3-bed7-f403aea335fb)
Re: [PR] [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync [hudi]
hudi-bot commented on PR #10158: URL: https://github.com/apache/hudi/pull/10158#issuecomment-1823807872

## CI report:

* c8c49d513c8b91b2ff8462f6db25203ba563d39a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21102)
* 032ad417971148eec41a5d41066b37d238ecf70a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21114)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync [hudi]
hudi-bot commented on PR #10158: URL: https://github.com/apache/hudi/pull/10158#issuecomment-1823804491

## CI report:

* c8c49d513c8b91b2ff8462f6db25203ba563d39a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21102)
* 032ad417971148eec41a5d41066b37d238ecf70a UNKNOWN
Re: [I] [SUPPORT] Spark job stuck after completion, due to some non daemon threads still running [hudi]
zyclove commented on issue #9826: URL: https://github.com/apache/hudi/issues/9826#issuecomment-1823781426

Hi, this issue occurs frequently, has it been resolved? As https://issues.apache.org/jira/browse/HUDI-6980 is not closed. When will version 0.14.1 be released? There is an urgent need to upgrade, including other issues. @ad1happy2go @pravin1406
Re: [PR] [HUDI-7086] Scaling gcs event source [hudi]
hudi-bot commented on PR #10073: URL: https://github.com/apache/hudi/pull/10073#issuecomment-1823776282

## CI report:

* 91b8b5ff8242d5fa0f01fc78ba55f70d458e58c9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)
Re: [I] Cannot encode decimal with precision 15 as max precision 14 [hudi]
njalan closed issue #10160: Cannot encode decimal with precision 15 as max precision 14
URL: https://github.com/apache/hudi/issues/10160
Re: [PR] [HUDI-7110] Add call procedure for show column stats information [hudi]
majian1998 commented on code in PR #10120: URL: https://github.com/apache/hudi/pull/10120#discussion_r1402861158

##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowMetadataTableColumnStatsProcedure.scala:
##########

@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model._
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.FileSlice
+import org.apache.hudi.common.table.timeline.{HoodieDefaultTimeline, HoodieInstant}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.metadata.HoodieTableMetadata
+import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
+
+import java.util
+import java.util.function.{Function, Supplier}
+import scala.collection.{JavaConversions, mutable}
+import scala.jdk.CollectionConverters.{asScalaBufferConverter, asScalaIteratorConverter}
+
+
+class ShowMetadataTableColumnStatsProcedure extends BaseProcedure with ProcedureBuilder with Logging {
+  private val PARAMETERS = Array[ProcedureParameter](
+    ProcedureParameter.required(0, "table", DataTypes.StringType),
+    ProcedureParameter.optional(1, "partition", DataTypes.StringType),
+    ProcedureParameter.optional(2, "targetColumns", DataTypes.StringType)
+  )
+
+  private val OUTPUT_TYPE = new StructType(Array[StructField](
+    StructField("file_name", DataTypes.StringType, nullable = true, Metadata.empty),
+    StructField("column_name", DataTypes.StringType, nullable = true, Metadata.empty),
+    StructField("min_value", DataTypes.StringType, nullable = true, Metadata.empty),
+    StructField("max_value", DataTypes.StringType, nullable = true, Metadata.empty),
+    StructField("null_num", DataTypes.LongType, nullable = true, Metadata.empty)
+  ))
+
+  def parameters: Array[ProcedureParameter] = PARAMETERS
+
+  def outputType: StructType = OUTPUT_TYPE
+
+  override def call(args: ProcedureArgs): Seq[Row] = {
+    super.checkArgs(PARAMETERS, args)
+
+    val table = getArgValueOrDefault(args, PARAMETERS(0))
+    val partitions = getArgValueOrDefault(args, PARAMETERS(1)).getOrElse("").toString
+    val partitionsSeq = partitions.split(",").filter(_.nonEmpty).toSeq
+
+    val targetColumns = getArgValueOrDefault(args, PARAMETERS(2)).getOrElse("").toString
+    val targetColumnsSeq = targetColumns.split(",").toSeq
+    val basePath = getBasePath(table)
+    val metadataConfig = HoodieMetadataConfig.newBuilder
+      .enable(true)
+      .build
+    val metaClient = HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build
+    val schemaUtil = new TableSchemaResolver(metaClient)
+    val schema = AvroConversionUtils.convertAvroSchemaToStructType(schemaUtil.getTableAvroSchema)
+    val columnStatsIndex = new ColumnStatsIndexSupport(spark, schema, metadataConfig, metaClient)
+    val colStatsRecords: HoodieData[HoodieMetadataColumnStats] = columnStatsIndex.loadColumnStatsIndexRecords(targetColumnsSeq, shouldReadInMemory = false)
+    val fsView = buildFileSystemView(table)
+    val allFileSlices: Set[FileSlice] = {
+      if (partitionsSeq.isEmpty) {
+        val engineCtx = new HoodieSparkEngineContext(jsc)
+        val metaTable = HoodieTableMetadata.create(engineCtx, metadataConfig, basePath)
+        metaTable.getAllPartitionPaths
+          .asScala
+          .flatMap(path => fsView.getLatestFileSlices(path).iterator().asScala)
+          .toSet
+      } else {
+        partitionsSeq
Re: [PR] [HUDI-7110] Add call procedure for show column stats information [hudi]
stream2000 commented on code in PR #10120: URL: https://github.com/apache/hudi/pull/10120#discussion_r1402852595

##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowMetadataTableColumnStatsProcedure.scala:
##########

@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model._
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.FileSlice
+import org.apache.hudi.common.table.timeline.{HoodieDefaultTimeline, HoodieInstant}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.metadata.HoodieTableMetadata
+import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
+
+import java.util
+import java.util.function.{Function, Supplier}
+import scala.collection.{JavaConversions, mutable}
+import scala.jdk.CollectionConverters.{asScalaBufferConverter, asScalaIteratorConverter}
+
+
+class ShowMetadataTableColumnStatsProcedure extends BaseProcedure with ProcedureBuilder with Logging {
+  private val PARAMETERS = Array[ProcedureParameter](
+    ProcedureParameter.required(0, "table", DataTypes.StringType),
+    ProcedureParameter.optional(1, "partition", DataTypes.StringType),
+    ProcedureParameter.optional(2, "targetColumns", DataTypes.StringType)
+  )
+
+  private val OUTPUT_TYPE = new StructType(Array[StructField](
+    StructField("file_name", DataTypes.StringType, nullable = true, Metadata.empty),
+    StructField("column_name", DataTypes.StringType, nullable = true, Metadata.empty),
+    StructField("min_value", DataTypes.StringType, nullable = true, Metadata.empty),
+    StructField("max_value", DataTypes.StringType, nullable = true, Metadata.empty),
+    StructField("null_num", DataTypes.LongType, nullable = true, Metadata.empty)
+  ))
+
+  def parameters: Array[ProcedureParameter] = PARAMETERS
+
+  def outputType: StructType = OUTPUT_TYPE
+
+  override def call(args: ProcedureArgs): Seq[Row] = {
+    super.checkArgs(PARAMETERS, args)
+
+    val table = getArgValueOrDefault(args, PARAMETERS(0))
+    val partitions = getArgValueOrDefault(args, PARAMETERS(1)).getOrElse("").toString
+    val partitionsSeq = partitions.split(",").filter(_.nonEmpty).toSeq
+
+    val targetColumns = getArgValueOrDefault(args, PARAMETERS(2)).getOrElse("").toString
+    val targetColumnsSeq = targetColumns.split(",").toSeq
+    val basePath = getBasePath(table)
+    val metadataConfig = HoodieMetadataConfig.newBuilder
+      .enable(true)
+      .build
+    val metaClient = HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build
+    val schemaUtil = new TableSchemaResolver(metaClient)
+    val schema = AvroConversionUtils.convertAvroSchemaToStructType(schemaUtil.getTableAvroSchema)
+    val columnStatsIndex = new ColumnStatsIndexSupport(spark, schema, metadataConfig, metaClient)

Review Comment:
We should use `org.apache.hudi.metadata.BaseTableMetadata#getColumnStats` to load column stats instead of calling columnStatsIndex directly.

##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowMetadataTableColumnStatsProcedure.scala:
##########

@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License,
[jira] [Closed] (HUDI-7110) Add call procedure for show column stats information
[ https://issues.apache.org/jira/browse/HUDI-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-7110.
----------------------------
    Resolution: Fixed

Fixed via master branch: 8d6d04387753662a5bb41f35874c6bbdd7021b36

> Add call procedure for show column stats information
> ----------------------------------------------------
>
>                 Key: HUDI-7110
>                 URL: https://issues.apache.org/jira/browse/HUDI-7110
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ma Jian
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
> This feature introduces a call procedure that allows users to specify the
> table name and column names to retrieve column stats information from the
> metadata table. This functionality facilitates the observation of data
> distribution status and assists in data skipping.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-7110) Add call procedure for show column stats information
[ https://issues.apache.org/jira/browse/HUDI-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7110:
-----------------------------
    Fix Version/s: 1.0.0

> Add call procedure for show column stats information
> ----------------------------------------------------
>
>                 Key: HUDI-7110
>                 URL: https://issues.apache.org/jira/browse/HUDI-7110
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ma Jian
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
> This feature introduces a call procedure that allows users to specify the
> table name and column names to retrieve column stats information from the
> metadata table. This functionality facilitates the observation of data
> distribution status and assists in data skipping.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
Re: [PR] [HUDI-7110] Add call procedure for show column stats information [hudi]
danny0405 merged PR #10120: URL: https://github.com/apache/hudi/pull/10120
(hudi) branch master updated: [HUDI-7110] Add call procedure for show column stats information (#10120)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 8d6d0438775  [HUDI-7110] Add call procedure for show column stats information (#10120)
8d6d0438775 is described below

commit 8d6d04387753662a5bb41f35874c6bbdd7021b36
Author: majian <47964462+majian1...@users.noreply.github.com>
AuthorDate: Thu Nov 23 10:08:17 2023 +0800

    [HUDI-7110] Add call procedure for show column stats information (#10120)
---
 .../org/apache/hudi/ColumnStatsIndexSupport.scala  |   2 +-
 .../hudi/command/procedures/HoodieProcedures.scala |   1 +
 .../ShowMetadataTableColumnStatsProcedure.scala    | 169 +
 .../sql/hudi/procedure/TestMetadataProcedure.scala |  66
 4 files changed, 237 insertions(+), 1 deletion(-)

diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala
index dd76aee2f18..9cdb15092b0 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala
@@ -309,7 +309,7 @@ class ColumnStatsIndexSupport(spark: SparkSession,
     colStatsDF.select(targetColumnStatsIndexColumns.map(col): _*)
   }

-  private def loadColumnStatsIndexRecords(targetColumns: Seq[String], shouldReadInMemory: Boolean): HoodieData[HoodieMetadataColumnStats] = {
+  def loadColumnStatsIndexRecords(targetColumns: Seq[String], shouldReadInMemory: Boolean): HoodieData[HoodieMetadataColumnStats] = {
     // Read Metadata Table's Column Stats Index records into [[HoodieData]] container by
     //    - Fetching the records from CSI by key-prefixes (encoded column names)
     //    - Extracting [[HoodieMetadataColumnStats]] records

diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
index ad63ddbb29e..1a960ecb8fd 100644
--- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
+++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
@@ -66,6 +66,7 @@ object HoodieProcedures {
       ,(ShowBootstrapPartitionsProcedure.NAME, ShowBootstrapPartitionsProcedure.builder)
       ,(UpgradeTableProcedure.NAME, UpgradeTableProcedure.builder)
       ,(DowngradeTableProcedure.NAME, DowngradeTableProcedure.builder)
+      ,(ShowMetadataTableColumnStatsProcedure.NAME, ShowMetadataTableColumnStatsProcedure.builder)
       ,(ShowMetadataTableFilesProcedure.NAME, ShowMetadataTableFilesProcedure.builder)
       ,(ShowMetadataTablePartitionsProcedure.NAME, ShowMetadataTablePartitionsProcedure.builder)
       ,(CreateMetadataTableProcedure.NAME, CreateMetadataTableProcedure.builder)

diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowMetadataTableColumnStatsProcedure.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowMetadataTableColumnStatsProcedure.scala
new file mode 100644
index 000..60aa0f054b9
--- /dev/null
+++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowMetadataTableColumnStatsProcedure.scala
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model._
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.fs.FSUtils
+import
Re: [I] Cannot encode decimal with precision 15 as max precision 14 [hudi]
njalan commented on issue #10160: URL: https://github.com/apache/hudi/issues/10160#issuecomment-1823733255 @ad1happy2go It is already merged in 0.13.1. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
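The error in this issue's title ("Cannot encode decimal with precision 15 as max precision 14") arises when a decimal value's precision exceeds the maximum declared by the table's decimal logical type. The sketch below is a hypothetical, simplified illustration of that check (it is not Hudi's actual writer code); it only models how a fixed precision/scale declaration rejects wider values.

```java
import java.math.BigDecimal;

public class DecimalPrecisionCheck {
    // Illustrative only: a decimal logical type declares a fixed
    // precision/scale; a value whose precision exceeds the declared maximum
    // cannot be encoded, producing an error like the one in this issue.
    static void encodeDecimal(BigDecimal value, int maxPrecision, int scale) {
        BigDecimal scaled = value.setScale(scale);
        if (scaled.precision() > maxPrecision) {
            throw new IllegalArgumentException(
                "Cannot encode decimal with precision " + scaled.precision()
                    + " as max precision " + maxPrecision);
        }
        // A real writer would serialize the unscaled bytes here.
    }

    public static void main(String[] args) {
        // Precision 14 fits a DECIMAL(14, 4) declaration.
        encodeDecimal(new BigDecimal("1234567890.1234"), 14, 4);
        try {
            // Precision 15 against max precision 14 is rejected.
            encodeDecimal(new BigDecimal("12345678901.1234"), 14, 4);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Widening the declared precision in the table schema (rather than the value) is the usual way out of this mismatch.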
Re: [PR] [MINOR] Making misc fixes to deltastreamer sources(S3 and GCS) [hudi]
hudi-bot commented on PR #10095: URL: https://github.com/apache/hudi/pull/10095#issuecomment-1823714576 ## CI report: * c1b5bd41ac1f4be476fb69f84f7197a27733eb23 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21110) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
(hudi) branch master updated: [MINOR] Remove unused import (#10159)
This is an automated email from the ASF dual-hosted git repository. leesf pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new aabaa9947fc [MINOR] Remove unused import (#10159) aabaa9947fc is described below commit aabaa9947fc0e6a72ed221f0889cad27423f8127 Author: huangxiaoping <1754789...@qq.com> AuthorDate: Thu Nov 23 09:06:45 2023 +0800 [MINOR] Remove unused import (#10159) --- .../org/apache/hudi/HoodiePartitionCDCFileGroupMapping.scala | 5 - .../src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala | 10 -- .../scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala | 1 - 3 files changed, 4 insertions(+), 12 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodiePartitionCDCFileGroupMapping.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodiePartitionCDCFileGroupMapping.scala index bd052c086ff..6ff9bd036e8 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodiePartitionCDCFileGroupMapping.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodiePartitionCDCFileGroupMapping.scala @@ -22,11 +22,6 @@ package org.apache.hudi import org.apache.hudi.common.model.HoodieFileGroupId import org.apache.hudi.common.table.cdc.HoodieCDCFileSplit import org.apache.spark.sql.catalyst.InternalRow -import org.apache.spark.sql.catalyst.util.{ArrayData, MapData} -import org.apache.spark.sql.types.{DataType, Decimal} -import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String} - -import java.util class HoodiePartitionCDCFileGroupMapping(partitionValues: InternalRow, fileGroups: Map[HoodieFileGroupId, List[HoodieCDCFileSplit]] diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala index cbde026adeb..01a73cd0816 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala @@ -27,11 +27,10 @@ import org.apache.hudi.DataSourceOptionsHelper.fetchMissingWriteConfigsFromTable import org.apache.hudi.DataSourceUtils.tryOverrideParquetWriteLegacyFormatProperty import org.apache.hudi.DataSourceWriteOptions._ import org.apache.hudi.HoodieConversionUtils.{toProperties, toScalaOption} -import org.apache.hudi.HoodieSparkSqlWriter.{CANONICALIZE_SCHEMA, SQL_MERGE_INTO_WRITES, StreamingWriteParams} +import org.apache.hudi.HoodieSparkSqlWriter.StreamingWriteParams import org.apache.hudi.HoodieWriterUtils._ -import org.apache.hudi.avro.AvroSchemaUtils.{isCompatibleProjectionOf, isSchemaCompatible, isValidEvolutionOf, resolveNullableSchema} +import org.apache.hudi.avro.AvroSchemaUtils.resolveNullableSchema import org.apache.hudi.avro.HoodieAvroUtils -import org.apache.hudi.avro.HoodieAvroUtils.removeMetadataFields import org.apache.hudi.client.common.HoodieSparkEngineContext import org.apache.hudi.client.{HoodieWriteResult, SparkRDDWriteClient} import org.apache.hudi.commit.{DatasetBulkInsertCommitActionExecutor, DatasetBulkInsertOverwriteCommitActionExecutor, DatasetBulkInsertOverwriteTableCommitActionExecutor} @@ -49,12 +48,11 @@ import org.apache.hudi.common.util.{CommitUtils, StringUtils, Option => HOption} import org.apache.hudi.config.HoodieBootstrapConfig.{BASE_PATH, INDEX_CLASS_NAME} import org.apache.hudi.config.HoodieWriteConfig.SPARK_SQL_MERGE_INTO_PREPPED_KEY import org.apache.hudi.config.{HoodieCompactionConfig, HoodieInternalConfig, HoodieWriteConfig} -import org.apache.hudi.exception.{HoodieException, HoodieWriteConflictException, SchemaCompatibilityException} +import 
org.apache.hudi.exception.{HoodieException, HoodieWriteConflictException} import org.apache.hudi.hive.{HiveSyncConfigHolder, HiveSyncTool} import org.apache.hudi.internal.schema.InternalSchema import org.apache.hudi.internal.schema.convert.AvroInternalSchemaConverter -import org.apache.hudi.internal.schema.utils.AvroSchemaEvolutionUtils.reconcileSchemaRequirements -import org.apache.hudi.internal.schema.utils.{AvroSchemaEvolutionUtils, SerDeHelper} +import org.apache.hudi.internal.schema.utils.SerDeHelper import org.apache.hudi.keygen.constant.KeyGeneratorType import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.getKeyGeneratorClassName diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala index
Re: [PR] [MINOR] Remove unused import [hudi]
leesf merged PR #10159: URL: https://github.com/apache/hudi/pull/10159
[I] [SUPPORT] Async Clustering: Seeking Help on Specific Partitioning and Regex Pattern [hudi]
soumilshah1995 opened a new issue, #10165: URL: https://github.com/apache/hudi/issues/10165 Subject: Async Clustering: Seeking Help on Specific Partitioning and Regex Pattern

I'm currently exploring async clustering in Apache Hudi, and this is also intended for a community video. I've successfully executed async clustering, but I have a question about how to cluster specific partitions, or partitions matching a regex pattern. After reviewing the documentation at https://hudi.apache.org/docs/next/clustering/, I came across the following setting: `hoodie.clustering.plan.strategy.partition.selected` (N/A, Required): Comma-separated list of

I attempted to use this, but when I run the "show clustering" command, I observe the following result:

```
+-------------------+------------------+-----------+---------------------+
| timestamp         | input_group_size | state     | involved_partitions |
+-------------------+------------------+-----------+---------------------+
| 20231122190057844 | 1                | COMPLETED | *                   |
+-------------------+------------------+-----------+---------------------+
```

As you can see, `involved_partitions` shows '*', and I'm perplexed as to why clustering applied to all partitions rather than the specific one.
Here's my Spark submit command:

spark-submit \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0 \
  --properties-file spark-config.properties \
  --master 'local[*]' \
  --executor-memory 1g \
  jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
  --mode scheduleAndExecute \
  --base-path file:///Users/soumilnitinshah/Downloads/hudidb/silver/ \
  --table-name orders \
  --hoodie-conf hoodie.clustering.async.enabled=true \
  --hoodie-conf hoodie.clustering.async.max.commits=2 \
  --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824 \
  --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=629145600 \
  --hoodie-conf hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy \
  --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=order_date \
  --hoodie-conf hoodie.clustering.plan.strategy.partition.selected=2023-10-23 \
  --hoodie-conf hoodie.write.concurrency.mode=optimistic_concurrency_control \
  --hoodie-conf hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider

Additionally, I'm curious about how to cluster partitions based on a regex pattern using the setting `hoodie.clustering.plan.strategy.partition.regex.pattern`. Is it correct to set the value to "2023-10-[0-9]" to match all partitions for that pattern? I appreciate any insights or guidance on this matter.
Re: [PR] [HUDI-7086] Scaling gcs event source [hudi]
hudi-bot commented on PR #10073: URL: https://github.com/apache/hudi/pull/10073#issuecomment-1823692854 ## CI report: * 48df6bbec2473dbbbedb1b723896acb17056e80f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21076) * 91b8b5ff8242d5fa0f01fc78ba55f70d458e58c9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)
Re: [PR] [HUDI-7086] Scaling gcs event source [hudi]
rmahindra123 commented on PR #10073: URL: https://github.com/apache/hudi/pull/10073#issuecomment-1823692075 Approved
Re: [PR] [HUDI-7086] Scaling gcs event source [hudi]
hudi-bot commented on PR #10073: URL: https://github.com/apache/hudi/pull/10073#issuecomment-1823688411 ## CI report: * 48df6bbec2473dbbbedb1b723896acb17056e80f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21076) * 91b8b5ff8242d5fa0f01fc78ba55f70d458e58c9 UNKNOWN
[PR] [MINOR] update disaster recovery docs [hudi]
sagarlakshmipathy opened a new pull request, #10164: URL: https://github.com/apache/hudi/pull/10164 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ > added for loop to avoid copy pasting code > added note to make sure users replace the commit and savepoint ts > made markdown edits > fixed indentation ### Impact _Describe any public API or user-facing feature change or any performance impact._ > documentation change ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ > low risk - documentation change ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ > documentation change - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [x] Change Logs and Impact were stated clearly - [x] Adequate tests were added if applicable - local `npm run build` and `npm run serve` passed - [NA] CI passed - CI runs after raising PR
Re: [PR] Asf site update disaster recovery doc [hudi]
sagarlakshmipathy closed pull request #10163: Asf site update disaster recovery doc URL: https://github.com/apache/hudi/pull/10163
Re: [PR] [HUDI-7052] Fix partition key validation for custom key generators. [hudi]
hudi-bot commented on PR #10014: URL: https://github.com/apache/hudi/pull/10014#issuecomment-1823617478 ## CI report: * 5e60b3d12b40a04006d3697fa99538e9e494b96c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21108)
Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]
hudi-bot commented on PR #10122: URL: https://github.com/apache/hudi/pull/10122#issuecomment-1823590565 ## CI report: * cae921ac9d016d28b87139b5c0fd24debadf1592 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21109)
Re: [PR] [MINOR] Making misc fixes to deltastreamer sources(S3 and GCS) [hudi]
nsivabalan commented on code in PR #10095: URL: https://github.com/apache/hudi/pull/10095#discussion_r1402767000 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java: ## @@ -70,6 +72,7 @@ public class S3EventsHoodieIncrSource extends HoodieIncrSource { private static final Logger LOG = LoggerFactory.getLogger(S3EventsHoodieIncrSource.class); + private static final String EMPTY_STRING = ""; Review Comment: StringUtils already contains EMPTY_STRING ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java: ## @@ -133,7 +136,13 @@ public S3EventsHoodieIncrSource( this.srcPath = getStringWithAltKeys(props, HOODIE_SRC_BASE_PATH); this.numInstantsPerFetch = getIntWithAltKeys(props, NUM_INSTANTS_PER_FETCH); this.checkIfFileExists = getBooleanWithAltKeys(props, ENABLE_EXISTS_CHECK); -this.fileFormat = getStringWithAltKeys(props, DATAFILE_FORMAT, true); + +// This is to ensure backward compatibility where we were using the +// config SOURCE_FILE_FORMAT for file format in previous versions. +this.fileFormat = Strings.isNullOrEmpty(getStringWithAltKeys(props, DATAFILE_FORMAT, EMPTY_STRING)) +? getStringWithAltKeys(props, SOURCE_FILE_FORMAT, true) +: getStringWithAltKeys(props, DATAFILE_FORMAT, EMPTY_STRING); Review Comment: for last one ``` getStringWithAltKeys(props, DATAFILE_FORMAT, EMPTY_STRING); ``` you can ignore the last arg. ``` getStringWithAltKeys(props, DATAFILE_FORMAT); ``` default value will not be picked.
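The diff under review implements a backward-compatibility fallback: prefer the new file-format config key when it is explicitly set, otherwise read the legacy key with its default. As a minimal sketch of that pattern (the key names and default below are placeholders, not Hudi's actual config keys, and plain `java.util.Properties` stands in for Hudi's `TypedProperties` helpers):

```java
import java.util.Properties;

public class ConfigFallback {
    // Placeholder key names for illustration only.
    static final String NEW_FORMAT_KEY = "datafile.format";
    static final String LEGACY_FORMAT_KEY = "source.file.format";
    static final String LEGACY_DEFAULT = "parquet";

    // The new key wins when explicitly set; otherwise fall back to the
    // legacy key, and only then to the legacy key's default. This keeps
    // jobs configured with the old key working unchanged.
    static String resolveFileFormat(Properties props) {
        String newVal = props.getProperty(NEW_FORMAT_KEY);
        if (newVal != null && !newVal.isEmpty()) {
            return newVal;
        }
        return props.getProperty(LEGACY_FORMAT_KEY, LEGACY_DEFAULT);
    }
}
```

The reviewer's point maps onto this sketch: when probing whether the new key is set, pass no default at all (here, `getProperty` returning `null`) rather than an explicit empty-string default.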
[PR] Asf site update disaster recovery doc [hudi]
sagarlakshmipathy opened a new pull request, #10163: URL: https://github.com/apache/hudi/pull/10163 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ > added for loops in 2 places to avoid copy pasting effort > fixed indentation in 3 places > added note to make sure users replace the commit and savepoint timestamp ### Impact _Describe any public API or user-facing feature change or any performance impact._ > doc update - no impact ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ > doc update - no impact ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ > doc update was the goal - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [x] Change Logs and Impact were stated clearly - [local testing] Adequate tests were added if applicable - [NA] CI passed
Re: [PR] [MINOR] Making misc fixes to deltastreamer sources(S3 and GCS) [hudi]
hudi-bot commented on PR #10095: URL: https://github.com/apache/hudi/pull/10095#issuecomment-1823585243 ## CI report: * a6476f06265d7600755e5597af173fea6db2954f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21093) * c1b5bd41ac1f4be476fb69f84f7197a27733eb23 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21110)
Re: [PR] [MINOR] Making misc fixes to deltastreamer sources(S3 and GCS) [hudi]
hudi-bot commented on PR #10095: URL: https://github.com/apache/hudi/pull/10095#issuecomment-1823579673 ## CI report: * a6476f06265d7600755e5597af173fea6db2954f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21093) * c1b5bd41ac1f4be476fb69f84f7197a27733eb23 UNKNOWN
Re: [PR] [HUDI-6734] Add back HUDI-5409: Avoid file index and use fs view cache in COW input format [hudi]
nsivabalan commented on code in PR #9567: URL: https://github.com/apache/hudi/pull/9567#discussion_r1402761264 ## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieCopyOnWriteTableInputFormat.java: ## @@ -241,31 +246,86 @@ private List listStatusForSnapshotMode(JobConf job, boolean shouldIncludePendingCommits = HoodieHiveUtils.shouldIncludePendingCommits(job, tableMetaClient.getTableConfig().getTableName()); - HiveHoodieTableFileIndex fileIndex = - new HiveHoodieTableFileIndex( - engineContext, - tableMetaClient, - props, - HoodieTableQueryType.SNAPSHOT, - partitionPaths, - queryCommitInstant, - shouldIncludePendingCommits); - - Map> partitionedFileSlices = fileIndex.listFileSlices(); - - targetFiles.addAll( - partitionedFileSlices.values() - .stream() - .flatMap(Collection::stream) - .filter(fileSlice -> checkIfValidFileSlice(fileSlice)) - .map(fileSlice -> createFileStatusUnchecked(fileSlice, fileIndex, tableMetaClient)) - .collect(Collectors.toList()) - ); + if (HoodieTableMetadataUtil.isFilesPartitionAvailable(tableMetaClient) || conf.getBoolean(ENABLE.key(), ENABLE.defaultValue())) { Review Comment: default value for metadata is false for reader, and true for writer. So, we should use the reader side default here.
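The review comment above flags a subtle default-value bug: the metadata-table toggle defaults differently on the read and write paths, so reader-side code must not fall back to the writer-side default. A minimal sketch of that distinction (constant names are illustrative; plain `java.util.Properties` stands in for the Hadoop `Configuration` in the diff):

```java
import java.util.Properties;

public class MetadataEnableDefaults {
    static final String ENABLE_KEY = "hoodie.metadata.enable";
    // Per the review comment: enabled by default for writers,
    // disabled by default for readers.
    static final boolean WRITER_DEFAULT = true;
    static final boolean READER_DEFAULT = false;

    static boolean isMetadataEnabledForReader(Properties conf) {
        String v = conf.getProperty(ENABLE_KEY);
        // Falling back to WRITER_DEFAULT here would silently turn the
        // metadata table on for readers that never asked for it.
        return v == null ? READER_DEFAULT : Boolean.parseBoolean(v);
    }
}
```

This is why `ENABLE.defaultValue()` (the writer-side default) is the wrong fallback in `HoodieCopyOnWriteTableInputFormat`, which sits on the read path.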
Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]
hudi-bot commented on PR #10122: URL: https://github.com/apache/hudi/pull/10122#issuecomment-1823515572 ## CI report: * 597f6d7bd7134d635ad5a675bd398ba03faafef8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21096) * cae921ac9d016d28b87139b5c0fd24debadf1592 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21109)
Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]
hudi-bot commented on PR #10122: URL: https://github.com/apache/hudi/pull/10122#issuecomment-1823471080 ## CI report: * 597f6d7bd7134d635ad5a675bd398ba03faafef8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21096) * cae921ac9d016d28b87139b5c0fd24debadf1592 UNKNOWN
Re: [PR] [HUDI-7136] in the dfs catalog scenario, solve the problem of Primary key definit… [hudi]
hudi-bot commented on PR #10162: URL: https://github.com/apache/hudi/pull/10162#issuecomment-1823447272 ## CI report: * 64589da09eb106b1fc771ca77b64d30c81ae5970 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21106)
Re: [PR] [HUDI-7052] Fix partition key validation for custom key generators. [hudi]
hudi-bot commented on PR #10014: URL: https://github.com/apache/hudi/pull/10014#issuecomment-1823446823 ## CI report: * 80725367a7e21160545ffa27ec1275a32e47e7c4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21092) * 5e60b3d12b40a04006d3697fa99538e9e494b96c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21108)
Re: [PR] [HUDI-7052] Fix partition key validation for custom key generators. [hudi]
hudi-bot commented on PR #10014: URL: https://github.com/apache/hudi/pull/10014#issuecomment-1823391577 ## CI report: * 80725367a7e21160545ffa27ec1275a32e47e7c4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21092) * 5e60b3d12b40a04006d3697fa99538e9e494b96c UNKNOWN
[I] [SUPPORT] Schema evolution error: promoted data type from integer to double [hudi]
kenny291 opened a new issue, #3558: URL: https://github.com/apache/hudi/issues/3558 **Description** Hi all, I tested schema evolution changing a data type from int to double, but it did not work with Hudi (Hudi doc: https://github.com/apache/hudi/blob/asf-site/website/docs/schema_evolution.md). I also tried changing a data type from float to double; it failed with the same error. **To Reproduce** Steps to reproduce the behavior: 1. Init the Spark context

```
./spark-shell \
  --packages org.apache.spark:spark-avro_2.12:3.1.2,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.hadoop.fs.s3a.access.key=xx' \
  --conf 'spark.hadoop.fs.s3a.secret.key=xx' \
  --conf 'spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem' \
  --conf 'spark.hadoop.fs.s3a.endpoint=s3.amazonaws.com' \
  --conf 'spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider' \
  --conf 'spark.hadoop.fs.s3a.fast.upload=true' \
  --conf 'spark.hadoop.fs.s3a.multiobjectdelete.enable=false' \
  --conf 'spark.sql.parquet.filterPushdown=true' \
  --conf 'spark.sql.parquet.mergeSchema=false' \
  --conf 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2' \
  --conf 'spark.speculation=false' \
  --conf 'hive.metastore.schema.verification=false' \
  --conf 'hive.metastore.schema.verification.record.version=false' \
  --conf spark.sql.hive.convertMetastoreParquet=false
```

2.
create base hudi table

```
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val tableName = "hudi_trips_cow"
val basePath = "s3a://data-lake/hudi_test/hudi_trips_cow_schema_change"

val schema = StructType(Array(
  StructField("rowId", StringType, true),
  StructField("partitionId", StringType, true),
  StructField("preComb", LongType, true),
  StructField("name", StringType, true),
  StructField("versionId", StringType, true),
  StructField("intToLong", IntegerType, true),   // ok
  StructField("intToDouble", IntegerType, true),
  StructField("longToFloat", LongType, true),    // ok
  // StructField("longToDouble", IntegerType, true),
  StructField("floatToDouble", FloatType, true)
)) // 9 cols

val data1 = Seq(
  Row("row_1", "part_0", 0L, "bob", "v_0", 0, 1, 1L, 1.1f),
  Row("row_2", "part_0", 0L, "john", "v_0", 0, 1, 2L, 1.2f),
  Row("row_3", "part_3", 0L, "tom", "v_0", 0, 1, 3L, 1.3f))

var dfFromData1 = spark.createDataFrame(data1, schema)
dfFromData1.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
  option(RECORDKEY_FIELD_OPT_KEY, "rowId").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionId").
  option("hoodie.index.type", "SIMPLE").
  option("hoodie.datasource.write.hive_style_partitioning", true).
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)
```

3. Change column `intToDouble` data type from int to double and append new data to the old table.
```
// Int to double
val newSchema = StructType(Array(
  StructField("rowId", StringType, true),
  StructField("partitionId", StringType, true),
  StructField("preComb", LongType, true),
  StructField("name", StringType, true),
  StructField("versionId", StringType, true),
  StructField("intToLong", IntegerType, true),
  StructField("intToDouble", DoubleType, true),
  StructField("longToFloat", LongType, true),
  // StructField("longToDouble", IntegerType, true),
  StructField("floatToDouble", FloatType, true)
)) // 9 cols

val data2 = Seq(
  Row("row_2", "part_0", 5L, "john", "v_3", 3, 1D, 2L, 1.8f),
  Row("row_5", "part_0", 5L, "maroon", "v_2", 2, 1D, 2L, 1.8f),
  Row("row_9", "part_9", 5L, "michael", "v_2", 2, 1D, 2L, 1.8f))

var dfFromData2 = spark.createDataFrame(data2, newSchema)
dfFromData2.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
  option(RECORDKEY_FIELD_OPT_KEY, "rowId").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionId").
  option("hoodie.datasource.write.hive_style_partitioning", true).
  option("hoodie.index.type", "SIMPLE").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
```

4. Read hudi table
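The Hudi schema-evolution doc the reporter cites describes a matrix of allowed primitive promotions (for example int to long, float, or double, and float to double). As a minimal sketch, this is the promotion rule the reproduction exercises; it is an illustration of the documented matrix, not Hudi's actual implementation, and the reporter's error shows the int-to-double case failing in practice on 0.8.0:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TypePromotionCheck {
    // Widening promotions per the schema-evolution doc's matrix:
    // each source type maps to the set of types it may evolve into.
    private static final Map<String, Set<String>> PROMOTIONS = new HashMap<>();
    static {
        PROMOTIONS.put("int", new HashSet<>(Arrays.asList("int", "long", "float", "double")));
        PROMOTIONS.put("long", new HashSet<>(Arrays.asList("long", "float", "double")));
        PROMOTIONS.put("float", new HashSet<>(Arrays.asList("float", "double")));
        PROMOTIONS.put("double", new HashSet<>(Collections.singletonList("double")));
    }

    static boolean isPromotable(String from, String to) {
        return PROMOTIONS.getOrDefault(from, Collections.<String>emptySet()).contains(to);
    }
}
```

Note that the matrix is strictly widening: narrowing changes such as double back to int are never allowed, which is why only the reader/writer schema with the wider type can be used going forward.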
Re: [PR] [HUDI-7135] Spark reads hudi table error when flink creates the table without pre… [hudi]
hudi-bot commented on PR #10157: URL: https://github.com/apache/hudi/pull/10157#issuecomment-1823352506 ## CI report: * 3b24d4130099aab67c76de81f77701c730f2e78a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21105) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi) branch master updated: [HUDI-7123] Improve CI scripts (#10136)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new f88a73f09e7 [HUDI-7123] Improve CI scripts (#10136)
f88a73f09e7 is described below

commit f88a73f09e753ffe0a8029b490a8a430b05b8eaf
Author: Y Ethan Guo
AuthorDate: Wed Nov 22 10:48:48 2023 -0800

    [HUDI-7123] Improve CI scripts (#10136)

    Improves the CI scripts in the following aspects:
    - Removes `hudi-common` tests from `test-spark` job in GH CI as they are already covered by Azure CI
    - Removes unnecessary bundle validation jobs and adds new bundle validation images (`flink1153hive313spark323`, `flink1162hive313spark331`)
    - Updates `validate-release-candidate-bundles` jobs
    - Moves functional tests of `hudi-spark-datasource/hudi-spark` from job 4 (3 hours) to job 2 (1 hour) in Azure CI to rebalance the finish time.
---
 .github/workflows/bot.yml                      | 30 +-
 azure-pipelines-20230430.yml                   |  6 +++--
 .../base/build_flink1153hive313spark323.sh     | 26 +++
 .../base/build_flink1162hive313spark331.sh     | 26 +++
 packaging/bundle-validation/ci_run.sh          | 20 ++-
 5 files changed, 88 insertions(+), 20 deletions(-)

diff --git a/.github/workflows/bot.yml b/.github/workflows/bot.yml
index cff377ed13f..67c7ac16eaa 100644
--- a/.github/workflows/bot.yml
+++ b/.github/workflows/bot.yml
@@ -98,7 +98,7 @@ jobs:
           SCALA_PROFILE: ${{ matrix.scalaProfile }}
           SPARK_PROFILE: ${{ matrix.sparkProfile }}
         run:
-          mvn clean install -T 2 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -DskipTests=true $MVN_ARGS -am -pl "hudi-examples/hudi-examples-spark,hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES"
+          mvn clean install -T 2 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -DskipTests=true $MVN_ARGS -am -pl "hudi-examples/hudi-examples-spark,$SPARK_COMMON_MODULES,$SPARK_MODULES"
       - name: Quickstart Test
         env:
           SCALA_PROFILE: ${{ matrix.scalaProfile }}
@@ -112,7 +112,7 @@ jobs:
           SPARK_MODULES: ${{ matrix.sparkModules }}
         if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 as it's covered by Azure CI
         run:
-          mvn test -Punit-tests -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl "hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
+          mvn test -Punit-tests -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
       - name: FT - Spark
         env:
           SCALA_PROFILE: ${{ matrix.scalaProfile }}
@@ -299,19 +299,13 @@ jobs:
         - flinkProfile: 'flink1.18'
           sparkProfile: 'spark3.4'
           sparkRuntime: 'spark3.4.0'
-        - flinkProfile: 'flink1.18'
-          sparkProfile: 'spark3.3'
-          sparkRuntime: 'spark3.3.2'
         - flinkProfile: 'flink1.17'
           sparkProfile: 'spark3.3'
           sparkRuntime: 'spark3.3.2'
         - flinkProfile: 'flink1.16'
-          sparkProfile: 'spark3.3'
-          sparkRuntime: 'spark3.3.2'
-        - flinkProfile: 'flink1.15'
           sparkProfile: 'spark3.3'
           sparkRuntime: 'spark3.3.1'
-        - flinkProfile: 'flink1.14'
+        - flinkProfile: 'flink1.15'
           sparkProfile: 'spark3.2'
           sparkRuntime: 'spark3.2.3'
         - flinkProfile: 'flink1.14'
@@ -380,18 +374,30 @@ jobs:
     strategy:
       matrix:
         include:
-          - flinkProfile: 'flink1.16'
+          - flinkProfile: 'flink1.18'
            sparkProfile: 'spark3'
+           sparkRuntime: 'spark3.5.0'
+          - flinkProfile: 'flink1.18'
+           sparkProfile: 'spark3.5'
+           sparkRuntime: 'spark3.5.0'
+          - flinkProfile: 'flink1.18'
+           sparkProfile: 'spark3.4'
+           sparkRuntime: 'spark3.4.0'
+          - flinkProfile: 'flink1.17'
+           sparkProfile: 'spark3.3'
            sparkRuntime: 'spark3.3.2'
-          - flinkProfile: 'flink1.15'
+          - flinkProfile: 'flink1.16'
            sparkProfile: 'spark3.3'
            sparkRuntime: 'spark3.3.1'
-          - flinkProfile: 'flink1.14'
+          - flinkProfile: 'flink1.15'
            sparkProfile: 'spark3.2'
            sparkRuntime: 'spark3.2.3'
           - flinkProfile: 'flink1.14'
            sparkProfile: 'spark3.1'
            sparkRuntime: 'spark3.1.3'
+          - flinkProfile: 'flink1.14'
+           sparkProfile: 'spark3.0'
+           sparkRuntime: 'spark3.0.2'
           - flinkProfile: 'flink1.14'
            sparkProfile: 'spark'
            sparkRuntime: 'spark2.4.8'
diff --git a/azure-pipelines-20230430.yml b/azure-pipelines-20230430.yml
index 21c6d932ef9..c2a5f9d5a44 100644
--- a/azure-pipelines-20230430.yml
+++
Re: [PR] [HUDI-7123] Improve CI scripts [hudi]
yihua merged PR #10136: URL: https://github.com/apache/hudi/pull/10136 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7136] in the dfs catalog scenario, solve the problem of Primary key definit… [hudi]
hudi-bot commented on PR #10162: URL: https://github.com/apache/hudi/pull/10162#issuecomment-1823196032 ## CI report: * 64589da09eb106b1fc771ca77b64d30c81ae5970 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21106) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7135] Spark reads hudi table error when flink creates the table without pre… [hudi]
hudi-bot commented on PR #10157: URL: https://github.com/apache/hudi/pull/10157#issuecomment-1823195959 ## CI report: * 1ecd7d0aaf9a406be3d134a0202911a7b32f05bd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21101) * 3b24d4130099aab67c76de81f77701c730f2e78a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21105) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7135) Spark reads hudi table error when flink creates the table without preCombine fields by catalog or factory
[ https://issues.apache.org/jira/browse/HUDI-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7135:
---------------------------------
    Labels: pull-request-available  (was: )

> Spark reads hudi table error when flink creates the table without preCombine
> fields by catalog or factory
> ----------------------------------------------------------------------------
>
>                 Key: HUDI-7135
>                 URL: https://issues.apache.org/jira/browse/HUDI-7135
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: 陈磊
>            Priority: Major
>              Labels: pull-request-available
>
> Create a table through dfs catalog, hms catalog, or sink ddl, and then query
> the data of the table through spark, and an exception occurs:
> java.util.NoSuchElementException: key not found: ts
> demo:
> 1. create a table through hms catalog:
> {panel:title=hms catalog create table}
> CREATE CATALOG hudi_catalog WITH(
>   'type' = 'hudi',
>   'mode' = 'hms'
> );
> CREATE TABLE hudi_catalog.`default`.ct1
> (
>   f1 string,
>   f2 string
> ) WITH (
>   'connector' = 'hudi',
>   'path' = 'file:///Users/x/x/others/data/hudi-warehouse/ct1',
>   'table.type' = 'COPY_ON_WRITE',
>   'write.operation' = 'insert'
> );
> {panel}
> 2. spark query
> {panel:title=spark query}
> select * from ct1
> {panel}
> 3. exception
> {panel:title=exception}
> java.util.NoSuchElementException: key not found: ts
> {panel}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7135] Spark reads hudi table error when flink creates the table without pre… [hudi]
hudi-bot commented on PR #10157: URL: https://github.com/apache/hudi/pull/10157#issuecomment-1823184561 ## CI report: * 1ecd7d0aaf9a406be3d134a0202911a7b32f05bd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21101) * 3b24d4130099aab67c76de81f77701c730f2e78a UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7136] in the dfs catalog scenario, solve the problem of Primary key definit… [hudi]
hudi-bot commented on PR #10162: URL: https://github.com/apache/hudi/pull/10162#issuecomment-1823184657 ## CI report: * 64589da09eb106b1fc771ca77b64d30c81ae5970 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [MINOR] Remove unused import [hudi]
hudi-bot commented on PR #10159: URL: https://github.com/apache/hudi/pull/10159#issuecomment-1823173019 ## CI report: * 72e6a610b88f3d269477fd967b970c48fbc6f387 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21103) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7136) in the dfs catalog scenario, solve the problem of Primary key definition is missing
[ https://issues.apache.org/jira/browse/HUDI-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7136: - Labels: pull-request-available (was: ) > in the dfs catalog scenario, solve the problem of Primary key definition is > missing > --- > > Key: HUDI-7136 > URL: https://issues.apache.org/jira/browse/HUDI-7136 > Project: Apache Hudi > Issue Type: Bug >Reporter: 陈磊 >Priority: Major > Labels: pull-request-available > > in the dfs catalog scenario, solve the problem of Primary key definition is > missing > demo: > {code:java} > // sql > CREATE CATALOG hudi_catalog WITH( > 'type' = 'hudi', > 'catalog.path' = 'file:///Users/x/x/others/data/hudi-warehouse/', > 'mode' = 'dfs' > ); > CREATE TABLE IF NOT EXISTS hudi_catalog.tmp.ctn4 > ( > f1 string, > f2 string, > primary key (f1) not enforced > ) WITH ( > 'connector' = 'hudi', > 'path' = 'file:///Users/x/x/others/data/hudi-warehouse/ctn4', > 'table.type' = 'MERGE_ON_READ', > 'write.operation' = 'upsert', > 'hoodie.datasource.write.recordkey.field' = 'f1' > ) ; > {code} > exception: > {code:java} > Primary key definition is missing {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7136] in the dfs catalog scenario, solve the problem of Primary key definit… [hudi]
empcl opened a new pull request, #10162: URL: https://github.com/apache/hudi/pull/10162 …ion is missing ### Change Logs in the dfs catalog scenario, solve the problem of Primary key definition is missing ### Impact no ### Risk level (write none, low medium or high below) no ### Documentation Update no ### Contributor's checklist no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7136) in the dfs catalog scenario, solve the problem of Primary key definition is missing
[ https://issues.apache.org/jira/browse/HUDI-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 陈磊 updated HUDI-7136: - Description: in the dfs catalog scenario, solve the problem of Primary key definition is missing demo: {code:java} // sql CREATE CATALOG hudi_catalog WITH( 'type' = 'hudi', 'catalog.path' = 'file:///Users/chenlei677/chenlei677/others/data/hudi-warehouse/', 'mode' = 'dfs' ); CREATE TABLE IF NOT EXISTS hudi_catalog.tmp.ctn4 ( f1 string, f2 string, primary key (f1) not enforced ) WITH ( 'connector' = 'hudi', 'path' = 'file:///Users/chenlei677/chenlei677/others/data/hudi-warehouse/ctn4', 'table.type' = 'MERGE_ON_READ', 'write.operation' = 'upsert', 'hoodie.datasource.write.recordkey.field' = 'f1' ) ; {code} exception: was: in the dfs catalog scenario, solve the problem of Primary key definition is missing demo: > in the dfs catalog scenario, solve the problem of Primary key definition is > missing > --- > > Key: HUDI-7136 > URL: https://issues.apache.org/jira/browse/HUDI-7136 > Project: Apache Hudi > Issue Type: Bug >Reporter: 陈磊 >Priority: Major > > in the dfs catalog scenario, solve the problem of Primary key definition is > missing > demo: > {code:java} > // sql > CREATE CATALOG hudi_catalog WITH( > 'type' = 'hudi', > 'catalog.path' = > 'file:///Users/chenlei677/chenlei677/others/data/hudi-warehouse/', > 'mode' = 'dfs' > ); > CREATE TABLE IF NOT EXISTS hudi_catalog.tmp.ctn4 > ( > f1 string, > f2 string, > primary key (f1) not enforced > ) WITH ( > 'connector' = 'hudi', > 'path' = > 'file:///Users/chenlei677/chenlei677/others/data/hudi-warehouse/ctn4', > 'table.type' = 'MERGE_ON_READ', > 'write.operation' = 'upsert', > 'hoodie.datasource.write.recordkey.field' = 'f1' > ) ; > {code} > exception: -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7136) in the dfs catalog scenario, solve the problem of Primary key definition is missing
[ https://issues.apache.org/jira/browse/HUDI-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 陈磊 updated HUDI-7136: - Description: in the dfs catalog scenario, solve the problem of Primary key definition is missing demo: {code:java} // sql CREATE CATALOG hudi_catalog WITH( 'type' = 'hudi', 'catalog.path' = 'file:///Users/x/x/others/data/hudi-warehouse/', 'mode' = 'dfs' ); CREATE TABLE IF NOT EXISTS hudi_catalog.tmp.ctn4 ( f1 string, f2 string, primary key (f1) not enforced ) WITH ( 'connector' = 'hudi', 'path' = 'file:///Users/x/x/others/data/hudi-warehouse/ctn4', 'table.type' = 'MERGE_ON_READ', 'write.operation' = 'upsert', 'hoodie.datasource.write.recordkey.field' = 'f1' ) ; {code} exception: {code:java} Primary key definition is missing {code} was: in the dfs catalog scenario, solve the problem of Primary key definition is missing demo: {code:java} // sql CREATE CATALOG hudi_catalog WITH( 'type' = 'hudi', 'catalog.path' = 'file:///Users/chenlei677/chenlei677/others/data/hudi-warehouse/', 'mode' = 'dfs' ); CREATE TABLE IF NOT EXISTS hudi_catalog.tmp.ctn4 ( f1 string, f2 string, primary key (f1) not enforced ) WITH ( 'connector' = 'hudi', 'path' = 'file:///Users/chenlei677/chenlei677/others/data/hudi-warehouse/ctn4', 'table.type' = 'MERGE_ON_READ', 'write.operation' = 'upsert', 'hoodie.datasource.write.recordkey.field' = 'f1' ) ; {code} exception: > in the dfs catalog scenario, solve the problem of Primary key definition is > missing > --- > > Key: HUDI-7136 > URL: https://issues.apache.org/jira/browse/HUDI-7136 > Project: Apache Hudi > Issue Type: Bug >Reporter: 陈磊 >Priority: Major > > in the dfs catalog scenario, solve the problem of Primary key definition is > missing > demo: > {code:java} > // sql > CREATE CATALOG hudi_catalog WITH( > 'type' = 'hudi', > 'catalog.path' = 'file:///Users/x/x/others/data/hudi-warehouse/', > 'mode' = 'dfs' > ); > CREATE TABLE IF NOT EXISTS hudi_catalog.tmp.ctn4 > ( > f1 string, > f2 string, > primary key 
(f1) not enforced > ) WITH ( > 'connector' = 'hudi', > 'path' = 'file:///Users/x/x/others/data/hudi-warehouse/ctn4', > 'table.type' = 'MERGE_ON_READ', > 'write.operation' = 'upsert', > 'hoodie.datasource.write.recordkey.field' = 'f1' > ) ; > {code} > exception: > {code:java} > Primary key definition is missing {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7136) in the dfs catalog scenario, solve the problem of Primary key definition is missing
陈磊 created HUDI-7136: Summary: in the dfs catalog scenario, solve the problem of Primary key definition is missing Key: HUDI-7136 URL: https://issues.apache.org/jira/browse/HUDI-7136 Project: Apache Hudi Issue Type: Bug Reporter: 陈磊 in the dfs catalog scenario, solve the problem of Primary key definition is missing demo: -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [I] Cannot encode decimal with precision 15 as max precision 14 [hudi]
ad1happy2go commented on issue #10160: URL: https://github.com/apache/hudi/issues/10160#issuecomment-1823026376 @njalan I remember a similar issue before as well. It was fixed in this PR: https://github.com/apache/hudi/pull/8063 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-7135) Spark reads hudi table error when flink creates the table without preCombine fields by catalog or factory
陈磊 created HUDI-7135:
-------------------------
             Summary: Spark reads hudi table error when flink creates the table without preCombine fields by catalog or factory
                 Key: HUDI-7135
                 URL: https://issues.apache.org/jira/browse/HUDI-7135
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: 陈磊

Create a table through dfs catalog, hms catalog, or sink ddl, and then query the data of the table through spark, and an exception occurs:
java.util.NoSuchElementException: key not found: ts

demo:
1. create a table through hms catalog:
{panel:title=hms catalog create table}
CREATE CATALOG hudi_catalog WITH(
  'type' = 'hudi',
  'mode' = 'hms'
);
CREATE TABLE hudi_catalog.`default`.ct1
(
  f1 string,
  f2 string
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///Users/x/x/others/data/hudi-warehouse/ct1',
  'table.type' = 'COPY_ON_WRITE',
  'write.operation' = 'insert'
);
{panel}
2. spark query
{panel:title=spark query}
select * from ct1
{panel}
3. exception
{panel:title=exception}
java.util.NoSuchElementException: key not found: ts
{panel}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7034] Refresh index fix - remove cached file slices within part… [hudi]
hudi-bot commented on PR #10151: URL: https://github.com/apache/hudi/pull/10151#issuecomment-1822983761 ## CI report: * 190b9df539423cb5da8f01b400426d9e97f7bab4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21098) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi) branch master updated: [HUDI-7004] Add support of snapshotLoadQuerySplitter in s3/gcs sources (#10152)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 38c87b7ebe1 [HUDI-7004] Add support of snapshotLoadQuerySplitter in s3/gcs sources (#10152)
38c87b7ebe1 is described below

commit 38c87b7ebe148e8870db83be433376ad89b9c048
Author: harshal
AuthorDate: Wed Nov 22 20:53:42 2023 +0530

    [HUDI-7004] Add support of snapshotLoadQuerySplitter in s3/gcs sources (#10152)
---
 .../apache/hudi/common/config/TypedProperties.java |  5 ++
 .../sources/GcsEventsHoodieIncrSource.java         |  7 +-
 .../hudi/utilities/sources/HoodieIncrSource.java   |  6 +-
 .../sources/S3EventsHoodieIncrSource.java          |  9 ++-
 .../sources/SnapshotLoadQuerySplitter.java         |  9 +++
 .../utilities/sources/helpers/QueryRunner.java     | 35 +
 .../sources/TestGcsEventsHoodieIncrSource.java     | 85 --
 .../sources/TestS3EventsHoodieIncrSource.java      | 78 ++--
 8 files changed, 198 insertions(+), 36 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/config/TypedProperties.java b/hudi-common/src/main/java/org/apache/hudi/common/config/TypedProperties.java
index 3db8210cade..86b7f4cc457 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/config/TypedProperties.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/config/TypedProperties.java
@@ -18,6 +18,7 @@
 package org.apache.hudi.common.config;

+import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.StringUtils;

 import java.io.Serializable;
@@ -78,6 +79,10 @@ public class TypedProperties extends Properties implements Serializable {
     return containsKey(property) ? getProperty(property) : defaultValue;
   }

+  public Option<String> getNonEmptyStringOpt(String property, String defaultValue) {
+    return Option.ofNullable(StringUtils.emptyToNull(getString(property, defaultValue)));
+  }
+
   public List<String> getStringList(String property, String delimiter, List<String> defaultVal) {
     if (!containsKey(property)) {
       return defaultVal;

diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/GcsEventsHoodieIncrSource.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/GcsEventsHoodieIncrSource.java
index d09bad71916..a06130d3972 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/GcsEventsHoodieIncrSource.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/GcsEventsHoodieIncrSource.java
@@ -114,6 +114,7 @@ public class GcsEventsHoodieIncrSource extends HoodieIncrSource {
   private final CloudDataFetcher gcsObjectDataFetcher;
   private final QueryRunner queryRunner;
   private final Option<SchemaProvider> schemaProvider;
+  private final Option<SnapshotLoadQuerySplitter> snapshotLoadQuerySplitter;

   public static final String GCS_OBJECT_KEY = "name";
@@ -145,6 +146,7 @@
     this.gcsObjectDataFetcher = gcsObjectDataFetcher;
     this.queryRunner = queryRunner;
     this.schemaProvider = Option.ofNullable(schemaProvider);
+    this.snapshotLoadQuerySplitter = SnapshotLoadQuerySplitter.getInstance(props);

     LOG.info("srcPath: " + srcPath);
     LOG.info("missingCheckpointStrategy: " + missingCheckpointStrategy);
@@ -171,8 +173,9 @@
       return Pair.of(Option.empty(), queryInfo.getStartInstant());
     }

-    Dataset<Row> cloudObjectMetadataDF = queryRunner.run(queryInfo);
-    Dataset<Row> filteredSourceData = gcsObjectMetadataFetcher.applyFilter(cloudObjectMetadataDF);
+    Pair<QueryInfo, Dataset<Row>> queryInfoDatasetPair = queryRunner.run(queryInfo, snapshotLoadQuerySplitter);
+    Dataset<Row> filteredSourceData = gcsObjectMetadataFetcher.applyFilter(queryInfoDatasetPair.getRight());
+    queryInfo = queryInfoDatasetPair.getLeft();

     LOG.info("Adjusting end checkpoint:" + queryInfo.getEndInstant() + " based on sourceLimit :" + sourceLimit);
     Pair<CloudObjectIncrCheckpoint, Option<Dataset<Row>>> checkPointAndDataset = IncrSourceHelper.filterAndGenerateCheckpointBasedOnSourceLimit(

diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java
index 1d302fa106b..f87e5c231bf 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java
@@ -25,7 +25,6 @@
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.table.timeline.TimelineUtils.HollowCommitHandling;
 import org.apache.hudi.common.util.CollectionUtils;
 import org.apache.hudi.common.util.Option;
-import org.apache.hudi.common.util.ReflectionUtils;
 import org.apache.hudi.common.util.collection.Pair;
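As a side note on the `getNonEmptyStringOpt` helper this commit adds to `TypedProperties`: it resolves a property (falling back to a default) and collapses null or empty strings to an empty `Option`. A hedged Scala sketch of the equivalent semantics (illustrative only, not the Hudi code; the function and map names here are mine):

```
// Illustrative Scala equivalent of TypedProperties#getNonEmptyStringOpt:
// resolve the property with a fallback, then collapse null/empty to None.
def nonEmptyStringOpt(props: Map[String, String],
                      key: String,
                      default: String): Option[String] =
  Option(props.getOrElse(key, default)).filter(_.nonEmpty)

nonEmptyStringOpt(Map("a" -> "x"), "a", "d")  // Some("x")
nonEmptyStringOpt(Map("a" -> ""), "a", "d")   // None: empty collapses
nonEmptyStringOpt(Map.empty, "a", "d")        // Some("d"): default applies
```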
Re: [PR] [HUDI-7004] Add support of snapshotLoadQuerySplitter in s3/gcs sources [hudi]
nsivabalan merged PR #10152: URL: https://github.com/apache/hudi/pull/10152 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi) branch master updated (cda9dbca206 -> d0edfb55ca2)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

    from cda9dbca206 [HUDI-7129] Fix bug when upgrade from table version three using UpgradeOrDowngradeProcedure (#10147)
     add d0edfb55ca2 [HUDI-6961] Fixing DefaultHoodieRecordPayload to honor deletion based on meta field as well as custome delete marker (#10150)

No new revisions were added by this update.

Summary of changes:
 .../common/model/DefaultHoodieRecordPayload.java | 29 --
 .../model/TestDefaultHoodieRecordPayload.java    |  9 ++-
 2 files changed, 35 insertions(+), 3 deletions(-)
Re: [PR] [HUDI-6961] Fixing DefaultHoodieRecordPayload to honor deletion based on meta field as well as custome delete marker [hudi]
nsivabalan merged PR #10150: URL: https://github.com/apache/hudi/pull/10150 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Spark reads hudi table error when flink creates the table without pre… [hudi]
hudi-bot commented on PR #10157: URL: https://github.com/apache/hudi/pull/10157#issuecomment-1822968869 ## CI report: * 1ecd7d0aaf9a406be3d134a0202911a7b32f05bd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21101) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7110] Add call procedure for show column stats information [hudi]
hudi-bot commented on PR #10120: URL: https://github.com/apache/hudi/pull/10120#issuecomment-1822968498 ## CI report: * a7f986bd546e2c38c241ee743734dbec491b0351 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21099) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] in the dfs catalog scenario, solve the problem of Primary key definit… [hudi]
empcl closed pull request #10161: in the dfs catalog scenario, solve the problem of Primary key definit… URL: https://github.com/apache/hudi/pull/10161 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] in the dfs catalog scenario, solve the problem of Primary key definit… [hudi]
empcl opened a new pull request, #10161: URL: https://github.com/apache/hudi/pull/10161 …ion is missing ### Change Logs in the dfs catalog scenario, solve the problem of Primary key definition is missing ### Impact no ### Risk level (write none, low medium or high below) no ### Documentation Update no ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync [hudi]
hudi-bot commented on PR #10158: URL: https://github.com/apache/hudi/pull/10158#issuecomment-1822878847 ## CI report: * c8c49d513c8b91b2ff8462f6db25203ba563d39a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21102) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]
hudi-bot commented on PR #10122: URL: https://github.com/apache/hudi/pull/10122#issuecomment-1822849025 ## CI report: * 597f6d7bd7134d635ad5a675bd398ba03faafef8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21096) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] Cannot encode decimal with precision 15 as max precision 14 [hudi]
njalan opened a new issue, #10160: URL: https://github.com/apache/hudi/issues/10160

Got the error message below when trying to load data from PostgreSQL into Hudi; the same load works fine on Hudi 0.9.

Caused by: org.apache.hudi.exception.HoodieException: org.apache.avro.AvroTypeException: Cannot encode decimal with precision 15 as max precision 14
 at org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:73)
 at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:154)
 ... 31 more
Caused by: org.apache.avro.AvroTypeException: Cannot encode decimal with precision 15 as max precision 14
 at org.apache.avro.Conversions$DecimalConversion.validate(Conversions.java:140)
 at org.apache.avro.Conversions$DecimalConversion.toFixed(Conversions.java:104)
 at org.apache.hudi.avro.HoodieAvroUtils.rewritePrimaryTypeWithDiffSchemaType(HoodieAvroUtils.java:994)
 at org.apache.hudi.avro.HoodieAvroUtils.rewritePrimaryType(HoodieAvroUtils.java:921)
 at org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:866)
 at org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:864)
 at org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:822)
 at org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:786)
 at org.apache.hudi.common.model.HoodieAvroIndexedRecord.rewriteRecordWithNewSchema(HoodieAvroIndexedRecord.java:123)
 at org.apache.hudi.common.model.HoodieRecord.rewriteRecordWithNewSchema(HoodieRecord.java:369)
 at org.apache.hudi.table.action.commit.HoodieMergeHelper.lambda$runMerge$1(HoodieMergeHelper.java:143)
 at org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:66)

I think the source table column declared as **numeric(14, 4)** causes the issue.

Environment Description
* Hudi version : 0.13.1
* Spark version : 3.0.1
* Hive version : 3.1
* Hadoop version : 3.2.2
* Storage (HDFS/S3/GCS..) :
* Running on Docker? : no

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
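The check that fires here lives in Avro's `Conversions$DecimalConversion.validate`: a decimal whose number of significant digits exceeds the writer schema's declared precision is rejected. As an illustration only (this is not Hudi's or Avro's actual code), a minimal Python sketch of that rule, using a hypothetical `validate` helper:

```python
from decimal import Decimal


def decimal_precision(value: Decimal) -> int:
    """Number of significant digits in the unscaled value (mirrors
    Java BigDecimal.precision() for non-zero values)."""
    return len(value.as_tuple().digits)


def validate(value: Decimal, max_precision: int) -> Decimal:
    """Reject values whose precision exceeds the schema's declared
    precision, as Avro's DecimalConversion does before encoding."""
    p = decimal_precision(value)
    if p > max_precision:
        raise ValueError(
            f"Cannot encode decimal with precision {p} "
            f"as max precision {max_precision}")
    return value


# 1234567890.1234 has 14 significant digits and fits a decimal(14, 4) schema;
# 12345678901.2345 has 15 and does not.
validate(Decimal("1234567890.1234"), 14)
```

Under this reading, a `numeric(14, 4)` source column could plausibly trip the check if the record-rewriting path (`rewriteRecordWithNewSchema` in the trace) ends up encoding a value against a decimal schema whose declared precision is narrower than the value actually requires.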
Re: [PR] [MINOR] Remove unused import [hudi]
hudi-bot commented on PR #10159: URL: https://github.com/apache/hudi/pull/10159#issuecomment-1822776587

## CI report:

* 72e6a610b88f3d269477fd967b970c48fbc6f387 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21103)

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync [hudi]
hudi-bot commented on PR #10158: URL: https://github.com/apache/hudi/pull/10158#issuecomment-1822776507

## CI report:

* c8c49d513c8b91b2ff8462f6db25203ba563d39a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21102)

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [MINOR] Remove unused import [hudi]
hudi-bot commented on PR #10159: URL: https://github.com/apache/hudi/pull/10159#issuecomment-1822763645

## CI report:

* 72e6a610b88f3d269477fd967b970c48fbc6f387 UNKNOWN

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync [hudi]
hudi-bot commented on PR #10158: URL: https://github.com/apache/hudi/pull/10158#issuecomment-1822763563

## CI report:

* c8c49d513c8b91b2ff8462f6db25203ba563d39a UNKNOWN

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build