[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql
hudi-bot commented on PR #9408: URL: https://github.com/apache/hudi/pull/9408#issuecomment-1672596213 ## CI report: * 89c387c5edc9044786899bc1288e35121df600f9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19235) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] zlinsc commented on issue #9319: [SUPPORT] how to use HiveSyncConfig instead of hive configs in DataSourceWriteOptions object
zlinsc commented on issue #9319: URL: https://github.com/apache/hudi/issues/9319#issuecomment-1672579627 > @zlinsc You can use META_SYNC_DATABASE_NAME and META_SYNC_TABLE_NAME from HoodieSyncConfig. Will HoodieSyncConfig replace all of these configs in the future? It is sometimes hard to find the correct variable, and I end up searching through other repositories' code. I hope Hudi will unify them all.
[GitHub] [hudi] hudi-bot commented on pull request #9412: [HUDI-6676] Add command for CreateHoodieTableLike
hudi-bot commented on PR #9412: URL: https://github.com/apache/hudi/pull/9412#issuecomment-1672559947 ## CI report: * 24c43e61e9a304224df2ca5e2001551974348671 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19237)
[GitHub] [hudi] hudi-bot commented on pull request #9412: [HUDI-6676] Add command for CreateHoodieTableLike
hudi-bot commented on PR #9412: URL: https://github.com/apache/hudi/pull/9412#issuecomment-1672553728 ## CI report: * 24c43e61e9a304224df2ca5e2001551974348671 UNKNOWN
[GitHub] [hudi] danny0405 commented on a diff in pull request #9412: [HUDI-6676] Add command for CreateHoodieTableLike
danny0405 commented on code in PR #9412:
URL: https://github.com/apache/hudi/pull/9412#discussion_r1289540657

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/CreateHoodieTableLikeCommand.scala: ##
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command
+
+import org.apache.hudi.SparkAdapterSupport
+import org.apache.hudi.common.model.HoodieTableType
+import org.apache.hudi.common.util.ConfigUtils
+import org.apache.spark.sql.{AnalysisException, Row, SparkSession}
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType, HoodieCatalogTable}
+import org.apache.spark.sql.catalyst.util.CharVarcharUtils
+import org.apache.spark.sql.hudi.HoodieOptionConfig
+
+import scala.util.control.NonFatal
+
+case class CreateHoodieTableLikeCommand(targetTable: TableIdentifier,
+                                        sourceTable: TableIdentifier,
+                                        fileFormat: CatalogStorageFormat,
+                                        properties: Map[String, String] = Map.empty,
+                                        ignoreIfExists: Boolean)
+  extends HoodieLeafRunnableCommand with SparkAdapterSupport {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    val catalog = sparkSession.sessionState.catalog
+
+    val tableIsExists = catalog.tableExists(targetTable)
+    if (tableIsExists) {
+      if (ignoreIfExists) {
+        // scalastyle:off
+        return Seq.empty[Row]
+        // scalastyle:on
+      } else {
+        throw new IllegalArgumentException(s"Table $targetTable already exists.")
+      }
+    }
+
+    val sourceTableDesc = catalog.getTempViewOrPermanentTableMetadata(sourceTable)
+
+    val newStorage = if (fileFormat.inputFormat.isDefined) {
+      fileFormat
+    } else {
+      sourceTableDesc.storage.copy(locationUri = fileFormat.locationUri)
+    }
+
+    // If the location is specified, we create an external table internally.
+    // Otherwise create a managed table.
+    val tblType = if (newStorage.locationUri.isEmpty) {
+      CatalogTableType.MANAGED
+    } else {
+      CatalogTableType.EXTERNAL
+    }
+
+    val targetTableProperties = if (sparkAdapter.isHoodieTable(sourceTableDesc)) {
+      HoodieOptionConfig.extractHoodieOptions(sourceTableDesc.properties) ++ properties
+    } else {
+      properties
+    }
+
+    val newTableSchema = CharVarcharUtils.getRawSchema(sourceTableDesc.schema)
+    val newTableDesc = CatalogTable(
+      identifier = targetTable,
+      tableType = tblType,
+      storage = newStorage,
+      schema = newTableSchema,
+      provider = Some("hudi"),
+      partitionColumnNames = sourceTableDesc.partitionColumnNames,
+      bucketSpec = sourceTableDesc.bucketSpec,
+      properties = targetTableProperties,
+      tracksPartitionsInCatalog = sourceTableDesc.tracksPartitionsInCatalog)
+
+    val hoodieCatalogTable = HoodieCatalogTable(sparkSession, newTableDesc)
+    // check if there are conflict between table configs defined in hoodie table and properties defined in catalog.
+    CreateHoodieTableCommand.validateTblProperties(hoodieCatalogTable)
+
+    val queryAsProp = hoodieCatalogTable.catalogProperties.get(ConfigUtils.IS_QUERY_AS_RO_TABLE)
+    if (queryAsProp.isEmpty) {

Review Comment:
   Does the user specify this option through SQL options?
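The table-type choice and the property merge in the diff above can be sketched in isolation. The following is a minimal Java illustration, not the PR's code: `TableLikeSketch`, `chooseTableType` and `mergeProperties` are hypothetical names, and the sketch assumes the Hudi options of the source table have already been extracted into a plain map. It mirrors two behaviours visible in the diff: a table with a location becomes external, and user-supplied properties win over source options on key conflicts (as the right-hand side of Scala's `++` does).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Hudi or Spark: mirrors the decision logic in the diff.
final class TableLikeSketch {
    enum TableType { MANAGED, EXTERNAL }

    // A table with an explicit location is treated as external, otherwise as managed.
    static TableType chooseTableType(Optional<String> locationUri) {
        return locationUri.isPresent() ? TableType.EXTERNAL : TableType.MANAGED;
    }

    // Source options are carried over only for Hudi sources; user properties win on key conflicts.
    static Map<String, String> mergeProperties(boolean sourceIsHudi,
                                               Map<String, String> sourceHudiOptions,
                                               Map<String, String> userProperties) {
        Map<String, String> merged = new LinkedHashMap<>();
        if (sourceIsHudi) {
            merged.putAll(sourceHudiOptions);
        }
        merged.putAll(userProperties);
        return merged;
    }
}
```

For example, merging source options `{type=cow, primaryKey=id}` with user properties `{type=mor}` gives `type=mor` and keeps `primaryKey=id`; for a non-Hudi source only the user properties remain.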
[GitHub] [hudi] hudi-bot commented on pull request #9407: asyncService log prompt incomplete
hudi-bot commented on PR #9407: URL: https://github.com/apache/hudi/pull/9407#issuecomment-1672526657 ## CI report: * ce0c6dd5877e222dd64ce5ac6434d81168c08727 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19228) * a4d6304bd5173e950413dc5d65a3d04be1144303 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19236)
[jira] [Updated] (HUDI-6676) Add command for CreateHoodieTableLike
[ https://issues.apache.org/jira/browse/HUDI-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-6676:
    Labels: pull-request-available (was: )

> Add command for CreateHoodieTableLike
>
> Key: HUDI-6676
> URL: https://issues.apache.org/jira/browse/HUDI-6676
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Hui An
> Assignee: Hui An
> Priority: Major
> Labels: pull-request-available
>
> 1. Create table from non-hudi table
> 2. Create table from hudi table (The properties related to Hudi in the source Hudi table will be carried over)

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] boneanxs opened a new pull request, #9412: [HUDI-6676] Add command for CreateHoodieTableLike
boneanxs opened a new pull request, #9412:
URL: https://github.com/apache/hudi/pull/9412

### Change Logs

_Describe context and summary for this change. Highlight if any code was copied._

1. Create table from non-hudi table
2. Create table from hudi table (The properties related to Hudi in the source Hudi table will be carried over)

### Impact

_Describe any public API or user-facing feature change or any performance impact._

None

### Risk level (write none, low medium or high below)

_If medium or high, explain what verification was done to mitigate the risks._

None

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[GitHub] [hudi] wecharyu commented on a diff in pull request #9408: [HUDI-6671] Support 'alter table add partition' sql
wecharyu commented on code in PR #9408:
URL: https://github.com/apache/hudi/pull/9408#discussion_r1289521158

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala: ##
@@ -330,23 +330,36 @@ object HoodieSqlCommonUtils extends SparkAdapterSupport {
     val allPartitionPaths = hoodieCatalogTable.getPartitionPaths
     val enableHiveStylePartitioning = isHiveStyledPartitioning(allPartitionPaths, table)
     val enableEncodeUrl = isUrlEncodeEnabled(allPartitionPaths, table)
-    val partitionsToDrop = normalizedSpecs.map { spec =>
-      hoodieCatalogTable.partitionFields.map { partitionColumn =>
-        val encodedPartitionValue = if (enableEncodeUrl) {
-          PartitionPathEncodeUtils.escapePathName(spec(partitionColumn))
-        } else {
-          spec(partitionColumn)
-        }
-        if (enableHiveStylePartitioning) {
-          partitionColumn + "=" + encodedPartitionValue
-        } else {

Review Comment:
   It is not removed. The code now reuses the common implementation `makePartitionPath`, which handles both hive-style partitioning and URL encoding.
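To make the point of this reply concrete, here is a rough Java sketch of a single helper that covers both concerns from the removed lines: optional escaping of the partition value and optional hive-style `column=value` naming. It is not Hudi's `makePartitionPath`; `PartitionPathSketch`, `buildPartitionPath` and `escape` are hypothetical names, and the small set of escaped characters is an assumption chosen for illustration rather than the set used by `PartitionPathEncodeUtils.escapePathName`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Hudi's implementation: one helper for hive-style naming and escaping.
final class PartitionPathSketch {
    // Simplified escaping (assumption): percent-encode a few characters that are unsafe in a path segment.
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (c == '/' || c == '=' || c == ':' || c == '%') {
                sb.append('%').append(String.format("%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Builds one relative partition path from a partition spec, in partition-column order.
    static String buildPartitionPath(List<String> partitionFields, Map<String, String> spec,
                                     boolean hiveStyle, boolean urlEncode) {
        List<String> segments = new ArrayList<>();
        for (String field : partitionFields) {
            String value = spec.get(field);
            if (value == null) {
                throw new IllegalArgumentException("Missing value for partition column " + field);
            }
            String encoded = urlEncode ? escape(value) : value;
            segments.add(hiveStyle ? field + "=" + encoded : encoded);
        }
        return String.join("/", segments);
    }
}
```

With both flags enabled, a spec of `dt=2023/08/10, hh=10` yields `dt=2023%2F08%2F10/hh=10`; with both disabled it yields `2023/08/10/10`, which shows why escaping matters when values contain path separators.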
[GitHub] [hudi] hudi-bot commented on pull request #9407: asyncService log prompt incomplete
hudi-bot commented on PR #9407: URL: https://github.com/apache/hudi/pull/9407#issuecomment-1672520993 ## CI report: * ce0c6dd5877e222dd64ce5ac6434d81168c08727 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19228) * a4d6304bd5173e950413dc5d65a3d04be1144303 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9411: [HUDI-6674] Add rollback info from metadata table in timeline commands
hudi-bot commented on PR #9411: URL: https://github.com/apache/hudi/pull/9411#issuecomment-1672514880 ## CI report: * 6a8aa88016ab8c2b2cab779f45ac2ecd409f3742 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19234) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19233)
[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql
hudi-bot commented on PR #9408: URL: https://github.com/apache/hudi/pull/9408#issuecomment-1672514851 ## CI report: * 65e9f9828da86e4558b1830493ead64366e69fae Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19229) * 89c387c5edc9044786899bc1288e35121df600f9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19235)
[jira] [Updated] (HUDI-6676) Add command for CreateHoodieTableLike
[ https://issues.apache.org/jira/browse/HUDI-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hui An updated HUDI-6676:
    Description:
1. Create table from non-hudi table
2. Create table from hudi table (The properties related to Hudi in the source Hudi table will be carried over)

> Add command for CreateHoodieTableLike
>
> Key: HUDI-6676
> URL: https://issues.apache.org/jira/browse/HUDI-6676
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Hui An
> Assignee: Hui An
> Priority: Major
>
> 1. Create table from non-hudi table
> 2. Create table from hudi table (The properties related to Hudi in the source Hudi table will be carried over)
[GitHub] [hudi] danny0405 commented on a diff in pull request #9407: asyncService log prompt incomplete
danny0405 commented on code in PR #9407:
URL: https://github.com/apache/hudi/pull/9407#discussion_r1289510346

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/HoodieAsyncService.java: ##
@@ -196,11 +196,11 @@ public void waitTillPendingAsyncServiceInstantsReducesTo(int numPending) throws
   }

   /**
-   * Enqueues new pending clustering instant.
+   * Enqueues new pending compaction/clustering instant.

Review Comment:
```suggestion
   * Enqueues new pending table service instant.
```
[GitHub] [hudi] danny0405 commented on a diff in pull request #9407: asyncService log prompt incomplete
danny0405 commented on code in PR #9407:
URL: https://github.com/apache/hudi/pull/9407#discussion_r1289510239

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/HoodieAsyncService.java: ##
@@ -196,11 +196,11 @@ public void waitTillPendingAsyncServiceInstantsReducesTo(int numPending) throws
   }

   /**
-   * Enqueues new pending clustering instant.
+   * Enqueues new pending compaction/clustering instant.
    * @param instant {@link HoodieInstant} to enqueue.
    */
   public void enqueuePendingAsyncServiceInstant(HoodieInstant instant) {
-    LOG.info("Enqueuing new pending clustering instant: " + instant.getTimestamp());
+    LOG.info("Enqueuing new pending compaction/clustering instant: " + instant.getTimestamp());

Review Comment:
```suggestion
    LOG.info("Enqueuing new pending table service instant: " + instant.getTimestamp());
```
[jira] [Created] (HUDI-6676) Add command for CreateHoodieTableLike
Hui An created HUDI-6676:

    Summary: Add command for CreateHoodieTableLike
    Key: HUDI-6676
    URL: https://issues.apache.org/jira/browse/HUDI-6676
    Project: Apache Hudi
    Issue Type: Improvement
    Reporter: Hui An
    Assignee: Hui An
[GitHub] [hudi] danny0405 commented on a diff in pull request #9403: Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource
danny0405 commented on code in PR #9403:
URL: https://github.com/apache/hudi/pull/9403#discussion_r1289505376

## hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/KafkaOffsetPostProcessor.java: ##
@@ -54,21 +56,23 @@ public static boolean shouldAddOffsets(TypedProperties props) {
   public static final String KAFKA_SOURCE_OFFSET_COLUMN = "_hoodie_kafka_source_offset";
   public static final String KAFKA_SOURCE_PARTITION_COLUMN = "_hoodie_kafka_source_partition";
   public static final String KAFKA_SOURCE_TIMESTAMP_COLUMN = "_hoodie_kafka_source_timestamp";
+  public static final String KAFKA_SOURCE_KEY_COLUMN = "_hoodie_kafka_source_key";

   public KafkaOffsetPostProcessor(TypedProperties props, JavaSparkContext jssc) {
     super(props, jssc);
   }

   @Override
   public Schema processSchema(Schema schema) {
-    // this method adds kafka offset fields namely source offset, partition and timestamp to the schema of the batch.
+    // this method adds kafka offset fields namely source offset, partition, timestamp and kafka message key to the schema of the batch.
     try {
       List<Schema.Field> fieldList = schema.getFields();
       List<Schema.Field> newFieldList = fieldList.stream()
           .map(f -> new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal())).collect(Collectors.toList());
       newFieldList.add(new Schema.Field(KAFKA_SOURCE_OFFSET_COLUMN, Schema.create(Schema.Type.LONG), "offset column", 0));
       newFieldList.add(new Schema.Field(KAFKA_SOURCE_PARTITION_COLUMN, Schema.create(Schema.Type.INT), "partition column", 0));
       newFieldList.add(new Schema.Field(KAFKA_SOURCE_TIMESTAMP_COLUMN, Schema.create(Schema.Type.LONG), "timestamp column", 0));
+      newFieldList.add(new Schema.Field(KAFKA_SOURCE_KEY_COLUMN, createNullableSchema(Schema.Type.STRING), "kafka key column", JsonProperties.NULL_VALUE));
       Schema newSchema = Schema.createRecord(schema.getName() + "_processed", schema.getDoc(), schema.getNamespace(), false, newFieldList);

Review Comment:
   Is the key always a string type? Could it be bytes in Kafka?
[GitHub] [hudi] danny0405 commented on a diff in pull request #9403: Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource
danny0405 commented on code in PR #9403:
URL: https://github.com/apache/hudi/pull/9403#discussion_r1289505153

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/AvroConvertor.java: ##
@@ -175,9 +176,12 @@ public GenericRecord withKafkaFieldsAppended(ConsumerRecord consumerRecord) {
     for (Schema.Field field : record.getSchema().getFields()) {
       recordBuilder.set(field, record.get(field.name()));
     }
+
+    String kafkaKey = String.valueOf(consumerRecord.key());
     recordBuilder.set(KAFKA_SOURCE_OFFSET_COLUMN, consumerRecord.offset());
     recordBuilder.set(KAFKA_SOURCE_PARTITION_COLUMN, consumerRecord.partition());
     recordBuilder.set(KAFKA_SOURCE_TIMESTAMP_COLUMN, consumerRecord.timestamp());
+    recordBuilder.set(KAFKA_SOURCE_KEY_COLUMN, kafkaKey);

Review Comment:
```suggestion
    recordBuilder.set(KAFKA_SOURCE_KEY_COLUMN, String.valueOf(consumerRecord.key()));
```
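Both review comments on PR #9403 concern how the Kafka key becomes a string. In plain Java, `String.valueOf` applied to a null `Object` returns the literal text "null", and applied to a `byte[]` it returns a type-and-hash identity string rather than the key's content. A minimal sketch of a conversion that handles both cases follows. `KafkaKeySketch` and `keyToString` are hypothetical names, not part of Hudi or the Kafka client, and decoding byte keys as UTF-8 is an assumption that only holds for text keys.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hudi: turns a Kafka record key of unknown type into a nullable string.
final class KafkaKeySketch {
    static String keyToString(Object key) {
        if (key == null) {
            // Keep the column null instead of the literal "null" that String.valueOf would produce.
            return null;
        }
        if (key instanceof byte[]) {
            // Assumption: byte keys carry UTF-8 text; binary keys would need another encoding such as Base64.
            return new String((byte[]) key, StandardCharsets.UTF_8);
        }
        return String.valueOf(key);
    }
}
```

Returning null for a missing key fits the nullable string schema that the PR declares for `_hoodie_kafka_source_key`.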
[GitHub] [hudi] danny0405 commented on a diff in pull request #9408: [HUDI-6671] Support 'alter table add partition' sql
danny0405 commented on code in PR #9408:
URL: https://github.com/apache/hudi/pull/9408#discussion_r1289497305

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala: ##
@@ -330,23 +330,36 @@ object HoodieSqlCommonUtils extends SparkAdapterSupport {
     val allPartitionPaths = hoodieCatalogTable.getPartitionPaths
     val enableHiveStylePartitioning = isHiveStyledPartitioning(allPartitionPaths, table)
     val enableEncodeUrl = isUrlEncodeEnabled(allPartitionPaths, table)
-    val partitionsToDrop = normalizedSpecs.map { spec =>
-      hoodieCatalogTable.partitionFields.map { partitionColumn =>
-        val encodedPartitionValue = if (enableEncodeUrl) {
-          PartitionPathEncodeUtils.escapePathName(spec(partitionColumn))
-        } else {
-          spec(partitionColumn)
-        }
-        if (enableHiveStylePartitioning) {
-          partitionColumn + "=" + encodedPartitionValue
-        } else {

Review Comment:
   Why remove the handling of hive-style partitioning?
[GitHub] [hudi] Zouxxyy commented on pull request #4974: [HUDI-3494] Consider triggering condition of MOR compaction during archival
Zouxxyy commented on PR #4974: URL: https://github.com/apache/hudi/pull/4974#issuecomment-1672490060 - The default triggering condition is the number of delta commits, with the config of hoodie.compact.inline.max.delta.commits. If this setting is larger than the archival config of hoodie.keep.max.commits, there are not enough delta commits in the active timeline and the compaction will never happen. Why not just throw an exception when `hoodie.compact.inline.max.delta.commits` > `hoodie.keep.max.commits`?
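The fail-fast check suggested here could look roughly like the following. This is a sketch only: `CompactionConfigCheck` and `validate` are hypothetical names, not an existing Hudi validation, and it assumes both settings are available as plain string properties under the two keys named in the comment. It deliberately skips the check when either key is absent, since the defaults are not stated in this thread.

```java
import java.util.Map;

// Hypothetical fail-fast check, not an existing Hudi validation.
final class CompactionConfigCheck {
    static final String MAX_DELTA_COMMITS = "hoodie.compact.inline.max.delta.commits";
    static final String KEEP_MAX_COMMITS = "hoodie.keep.max.commits";

    // Rejects configurations where compaction could never trigger because the active
    // timeline is archived before enough delta commits accumulate.
    static void validate(Map<String, String> props) {
        String delta = props.get(MAX_DELTA_COMMITS);
        String keep = props.get(KEEP_MAX_COMMITS);
        if (delta == null || keep == null) {
            return; // only compare values that were set explicitly
        }
        int maxDeltaCommits = Integer.parseInt(delta.trim());
        int keepMaxCommits = Integer.parseInt(keep.trim());
        if (maxDeltaCommits > keepMaxCommits) {
            throw new IllegalArgumentException(
                MAX_DELTA_COMMITS + " (" + maxDeltaCommits + ") must not exceed "
                    + KEEP_MAX_COMMITS + " (" + keepMaxCommits + ")");
        }
    }
}
```

Throwing at configuration time surfaces the conflict immediately, whereas the PR's approach of taking the compaction trigger into account during archival lets such a configuration keep working.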
[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql
hudi-bot commented on PR #9408: URL: https://github.com/apache/hudi/pull/9408#issuecomment-1672487101 ## CI report: * 65e9f9828da86e4558b1830493ead64366e69fae Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19229) * 89c387c5edc9044786899bc1288e35121df600f9 UNKNOWN
[jira] [Updated] (HUDI-6675) Clean action will delete the whole table
[ https://issues.apache.org/jira/browse/HUDI-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sanqingleo updated HUDI-6675:
    Summary: Clean action will delete the whole table (was: InsertOverwrite will delete the whole table)

> Clean action will delete the whole table
>
> Key: HUDI-6675
> URL: https://issues.apache.org/jira/browse/HUDI-6675
> Project: Apache Hudi
> Issue Type: Bug
> Components: cleaning
> Affects Versions: 0.11.1, 0.13.0
> Environment: hudi 0.11 and 0.13, spark 3.4
> Reporter: sanqingleo
> Priority: Major
> Attachments: image-2023-08-10-10-35-02-798.png, image-2023-08-10-10-37-05-339.png
>
> h1. Abstract
> When I use the insert_overwrite feature, in both Spark SQL and the API, it cleans the whole table when the table is not partitioned, and then throws this exception:
> !image-2023-08-10-10-37-05-339.png!
> h1. Version
> # hudi 0.11 and 0.13
> # spark 3.4
> h1. Bug Position
> org.apache.hudi.table.action.clean.CleanActionExecutor#deleteFileAndGetResult
> !image-2023-08-10-10-35-02-798.png!
> h1. How to reproduce
> Run the job 4 times; the fourth run triggers the clean action.
> 0.11: both SQL and API
> 0.13: API only
>
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
> import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
>
> object InsertOverwriteTest {
>   def main(array: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .appName("TestInsertOverwrite")
>       .master("local[4]")
>       .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
>       .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>       .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
>       .getOrCreate()
>     spark.conf.set("hoodie.index.type", "BUCKET")
>     spark.conf.set("hoodie.storage.layout.type", "BUCKET")
>     spark.conf.set("HADOOP_USER_NAME", "parallels")
>     System.setProperty("HADOOP_USER_NAME", "parallels")
>
>     // Six sample rows: uuid, age, ts, par
>     val seq = List(
>       Row("uuid_01", "27", "2022-09-23", "par_01"),
>       Row("uuid_02", "21", "2022-09-23", "par_02"),
>       Row("uuid_03", "23", "2022-09-23", "par_04"),
>       Row("uuid_04", "24", "2022-09-23", "par_02"),
>       Row("uuid_05", "26", "2022-09-23", "par_01"),
>       Row("uuid_06", "20", "2022-09-23", "par_03")
>     )
>     val rdd = spark.sparkContext.parallelize(seq)
>     val structType: StructType = StructType(Array(
>       StructField("uuid", DataTypes.StringType, nullable = true),
>       StructField("age", DataTypes.StringType, nullable = true),
>       StructField("ts", DataTypes.StringType, nullable = true),
>       StructField("par", DataTypes.StringType, nullable = true)
>     ))
>     // createOrReplaceTempView returns Unit, so there is nothing to assign here.
>     spark.createDataFrame(rdd, structType)
>       .createOrReplaceTempView("compact_test_num")
>     val df: DataFrame = spark.sql(" select uuid, age, ts, par from compact_test_num limit 10")
>
>     // Non-partitioned COW table written with insert_overwrite and a bucket index.
>     df.write.format("org.apache.hudi")
>       .option(RECORDKEY_FIELD.key, "uuid")
>       .option(PRECOMBINE_FIELD.key, "ts")
>       // .option(PARTITIONPATH_FIELD.key(), "par")
>       .option("hoodie.table.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
>       .option(KEYGENERATOR_CLASS_NAME.key, "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
>       // .option(KEYGENERATOR_CLASS_NAME.key, "org.apache.hudi.keygen.ComplexKeyGenerator")
>       .option(OPERATION.key, INSERT_OVERWRITE_OPERATION_OPT_VAL)
>       .option(TABLE_TYPE.key, COW_TABLE_TYPE_OPT_VAL)
>       .option("hoodie.metadata.enable", "false")
>       .option("hoodie.index.type", "BUCKET")
>       .option("hoodie.bucket.index.hash.field", "uuid")
>       .option("hoodie.bucket.index.num.buckets", "2")
>       .option("hoodie.storage.layout.type", "BUCKET")
>       .option("hoodie.storage.layout.partitioner.class", "org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner")
>       .option("hoodie.table.name", "cow_20230801_012")
>       .option("hoodie.upsert.shuffle.parallelism", "2")
>       .option("hoodie.insert.shuffle.parallelism", "2")
>       .option("hoodie.delete.shuffle.parallelism", "2")
>       .option("hoodie.clean.max.commits", "2")
>       .option("hoodie.cleaner.commits.retained", "2")
>       .option("hoodie.datasource.write.hive_style_partitioning", "true")
>       .mode(SaveMode.Append)
>       .save("hdfs://bigdata01:9000/hudi_test/cow_20230801_012")
>   }
> }
> {code}
[jira] [Reopened] (HUDI-6675) InsertOverwrite will delete the whole table
[ https://issues.apache.org/jira/browse/HUDI-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sanqingleo reopened HUDI-6675:
------------------------------

> InsertOverwrite will delete the whole table
> -------------------------------------------
>
> Key: HUDI-6675
> URL: https://issues.apache.org/jira/browse/HUDI-6675
> Project: Apache Hudi
> Issue Type: Bug
> Components: cleaning
> Affects Versions: 0.11.1, 0.13.0
> Environment: hudi 0.11 and 0.13, spark 3.4
> Reporter: sanqingleo
> Priority: Major
> Attachments: image-2023-08-10-10-35-02-798.png, image-2023-08-10-10-37-05-339.png
>
>
> h1. Abstract
> When I use the insert_overwrite feature, in both Spark SQL and the API, it cleans the whole table when the table is not partitioned, and then throws this exception:
> !image-2023-08-10-10-37-05-339.png!
> h1. Version
> # hudi 0.11 and 0.13
> # spark 3.4
> h1. Bug Position
> org.apache.hudi.table.action.clean.CleanActionExecutor#deleteFileAndGetResult
> !image-2023-08-10-10-35-02-798.png!
> h1. How to reproduce
> The job needs to run 4 times; the fourth run triggers the clean action.
> 0.11: both SQL and API
> 0.13: API only
>
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
> import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
>
> object InsertOverwriteTest {
>   def main(array: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .appName("TestInsertOverwrite")
>       .master("local[4]")
>       .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
>       .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>       .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
>       .getOrCreate()
>     spark.conf.set("hoodie.index.type", "BUCKET")
>     spark.conf.set("hoodie.storage.layout.type", "BUCKET")
>     spark.conf.set("HADOOP_USER_NAME", "parallels")
>     System.setProperty("HADOOP_USER_NAME", "parallels")
>
>     // Six sample rows: uuid, age, ts, par
>     val seq = List(
>       Row("uuid_01", "27", "2022-09-23", "par_01"),
>       Row("uuid_02", "21", "2022-09-23", "par_02"),
>       Row("uuid_03", "23", "2022-09-23", "par_04"),
>       Row("uuid_04", "24", "2022-09-23", "par_02"),
>       Row("uuid_05", "26", "2022-09-23", "par_01"),
>       Row("uuid_06", "20", "2022-09-23", "par_03")
>     )
>     val rdd = spark.sparkContext.parallelize(seq)
>     val structType: StructType = StructType(Array(
>       StructField("uuid", DataTypes.StringType, nullable = true),
>       StructField("age", DataTypes.StringType, nullable = true),
>       StructField("ts", DataTypes.StringType, nullable = true),
>       StructField("par", DataTypes.StringType, nullable = true)
>     ))
>     // Register the rows as a temp view (createOrReplaceTempView returns Unit)
>     spark.createDataFrame(rdd, structType)
>       .createOrReplaceTempView("compact_test_num")
>     val df: DataFrame = spark.sql("select uuid, age, ts, par from compact_test_num limit 10")
>
>     // Non-partitioned COW table written with insert_overwrite
>     df.write.format("org.apache.hudi")
>       .option(RECORDKEY_FIELD.key, "uuid")
>       .option(PRECOMBINE_FIELD.key, "ts")
>       // .option(PARTITIONPATH_FIELD.key(), "par")
>       .option("hoodie.table.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
>       .option(KEYGENERATOR_CLASS_NAME.key, "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
>       // .option(KEYGENERATOR_CLASS_NAME.key, "org.apache.hudi.keygen.ComplexKeyGenerator")
>       .option(OPERATION.key, INSERT_OVERWRITE_OPERATION_OPT_VAL)
>       .option(TABLE_TYPE.key, COW_TABLE_TYPE_OPT_VAL)
>       .option("hoodie.metadata.enable", "false")
>       .option("hoodie.index.type", "BUCKET")
>       .option("hoodie.bucket.index.hash.field", "uuid")
>       .option("hoodie.bucket.index.num.buckets", "2")
>       .option("hoodie.storage.layout.type", "BUCKET")
>       .option("hoodie.storage.layout.partitioner.class", "org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner")
>       .option("hoodie.table.name", "cow_20230801_012")
>       .option("hoodie.upsert.shuffle.parallelism", "2")
>       .option("hoodie.insert.shuffle.parallelism", "2")
>       .option("hoodie.delete.shuffle.parallelism", "2")
>       // Cleaner settings: the clean action fires on the fourth run
>       .option("hoodie.clean.max.commits", "2")
>       .option("hoodie.cleaner.commits.retained", "2")
>       .option("hoodie.datasource.write.hive_style_partitioning", "true")
>       .mode(SaveMode.Append)
>       .save("hdfs://bigdata01:9000/hudi_test/cow_20230801_012")
>   }
> }
> {code}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
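The writer options in the reproduction above can be gathered in one place, which makes the two cleaner-related settings easier to spot. This is only a minimal sketch: the object name `InsertOverwriteReproOptions` is hypothetical and not part of Hudi, and the string keys are assumed to be what the `DataSourceWriteOptions` constants used in the report (`RECORDKEY_FIELD`, `PRECOMBINE_FIELD`, `KEYGENERATOR_CLASS_NAME`, `OPERATION`, `TABLE_TYPE`) resolve to.

```scala
// Hypothetical helper, not part of Hudi: collects the options from the report in one map.
object InsertOverwriteReproOptions {
  val tableName: String = "cow_20230801_012"

  // Keys are written as plain strings; they are assumed to match the
  // DataSourceWriteOptions constants referenced in the report.
  val writeOptions: Map[String, String] = Map(
    "hoodie.table.name" -> tableName,
    "hoodie.datasource.write.recordkey.field" -> "uuid",
    "hoodie.datasource.write.precombine.field" -> "ts",
    "hoodie.table.keygenerator.class" -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    "hoodie.datasource.write.operation" -> "insert_overwrite",
    "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",
    "hoodie.metadata.enable" -> "false",
    "hoodie.index.type" -> "BUCKET",
    "hoodie.bucket.index.hash.field" -> "uuid",
    "hoodie.bucket.index.num.buckets" -> "2",
    "hoodie.storage.layout.type" -> "BUCKET",
    "hoodie.storage.layout.partitioner.class" -> "org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner",
    "hoodie.upsert.shuffle.parallelism" -> "2",
    "hoodie.insert.shuffle.parallelism" -> "2",
    "hoodie.delete.shuffle.parallelism" -> "2",
    "hoodie.clean.max.commits" -> "2",
    "hoodie.cleaner.commits.retained" -> "2",
    "hoodie.datasource.write.hive_style_partitioning" -> "true"
  )

  // The subset that controls when the cleaner runs and how many commits it retains.
  val cleanerOptions: Map[String, String] =
    writeOptions.filter { case (key, _) => key.startsWith("hoodie.clean") }
}
```

With a Spark session and the Hudi bundle on the classpath, the map could be passed to the writer as `df.write.format("org.apache.hudi").options(InsertOverwriteReproOptions.writeOptions).mode(SaveMode.Append).save(basePath)`, where `basePath` stands for the table location used in the report.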
[jira] [Resolved] (HUDI-6675) InsertOverwrite will delete the whole table
[ https://issues.apache.org/jira/browse/HUDI-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sanqingleo resolved HUDI-6675.
------------------------------

> InsertOverwrite will delete the whole table
> -------------------------------------------
>
> Key: HUDI-6675
> URL: https://issues.apache.org/jira/browse/HUDI-6675
> Project: Apache Hudi
> Issue Type: Bug
> Components: cleaning
> Affects Versions: 0.11.1, 0.13.0
> Environment: hudi 0.11 both 0.13. spark 3.4
> Reporter: sanqingleo
> Priority: Major
> Attachments: image-2023-08-10-10-35-02-798.png, image-2023-08-10-10-37-05-339.png
[jira] [Created] (HUDI-6675) InsertOverwrite will delete the whole table
sanqingleo created HUDI-6675:

Summary: InsertOverwrite will delete the whole table
Key: HUDI-6675
URL: https://issues.apache.org/jira/browse/HUDI-6675
Project: Apache Hudi
Issue Type: Bug
Components: cleaning
Affects Versions: 0.13.0, 0.11.1
Environment: hudi 0.11 and 0.13, spark 3.4
Reporter: sanqingleo
Attachments: image-2023-08-10-10-35-02-798.png, image-2023-08-10-10-37-05-339.png

h1. Abstract
When I use the insert_overwrite feature, in both Spark SQL and the API, it cleans the whole table when the table is not partitioned, and then throws this exception:
!image-2023-08-10-10-37-05-339.png!
h1. Version
# hudi 0.11 and 0.13
# spark 3.4
h1. Bug Position
org.apache.hudi.table.action.clean.CleanActionExecutor#deleteFileAndGetResult
!image-2023-08-10-10-35-02-798.png!
h1. How to reproduce
The job needs to run 4 times; the fourth run triggers the clean action.
0.11: both SQL and API
0.13: API only

{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}

object InsertOverwriteTest {
  def main(array: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TestInsertOverwrite")
      .master("local[4]")
      .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
      .getOrCreate()
    spark.conf.set("hoodie.index.type", "BUCKET")
    spark.conf.set("hoodie.storage.layout.type", "BUCKET")
    spark.conf.set("HADOOP_USER_NAME", "parallels")
    System.setProperty("HADOOP_USER_NAME", "parallels")

    // Six sample rows: uuid, age, ts, par
    val seq = List(
      Row("uuid_01", "27", "2022-09-23", "par_01"),
      Row("uuid_02", "21", "2022-09-23", "par_02"),
      Row("uuid_03", "23", "2022-09-23", "par_04"),
      Row("uuid_04", "24", "2022-09-23", "par_02"),
      Row("uuid_05", "26", "2022-09-23", "par_01"),
      Row("uuid_06", "20", "2022-09-23", "par_03")
    )
    val rdd = spark.sparkContext.parallelize(seq)
    val structType: StructType = StructType(Array(
      StructField("uuid", DataTypes.StringType, nullable = true),
      StructField("age", DataTypes.StringType, nullable = true),
      StructField("ts", DataTypes.StringType, nullable = true),
      StructField("par", DataTypes.StringType, nullable = true)
    ))
    // Register the rows as a temp view (createOrReplaceTempView returns Unit)
    spark.createDataFrame(rdd, structType)
      .createOrReplaceTempView("compact_test_num")
    val df: DataFrame = spark.sql("select uuid, age, ts, par from compact_test_num limit 10")

    // Non-partitioned COW table written with insert_overwrite
    df.write.format("org.apache.hudi")
      .option(RECORDKEY_FIELD.key, "uuid")
      .option(PRECOMBINE_FIELD.key, "ts")
      // .option(PARTITIONPATH_FIELD.key(), "par")
      .option("hoodie.table.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
      .option(KEYGENERATOR_CLASS_NAME.key, "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
      // .option(KEYGENERATOR_CLASS_NAME.key, "org.apache.hudi.keygen.ComplexKeyGenerator")
      .option(OPERATION.key, INSERT_OVERWRITE_OPERATION_OPT_VAL)
      .option(TABLE_TYPE.key, COW_TABLE_TYPE_OPT_VAL)
      .option("hoodie.metadata.enable", "false")
      .option("hoodie.index.type", "BUCKET")
      .option("hoodie.bucket.index.hash.field", "uuid")
      .option("hoodie.bucket.index.num.buckets", "2")
      .option("hoodie.storage.layout.type", "BUCKET")
      .option("hoodie.storage.layout.partitioner.class", "org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner")
      .option("hoodie.table.name", "cow_20230801_012")
      .option("hoodie.upsert.shuffle.parallelism", "2")
      .option("hoodie.insert.shuffle.parallelism", "2")
      .option("hoodie.delete.shuffle.parallelism", "2")
      // Cleaner settings: the clean action fires on the fourth run
      .option("hoodie.clean.max.commits", "2")
      .option("hoodie.cleaner.commits.retained", "2")
      .option("hoodie.datasource.write.hive_style_partitioning", "true")
      .mode(SaveMode.Append)
      .save("hdfs://bigdata01:9000/hudi_test/cow_20230801_012")
  }
}
{code}
[GitHub] [hudi] someguyLi commented on issue #9363: [SUPPORT] Streaming query loss delete data
someguyLi commented on issue #9363: URL: https://github.com/apache/hudi/issues/9363#issuecomment-1672439581

> The Hudi table is used like a message queue, so TTL is the general way of controlling how long records stay available. There is no good solution for this. Kafka either throws an exception or lets the consumer fall back to the latest/oldest offset to recover, but neither approach works well for a changelog, because any lost change would lead to incorrect results.

Thanks for your support, I will find another way.
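The point about lost changes can be illustrated with a toy replay. This is a sketch under stated assumptions, not Hudi or Kafka code: `ChangelogLossSketch`, `Upsert`, `Delete`, and `replay` are hypothetical names, and the "retention window" is modeled simply by dropping one entry from the log.

```scala
// Toy model with hypothetical names; it is not Hudi or Kafka code.
object ChangelogLossSketch {
  sealed trait Change
  final case class Upsert(key: String, value: Int) extends Change
  final case class Delete(key: String) extends Change

  // Replays a changelog on top of an initial keyed state.
  def replay(initial: Map[String, Int], log: Seq[Change]): Map[String, Int] =
    log.foldLeft(initial) {
      case (state, Upsert(k, v)) => state + (k -> v)
      case (state, Delete(k))    => state - k
    }

  val log: Seq[Change] =
    Seq(Upsert("a", 1), Upsert("b", 2), Delete("a"), Upsert("b", 3))

  // A consumer that read every change.
  val complete: Map[String, Int] = replay(Map.empty, log)

  // A consumer that read the first two changes, fell behind the retention
  // window, and resumed from the latest position: Delete("a") was never seen.
  val afterSkip: Map[String, Int] =
    replay(replay(Map.empty, log.take(2)), log.drop(3))
}
```

Here `complete` no longer contains key `a`, while `afterSkip` still does, which is the kind of incorrect result the quoted comment describes for a consumer that skips part of a changelog.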
[GitHub] [hudi] hudi-bot commented on pull request #9411: [HUDI-6674] Add rollback info from metadata table in timeline commands
hudi-bot commented on PR #9411: URL: https://github.com/apache/hudi/pull/9411#issuecomment-1672423674 ## CI report: * 6a8aa88016ab8c2b2cab779f45ac2ecd409f3742 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19234) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19233)
[GitHub] [hudi] hudi-bot commented on pull request #8327: [HUDI-5361] Propagate all hoodie configs from spark sqlconf, but don't overwrite values already set
hudi-bot commented on PR #8327: URL: https://github.com/apache/hudi/pull/8327#issuecomment-1672385859 ## CI report: * 94e4c2e74c6170ceee8c303f7237bd10f2cd334f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19232)
[GitHub] [hudi] hudi-bot commented on pull request #9410: [HUDI-6673] Fix Incremental Query Syntax - Spark SQL Core Flow Test
hudi-bot commented on PR #9410: URL: https://github.com/apache/hudi/pull/9410#issuecomment-1672381897 ## CI report: * a3bd3418eccb373f200139996d34b8cc71913a62 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19231)
[GitHub] [hudi] hudi-bot commented on pull request #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices
hudi-bot commented on PR #9409: URL: https://github.com/apache/hudi/pull/9409#issuecomment-1672296128 ## CI report: * d567d80ea610ed8eca248901d310bd40ae4bf8e5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19230)
[GitHub] [hudi] hudi-bot commented on pull request #9411: [HUDI-6674] Add rollback info from metadata table in timeline commands
hudi-bot commented on PR #9411: URL: https://github.com/apache/hudi/pull/9411#issuecomment-1672282598 ## CI report: * 6a8aa88016ab8c2b2cab779f45ac2ecd409f3742 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19234)
[GitHub] [hudi] hudi-bot commented on pull request #8327: [HUDI-5361] Propagate all hoodie configs from spark sqlconf, but don't overwrite values already set
hudi-bot commented on PR #8327: URL: https://github.com/apache/hudi/pull/8327#issuecomment-1672232135 ## CI report: * b3388a3bb559227d2415e747681326f6109b4cc2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15998) * 94e4c2e74c6170ceee8c303f7237bd10f2cd334f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19232)
[GitHub] [hudi] neeruks commented on issue #5348: [SUPPORT]org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20220418194506064
neeruks commented on issue #5348: URL: https://github.com/apache/hudi/issues/5348#issuecomment-1672230698 I am also getting the same error. I am using Glue to read the CSV file and write it into a Hudi table. py4j.protocol.Py4JJavaError: An error occurred while calling o326.save. : org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20230809204110303
[GitHub] [hudi] neeruks commented on issue #2970: [SUPPORT] Failed to upsert for commit time
neeruks commented on issue #2970: URL: https://github.com/apache/hudi/issues/2970#issuecomment-1672228563 I am also getting the same error. I am using Glue to read the CSV file and write it into a Hudi table. py4j.protocol.Py4JJavaError: An error occurred while calling o326.save. : org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20230809204110303
[GitHub] [hudi] hudi-bot commented on pull request #9410: [HUDI-6673] Fix Incremental Query Syntax - Spark SQL Core Flow Test
hudi-bot commented on PR #9410: URL: https://github.com/apache/hudi/pull/9410#issuecomment-1672223948 ## CI report: * a3bd3418eccb373f200139996d34b8cc71913a62 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19231)
[GitHub] [hudi] hudi-bot commented on pull request #9411: [HUDI-6674] Add rollback info from metadata table in timeline commands
hudi-bot commented on PR #9411: URL: https://github.com/apache/hudi/pull/9411#issuecomment-1672224229 ## CI report: * 6a8aa88016ab8c2b2cab779f45ac2ecd409f3742 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8327: [HUDI-5361] Propagate all hoodie configs from spark sqlconf, but don't overwrite values already set
hudi-bot commented on PR #8327: URL: https://github.com/apache/hudi/pull/8327#issuecomment-1672221796 ## CI report: * b3388a3bb559227d2415e747681326f6109b4cc2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15998) * 94e4c2e74c6170ceee8c303f7237bd10f2cd334f UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9410: [HUDI-6673] Fix Incremental Query Syntax - Spark SQL Core Flow Test
hudi-bot commented on PR #9410: URL: https://github.com/apache/hudi/pull/9410#issuecomment-1672215254 ## CI report: * a3bd3418eccb373f200139996d34b8cc71913a62 UNKNOWN
[jira] [Updated] (HUDI-6674) Add rollback info from metadata table in timeline commands
[ https://issues.apache.org/jira/browse/HUDI-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6674: - Labels: pull-request-available (was: ) > Add rollback info from metadata table in timeline commands > -- > > Key: HUDI-6674 > URL: https://issues.apache.org/jira/browse/HUDI-6674 > Project: Apache Hudi > Issue Type: New Feature > Reporter: Ethan Guo > Assignee: Ethan Guo > Priority: Major > Labels: pull-request-available > Fix For: 0.14.0
[GitHub] [hudi] yihua opened a new pull request, #9411: [HUDI-6674] Add rollback info from metadata table in timeline commands
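yihua opened a new pull request, #9411:
URL: https://github.com/apache/hudi/pull/9411

### Change Logs

This PR adds the rollback information from the metadata table to the output of the timeline commands in Hudi CLI, given that the metadata table also encounters more rollbacks now. To make the table concise, the rollback information is added to the "Action" (for data table) or "MT Action" (for metadata table) column, instead of having an independent column showing the information.

Here's the new output:

```
hudi:hoodie_table->timeline show active --limit 200 --show-time-seconds --show-rollback-info
╔═════╤═══════════════════╤═══════════════════╤═══════════╤════════════════╤════════════════╤════════════════╗
║ No. │ Instant           │ Action            │ State     │ Requested      │ Inflight       │ Completed      ║
║     │                   │                   │           │ Time           │ Time           │ Time           ║
╠═════╪═══════════════════╪═══════════════════╪═══════════╪════════════════╪════════════════╪════════════════╣
...
╟─────┼───────────────────┼───────────────────┼───────────┼────────────────┼────────────────┼────────────────╢
║ 11  │ 20230807154601569 │ rollback          │ COMPLETED │ 08-07 08:46:03 │ 08-07 08:46:03 │ 08-07 08:47:58 ║
║     │                   │ Rolls back        │           │                │                │                ║
║     │                   │ 20230807154346625 │           │                │                │                ║
╟─────┼───────────────────┼───────────────────┼───────────┼────────────────┼────────────────┼────────────────╢
║ 12  │ 20230807154947753 │ rollback          │ COMPLETED │ 08-07 08:49:49 │ 08-07 08:49:49 │ 08-07 08:51:46 ║
║     │                   │ Rolls back        │           │                │                │                ║
║     │                   │ 20230807154720087 │           │                │                │                ║
╟─────┼───────────────────┼───────────────────┼───────────┼────────────────┼────────────────┼────────────────╢
║ 13  │ 20230807155105131 │ commit            │ COMPLETED │ 08-07 08:51:47 │ 08-07 08:54:29 │ 08-07 08:55:42 ║
╟─────┼───────────────────┼───────────────────┼───────────┼────────────────┼────────────────┼────────────────╢
...

hudi:hoodie_table->timeline show active --with-metadata-table --limit 200 --show-time-seconds
╔═════╤═══════════════════╤═══════════════════╤═══════════╤════════════════╤════════════════╤════════════════╤═══════════════════╤═══════════╤════════════════╤════════════════╤════════════════╗
║ No. │ Instant           │ Action            │ State     │ Requested      │ Inflight       │ Completed      │ MT                │ MT        │ MT             │ MT             │ MT             ║
║     │                   │                   │           │ Time           │ Time           │ Time           │ Action            │ State     │ Requested      │ Inflight       │ Completed      ║
║     │                   │                   │           │                │                │                │                   │           │ Time           │ Time           │ Time           ║
╠═════╪═══════════════════╪═══════════════════╪═══════════╪════════════════╪════════════════╪════════════════╪═══════════════════╪═══════════╪════════════════╪════════════════╪════════════════╣
...
╟─────┼───────────────────┼───────────────────┼───────────┼────────────────┼────────────────┼────────────────┼───────────────────┼───────────┼────────────────┼────────────────┼────────────────╢
║ 66  │ 20230807155157772 │ -                 │ -         │ -              │ -              │ -              │ rollback          │ COMPLETED │ 08-07 08:51:59 │ 08-07 08:52:00 │ 08-07 08:52:01 ║
║     │                   │                   │           │                │                │                │ Rolls back        │           │                │                │                ║
║     │                   │                   │           │                │                │                │ 20230807154406919 │           │                │                │                ║
╟─────┼───────────────────┼───────────────────┼───────────┼────────────────┼────────────────┼────────────────┼───────────────────┼───────────┼────────────────┼────────────────┼────────────────╢
║ 67  │ 20230807155547486 │ commit            │ INFLIGHT  │ 08-07 08:56:06 │ 08-07 08:58:27 │ -              │ -                 │ -         │ -              │ -              │ -              ║
║     │                   │ Rolled back by    │           │                │                │                │                   │           │                │                │                ║
║     │                   │ 20230807160141230 │
```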
[jira] [Updated] (HUDI-6674) Add rollback info from metadata table in timeline commands
[ https://issues.apache.org/jira/browse/HUDI-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6674: Fix Version/s: 0.14.0 > Add rollback info from metadata table in timeline commands > -- > > Key: HUDI-6674 > URL: https://issues.apache.org/jira/browse/HUDI-6674 > Project: Apache Hudi > Issue Type: New Feature > Reporter: Ethan Guo > Priority: Major > Fix For: 0.14.0
[jira] [Created] (HUDI-6674) Add rollback info from metadata table in timeline commands
Ethan Guo created HUDI-6674: --- Summary: Add rollback info from metadata table in timeline commands Key: HUDI-6674 URL: https://issues.apache.org/jira/browse/HUDI-6674 Project: Apache Hudi Issue Type: New Feature Reporter: Ethan Guo
[jira] [Assigned] (HUDI-6674) Add rollback info from metadata table in timeline commands
[ https://issues.apache.org/jira/browse/HUDI-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6674: --- Assignee: Ethan Guo > Add rollback info from metadata table in timeline commands > -- > > Key: HUDI-6674 > URL: https://issues.apache.org/jira/browse/HUDI-6674 > Project: Apache Hudi > Issue Type: New Feature > Reporter: Ethan Guo > Assignee: Ethan Guo > Priority: Major > Fix For: 0.14.0
[jira] [Updated] (HUDI-6673) Spark SQL core flow test incremental query syntax is wrong
[ https://issues.apache.org/jira/browse/HUDI-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler updated HUDI-6673: -- Status: In Progress (was: Open) > Spark SQL core flow test incremental query syntax is wrong > -- > > Key: HUDI-6673 > URL: https://issues.apache.org/jira/browse/HUDI-6673 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql, tests-ci > Reporter: Jonathan Vexler > Assignee: Jonathan Vexler > Priority: Major > Labels: pull-request-available > > missing the incremental format argument
[jira] [Updated] (HUDI-6673) Spark SQL core flow test incremental query syntax is wrong
[ https://issues.apache.org/jira/browse/HUDI-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler updated HUDI-6673: -- Status: Patch Available (was: In Progress) > Spark SQL core flow test incremental query syntax is wrong > -- > > Key: HUDI-6673 > URL: https://issues.apache.org/jira/browse/HUDI-6673 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql, tests-ci > Reporter: Jonathan Vexler > Assignee: Jonathan Vexler > Priority: Major > Labels: pull-request-available > > missing the incremental format argument
[GitHub] [hudi] jonvex opened a new pull request, #9410: [HUDI-6673] Fix Incremental Query Syntax - Spark SQL Core Flow Test
jonvex opened a new pull request, #9410:
URL: https://github.com/apache/hudi/pull/9410

### Change Logs

Test runs now

### Impact

Testing for release

### Risk level (write none, low medium or high below)

none

### Documentation Update

N/A

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Updated] (HUDI-6673) Spark SQL core flow test incremental query syntax is wrong
[ https://issues.apache.org/jira/browse/HUDI-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6673: - Labels: pull-request-available (was: ) > Spark SQL core flow test incremental query syntax is wrong > -- > > Key: HUDI-6673 > URL: https://issues.apache.org/jira/browse/HUDI-6673 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql, tests-ci > Reporter: Jonathan Vexler > Assignee: Jonathan Vexler > Priority: Major > Labels: pull-request-available > > missing the incremental format argument
[jira] [Created] (HUDI-6673) Spark SQL core flow test incremental query syntax is wrong
Jonathan Vexler created HUDI-6673: - Summary: Spark SQL core flow test incremental query syntax is wrong Key: HUDI-6673 URL: https://issues.apache.org/jira/browse/HUDI-6673 Project: Apache Hudi Issue Type: Bug Components: spark-sql, tests-ci Reporter: Jonathan Vexler Assignee: Jonathan Vexler missing the incremental format argument
[GitHub] [hudi] hudi-bot commented on pull request #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices
hudi-bot commented on PR #9409: URL: https://github.com/apache/hudi/pull/9409#issuecomment-1672081146 ## CI report: * d567d80ea610ed8eca248901d310bd40ae4bf8e5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19230)
[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql
hudi-bot commented on PR #9408: URL: https://github.com/apache/hudi/pull/9408#issuecomment-1672081089 ## CI report: * 65e9f9828da86e4558b1830493ead64366e69fae Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19229)
[GitHub] [hudi] hudi-bot commented on pull request #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices
hudi-bot commented on PR #9409: URL: https://github.com/apache/hudi/pull/9409#issuecomment-1672070399 ## CI report: * d567d80ea610ed8eca248901d310bd40ae4bf8e5 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9407: asyncService log prompt incomplete
hudi-bot commented on PR #9407: URL: https://github.com/apache/hudi/pull/9407#issuecomment-1672058852 ## CI report: * ce0c6dd5877e222dd64ce5ac6434d81168c08727 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19228)
[GitHub] [hudi] emkornfield commented on issue #9355: [SUPPORT] Problem while reading from BQ tables which are synced on Hudi table
emkornfield commented on issue #9355: URL: https://github.com/apache/hudi/issues/9355#issuecomment-1672050574 This sounds like the likely cause. The solution that uses a view for compatibility with Hudi is inherently flawed. Using the newly contributed [manifest file](https://cloud.google.com/bigquery/docs/query-open-table-format-using-manifest-files) approach is going to be more robust along several dimensions.
[jira] [Assigned] (HUDI-6663) Investigate Bootstrap Performance
[ https://issues.apache.org/jira/browse/HUDI-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler reassigned HUDI-6663: - Assignee: Jonathan Vexler > Investigate Bootstrap Performance > - > > Key: HUDI-6663 > URL: https://issues.apache.org/jira/browse/HUDI-6663 > Project: Apache Hudi > Issue Type: Bug >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > Bootstrap performance seems slow even though reader schemas look correct -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6663) Investigate Bootstrap Performance
[ https://issues.apache.org/jira/browse/HUDI-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler updated HUDI-6663: -- Status: In Progress (was: Open) > Investigate Bootstrap Performance > - > > Key: HUDI-6663 > URL: https://issues.apache.org/jira/browse/HUDI-6663 > Project: Apache Hudi > Issue Type: Bug >Reporter: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > Bootstrap performance seems slow even though reader schemas look correct
[jira] [Updated] (HUDI-6663) Investigate Bootstrap Performance
[ https://issues.apache.org/jira/browse/HUDI-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler updated HUDI-6663: -- Status: Patch Available (was: In Progress) > Investigate Bootstrap Performance > - > > Key: HUDI-6663 > URL: https://issues.apache.org/jira/browse/HUDI-6663 > Project: Apache Hudi > Issue Type: Bug >Reporter: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > Bootstrap performance seems slow even though reader schemas look correct
[GitHub] [hudi] jonvex opened a new pull request, #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices
jonvex opened a new pull request, #9409: URL: https://github.com/apache/hudi/pull/9409 ### Change Logs Remove the broadcast when sending the file slices. ### Impact For 1 TB TPC-DS bootstrap queries 1-14, the performance gap between the new file format and fast bootstrap went from 2.23x to 1.01x. Similar performance gains are expected for MOR tables where many file slices have log files. ### Risk level (write none, low medium or high below) low ### Documentation Update N/A ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Updated] (HUDI-6663) Investigate Bootstrap Performance
[ https://issues.apache.org/jira/browse/HUDI-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6663: - Labels: pull-request-available (was: ) > Investigate Bootstrap Performance > - > > Key: HUDI-6663 > URL: https://issues.apache.org/jira/browse/HUDI-6663 > Project: Apache Hudi > Issue Type: Bug >Reporter: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > Bootstrap performance seems slow even though reader schemas look correct
[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql
hudi-bot commented on PR #9408: URL: https://github.com/apache/hudi/pull/9408#issuecomment-1671887323 ## CI report: * 65e9f9828da86e4558b1830493ead64366e69fae Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19229)
[GitHub] [hudi] the-other-tim-brown commented on issue #9355: [SUPPORT] Problem while reading from BQ tables which are synced on Hudi table
the-other-tim-brown commented on issue #9355: URL: https://github.com/apache/hudi/issues/9355#issuecomment-1671851547 @ranjanankur I'm taking a look at this and tracking it with the JIRA ticket here as well: https://issues.apache.org/jira/browse/HUDI-6672 I've reached out to Google Cloud to confirm that this is an issue with updating the manifest while a query is running. The solution I'm working on will version these manifests so we do not modify the file while a query is in flight.
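Editorial sketch of the versioning idea mentioned in the comment above. This is not the fix from HUDI-6672: it only illustrates, on a local filesystem, writing each manifest version to its own file so that an existing manifest is never modified. The class name, the `manifest-<version>.txt` naming scheme, and both helper methods are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

public class VersionedManifestSketch {

  // Hypothetical helper: writes the manifest for `version` to its own file.
  // CREATE_NEW makes the write fail instead of overwriting a file that a
  // running query may still be reading.
  static Path writeManifest(Path dir, long version, List<String> dataFiles) {
    try {
      Files.createDirectories(dir);
      Path manifest = dir.resolve("manifest-" + version + ".txt");
      return Files.write(manifest, dataFiles, StandardCharsets.UTF_8,
          StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Hypothetical helper: reads a manifest back as a list of data file names.
  static List<String> readManifest(Path manifest) {
    try {
      return Files.readAllLines(manifest, StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Path dir = Paths.get(System.getProperty("java.io.tmpdir"), "manifests-" + System.nanoTime());
    Path v1 = writeManifest(dir, 1, Arrays.asList("a.parquet"));
    Path v2 = writeManifest(dir, 2, Arrays.asList("a.parquet", "b.parquet"));
    // Version 1 is untouched by the later sync, so a reader of v1 sees a stable file list.
    System.out.println(readManifest(v1));
    System.out.println(readManifest(v2));
  }
}
```

A reader that resolved version 1 before the second sync keeps seeing the same file list; how the query engine is pointed at the newest version is outside the scope of this sketch.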
[jira] [Created] (HUDI-6672) BigQuery Sync updates while queries running cause failures
Timothy Brown created HUDI-6672: --- Summary: BigQuery Sync updates while queries running cause failures Key: HUDI-6672 URL: https://issues.apache.org/jira/browse/HUDI-6672 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown Issue was reported by the user here: [https://github.com/apache/hudi/issues/9355] It looks like we are updating the underlying manifest file while a query is executing, which causes issues.
[jira] [Assigned] (HUDI-6672) BigQuery Sync updates while queries running cause failures
[ https://issues.apache.org/jira/browse/HUDI-6672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6672: --- Assignee: Timothy Brown > BigQuery Sync updates while queries running cause failures > -- > > Key: HUDI-6672 > URL: https://issues.apache.org/jira/browse/HUDI-6672 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > Issue was reported by the user here: > [https://github.com/apache/hudi/issues/9355] > > It looks like we are updating the underlying manifest file while there is a > query executing causing issues.
[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql
hudi-bot commented on PR #9408: URL: https://github.com/apache/hudi/pull/9408#issuecomment-1671830253 ## CI report: * 65e9f9828da86e4558b1830493ead64366e69fae UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9407: asyncService log prompt incomplete
hudi-bot commented on PR #9407: URL: https://github.com/apache/hudi/pull/9407#issuecomment-1671830173 ## CI report: * ce0c6dd5877e222dd64ce5ac6434d81168c08727 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19228)
[GitHub] [hudi] hudi-bot commented on pull request #9407: asyncService log prompt incomplete
hudi-bot commented on PR #9407: URL: https://github.com/apache/hudi/pull/9407#issuecomment-1671817231 ## CI report: * ce0c6dd5877e222dd64ce5ac6434d81168c08727 UNKNOWN
[jira] [Updated] (HUDI-6671) Support 'alter table add partition' sql
[ https://issues.apache.org/jira/browse/HUDI-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6671: - Labels: pull-request-available (was: ) > Support 'alter table add partition' sql > --- > > Key: HUDI-6671 > URL: https://issues.apache.org/jira/browse/HUDI-6671 > Project: Apache Hudi > Issue Type: Bug > Components: hudi-utilities >Reporter: Wechar >Priority: Major > Labels: pull-request-available > > Hoodie does not support 'add partition' sql now, so we can not get partitions > added by 'add partition' command.
[GitHub] [hudi] wecharyu opened a new pull request, #9408: [HUDI-6671] Support 'alter table add partition' sql
wecharyu opened a new pull request, #9408: URL: https://github.com/apache/hudi/pull/9408 ### Change Logs Hoodie does not currently support 'add partition' SQL, so we cannot get partitions added by the 'add partition' command. In this patch, we implement 'add partition' on the Hoodie side: 1. Add a new command `AlterHoodieTableAddPartitionCommand`. 2. Add a new unit test `TestAlterTableAddPartition`. ### Impact No ### Risk level (write none, low medium or high below) Low ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Created] (HUDI-6671) Support 'alter table add partition' sql
Wechar created HUDI-6671: Summary: Support 'alter table add partition' sql Key: HUDI-6671 URL: https://issues.apache.org/jira/browse/HUDI-6671 Project: Apache Hudi Issue Type: Bug Components: hudi-utilities Reporter: Wechar Hoodie does not support 'add partition' sql now, so we can not get partitions added by 'add partition' command.
[GitHub] [hudi] empcl opened a new pull request, #9407: asyncService log prompt incomplete
empcl opened a new pull request, #9407: URL: https://github.com/apache/hudi/pull/9407 ### Change Logs asyncService log prompt incomplete ### Impact asyncService log prompt incomplete ### Risk level (write none, low medium or high below) none
[GitHub] [hudi] hudi-bot commented on pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores
hudi-bot commented on PR #9395: URL: https://github.com/apache/hudi/pull/9395#issuecomment-1671580470 ## CI report: * f20fe8b171dc78a61639c1eabd7c5e5b4bbac201 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19227)
[GitHub] [hudi] andreacfm commented on issue #9354: [SUPPORT] HoodieDeltaStreamer fails to load org.apache.spark.sql.execution.datasources.Spark33NestedSchemaPruning
andreacfm commented on issue #9354: URL: https://github.com/apache/hudi/issues/9354#issuecomment-1671565665 @ad1happy2go when trying to compile for spark 3.3 I get this error: ``` [ERROR] COMPILATION ERROR : [INFO] - [ERROR] cannot access org.apache.hadoop.shaded.org.apache.avro.reflect.Stringable class file for org.apache.hadoop.shaded.org.apache.avro.reflect.Stringable not found [ERROR] /Users/andrea/code/repos/hudi/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:[240,9] no suitable method found for collect(java.util.stream.Collector,capture#1 of ?,java.util.Map>>) method java.util.stream.Stream.collect(java.util.function.Supplier,java.util.function.BiConsumer,java.util.function.BiConsumer) is not applicable (cannot infer type-variable(s) R (actual and formal argument lists differ in length)) method java.util.stream.Stream.collect(java.util.stream.Collector) is not applicable (cannot infer type-variable(s) R,A (argument mismatch; java.util.stream.Collector,capture#1 of ?,java.util.Map>> cannot be converted to java.util.stream.Collector)) ``` Command is ``` mvn clean package -DskipTests -Dspark3.3 ```
[GitHub] [hudi] stream2000 commented on a diff in pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores
stream2000 commented on code in PR #9395: URL: https://github.com/apache/hudi/pull/9395#discussion_r1288537503 ## hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/common/HoodieFlinkEngineContext.java: ## @@ -102,12 +102,12 @@ public RuntimeContext getRuntimeContext() { @Override public List map(List data, SerializableFunction func, int parallelism) { -return data.stream().parallel().map(throwingMapWrapper(func)).collect(Collectors.toList()); +return stream(data, parallelism).map(throwingMapWrapper(func)).collect(Collectors.toList()); } Review Comment: > @stream2000, the parallelism of `stream().parallel()` is only `Runtime.getRuntime().availableProcessors()` So a parallelism of `availableProcessors` is what actually causes the OOM? If I configure the parallelism as exactly `Runtime.getRuntime().availableProcessors()`, we still get the OOM, right? Correct me if I'm wrong.
[GitHub] [hudi] hudi-bot commented on pull request #9209: [HUDI-6539] New LSM tree style archived timeline
hudi-bot commented on PR #9209: URL: https://github.com/apache/hudi/pull/9209#issuecomment-1671364778 ## CI report: * 8f2dc4ec3e26f1908ae5d15f194bf70ca7dab27e UNKNOWN * 803df61d0d04f7e7403d1177325a365e9bbafab5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19226)
[GitHub] [hudi] SteNicholas commented on a diff in pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores
SteNicholas commented on code in PR #9395: URL: https://github.com/apache/hudi/pull/9395#discussion_r1288493882 ## hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/common/HoodieFlinkEngineContext.java: ## @@ -102,12 +102,12 @@ public RuntimeContext getRuntimeContext() { @Override public List map(List data, SerializableFunction func, int parallelism) { -return data.stream().parallel().map(throwingMapWrapper(func)).collect(Collectors.toList()); +return stream(data, parallelism).map(throwingMapWrapper(func)).collect(Collectors.toList()); } Review Comment: @stream2000, the parallelism of `stream().parallel()` is only `Runtime.getRuntime().availableProcessors()`, so when the parallelism is 1000, using `stream().parallel()` will cause an `OutOfMemoryError`.
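Editorial sketch of the kind of `stream(data, parallelism)` helper the diff above refers to. The PR's actual implementation is not shown in this thread, so the selection rule below (go parallel only when the requested parallelism is greater than 1 and no larger than the CPU core count) is an assumption inferred from the PR title, and the class and method bodies are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamParallelismSketch {

  // Hypothetical helper, not Hudi's code: returns a parallel stream only when the
  // requested parallelism is greater than 1 and fits within the CPU core count;
  // otherwise falls back to a sequential stream.
  static <T> Stream<T> stream(List<T> data, int parallelism) {
    int cores = Runtime.getRuntime().availableProcessors();
    boolean parallel = parallelism > 1 && parallelism <= cores;
    return parallel ? data.parallelStream() : data.stream();
  }

  // Mirrors the shape of the map(...) call in the diff, using a plain Function
  // instead of Hudi's SerializableFunction.
  static <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism) {
    return stream(data, parallelism).map(func).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Integer> input = Arrays.asList(1, 2, 3, 4);
    List<Integer> doubled = map(input, x -> x * 2, 1000);
    System.out.println(doubled); // prints [2, 4, 6, 8]
  }
}
```

Either branch produces the same list, because `Collectors.toList()` preserves encounter order for parallel streams as well; only the execution mode differs.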
[GitHub] [hudi] SteNicholas commented on a diff in pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores
SteNicholas commented on code in PR #9395: URL: https://github.com/apache/hudi/pull/9395#discussion_r1288493882 ## hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/common/HoodieFlinkEngineContext.java: ## @@ -102,12 +102,12 @@ public RuntimeContext getRuntimeContext() { @Override public List map(List data, SerializableFunction func, int parallelism) { -return data.stream().parallel().map(throwingMapWrapper(func)).collect(Collectors.toList()); +return stream(data, parallelism).map(throwingMapWrapper(func)).collect(Collectors.toList()); } Review Comment: @stream2000, the parallelism of `stream().parallel()` is only `Runtime.getRuntime().availableProcessors()`.
[GitHub] [hudi] hudi-bot commented on pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores
hudi-bot commented on PR #9395: URL: https://github.com/apache/hudi/pull/9395#issuecomment-1671265476 ## CI report: * a60f7f89b5377119bf8bef6c7ddfd0dc821de1fc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19213) * f20fe8b171dc78a61639c1eabd7c5e5b4bbac201 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19227)
[GitHub] [hudi] hudi-bot commented on pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores
hudi-bot commented on PR #9395: URL: https://github.com/apache/hudi/pull/9395#issuecomment-1671200936 ## CI report: * a60f7f89b5377119bf8bef6c7ddfd0dc821de1fc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19213) * f20fe8b171dc78a61639c1eabd7c5e5b4bbac201 UNKNOWN
[GitHub] [hudi] stream2000 commented on a diff in pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores
stream2000 commented on code in PR #9395: URL: https://github.com/apache/hudi/pull/9395#discussion_r1288371818 ## hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/common/HoodieFlinkEngineContext.java: ## @@ -102,12 +102,12 @@ public RuntimeContext getRuntimeContext() { @Override public List map(List data, SerializableFunction func, int parallelism) { -return data.stream().parallel().map(throwingMapWrapper(func)).collect(Collectors.toList()); +return stream(data, parallelism).map(throwingMapWrapper(func)).collect(Collectors.toList()); } Review Comment: When the parallelism is 1000, will we run it with a parallelism of 1000 or of `Runtime.getRuntime().availableProcessors()`? If we just use `Runtime.getRuntime().availableProcessors()` as the parallelism, will it cause an `OutOfMemoryError`, since it is not actually a large parallelism?
[GitHub] [hudi] hudi-bot commented on pull request #9403: Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource
hudi-bot commented on PR #9403: URL: https://github.com/apache/hudi/pull/9403#issuecomment-1671187660 ## CI report: * 55da0942b542c664e49c7ab9ca9698dfbf67968e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19224)
[GitHub] [hudi] SteNicholas commented on pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores
SteNicholas commented on PR #9395: URL: https://github.com/apache/hudi/pull/9395#issuecomment-1671137882 @stream2000, thanks for the fix. I have rebased onto the latest master branch. cc @danny0405.
[GitHub] [hudi] SteNicholas commented on a diff in pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores
SteNicholas commented on code in PR #9395: URL: https://github.com/apache/hudi/pull/9395#discussion_r1288326850 ## hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/common/HoodieFlinkEngineContext.java: ## @@ -102,12 +102,12 @@ public RuntimeContext getRuntimeContext() { @Override public List map(List data, SerializableFunction func, int parallelism) { -return data.stream().parallel().map(throwingMapWrapper(func)).collect(Collectors.toList()); +return stream(data, parallelism).map(throwingMapWrapper(func)).collect(Collectors.toList()); } Review Comment: @stream2000, I don't think only the map function should run sequentially. In order to reduce the OOM risk caused by parallelization, all functions should be handled in this way. For example, when the parallelism is 1000, the `CleanPlanActionExecutor` uses the map function and may cause an `OutOfMemoryError` for many file groups in `stream().parallel()`, with a stack trace like the one in the description above. BTW, the parallel stream improves performance only a little, so this change does not give up much of that improvement.
[GitHub] [hudi] hudi-bot commented on pull request #9209: [HUDI-6539] New LSM tree style archived timeline
hudi-bot commented on PR #9209: URL: https://github.com/apache/hudi/pull/9209#issuecomment-1671129502 ## CI report: * 8f2dc4ec3e26f1908ae5d15f194bf70ca7dab27e UNKNOWN * 57c1b843608a9b63d143ead5dd5168613bb13969 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19027) * 803df61d0d04f7e7403d1177325a365e9bbafab5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19226)
[GitHub] [hudi] hudi-bot commented on pull request #9209: [HUDI-6539] New LSM tree style archived timeline
hudi-bot commented on PR #9209: URL: https://github.com/apache/hudi/pull/9209#issuecomment-1671117746 ## CI report: * 8f2dc4ec3e26f1908ae5d15f194bf70ca7dab27e UNKNOWN * 57c1b843608a9b63d143ead5dd5168613bb13969 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19027) * 803df61d0d04f7e7403d1177325a365e9bbafab5 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9405: [HUDI-6670] Fix timeline check in metadata table validator
hudi-bot commented on PR #9405: URL: https://github.com/apache/hudi/pull/9405#issuecomment-1671103541 ## CI report: * fc027c28476d50737566c3b714a4d58c38c39ff9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19222)
[GitHub] [hudi] danny0405 commented on a diff in pull request #9209: [HUDI-6539] New LSM tree style archived timeline
danny0405 commented on code in PR #9209: URL: https://github.com/apache/hudi/pull/9209#discussion_r1288283343 ## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java: ## @@ -18,75 +18,127 @@ package org.apache.hudi.common.table.timeline; -import org.apache.hudi.avro.HoodieAvroUtils; -import org.apache.hudi.avro.model.HoodieArchivedMetaEntry; -import org.apache.hudi.avro.model.HoodieMergeArchiveFilePlan; -import org.apache.hudi.common.fs.HoodieWrapperFileSystem; -import org.apache.hudi.common.model.HoodieLogFile; -import org.apache.hudi.common.model.HoodiePartitionMetadata; -import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.avro.model.HoodieArchivedInstant; +import org.apache.hudi.common.model.HoodieArchivedManifest; import org.apache.hudi.common.model.HoodieRecord.HoodieRecordType; import org.apache.hudi.common.table.HoodieTableMetaClient; -import org.apache.hudi.common.table.log.HoodieLogFormat; -import org.apache.hudi.common.table.log.block.HoodieAvroDataBlock; -import org.apache.hudi.common.table.log.block.HoodieLogBlock; -import org.apache.hudi.common.util.collection.ClosableIterator; +import org.apache.hudi.common.util.ArchivedInstantReadSchemas; import org.apache.hudi.common.util.CollectionUtils; import org.apache.hudi.common.util.FileIOUtils; import org.apache.hudi.common.util.Option; -import org.apache.hudi.common.util.StringUtils; +import org.apache.hudi.common.util.collection.ClosableIterator; +import org.apache.hudi.exception.HoodieException; import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.io.storage.HoodieAvroParquetReader; +import org.apache.hudi.io.storage.HoodieFileReaderFactory; +import org.apache.avro.Schema; import org.apache.avro.generic.GenericRecord; import org.apache.avro.generic.IndexedRecord; import org.apache.hadoop.fs.FileStatus; import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.PathFilter; import org.slf4j.Logger; import 
org.slf4j.LoggerFactory; -import javax.annotation.Nonnull; +import javax.annotation.Nullable; import java.io.IOException; import java.io.Serializable; +import java.nio.ByteBuffer; import java.nio.charset.StandardCharsets; import java.util.ArrayList; import java.util.Arrays; import java.util.Collections; -import java.util.Comparator; -import java.util.HashMap; -import java.util.HashSet; import java.util.List; import java.util.Map; import java.util.Set; -import java.util.Spliterator; -import java.util.Spliterators; +import java.util.concurrent.ConcurrentHashMap; import java.util.function.Function; import java.util.regex.Matcher; import java.util.regex.Pattern; -import java.util.stream.StreamSupport; +import java.util.stream.Collectors; /** - * Represents the Archived Timeline for the Hoodie table. Instants for the last 12 hours (configurable) is in the - * ActiveTimeline and the rest are in ArchivedTimeline. - * - * - * Instants are read from the archive file during initialization and never refreshed. To refresh, clients need to call - * reload() - * - * - * This class can be serialized and de-serialized and on de-serialization the FileSystem is re-initialized. + * Represents the Archived Timeline for the Hoodie table. + * + * After several instants are accumulated as a batch on the active timeline, they would be archived as a parquet file into the archived timeline. + * In general the archived timeline is comprised with parquet files with LSM style file layout. Each new operation to the archived timeline generates + * a new snapshot version. Theoretically, there could be multiple snapshot versions on the archived timeline. + * + * The Archived Timeline Layout + * + * + * t111, t112 ... t120 ... -> + * \ / + *\/ + *| + *V + * t111_t120_0.parquet, t101_t110_0.parquet,... t11_t20_0.parquetL0 + * \/ + * \ / + *| + *V + *t11_t100_1.parquetL1 + * + * manifest_1, manifest_2, ... 
manifest_12 + * | + * V + * _version_ + * + * + * The LSM Tree Compaction + * Use the universal compaction strategy, that is: when N(by default 10) number of parquet files exist in the current layer, they are merged and flush as a compacted file in the next layer. + * We have no limit for the layer number, assumes there are 10 instants for each file in L0, there could be 100 instants per file in L1, + * so 3000 instants could be represented as 3 parquets in L2, it is pretty fast if we use concurrent read. + * + * The benchmark shows 1000 instants read cost about 10 ms. Review Comment: done -- This is
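The layer arithmetic in the Javadoc quoted above can be checked with a short sketch. It assumes 10 instants per L0 file and a merge fan-in of 10, as the comment states. The class `LsmLayerSketch` and its methods are hypothetical illustrations, not Hudi APIs, and the file-name pattern is inferred only from the `t111_t120_0.parquet` example in the diagram.

```java
// Hypothetical sketch of the layer arithmetic described in the Javadoc; not Hudi code.
final class LsmLayerSketch {

  /** Instants held by one file at the given layer: L0 size multiplied by fanIn once per layer. */
  static long instantsPerFile(int instantsPerL0File, int fanIn, int layer) {
    long instants = instantsPerL0File;
    for (int i = 0; i < layer; i++) {
      instants *= fanIn;
    }
    return instants;
  }

  /** Number of files at the given layer needed to hold totalInstants (rounded up). */
  static long filesNeeded(long totalInstants, int instantsPerL0File, int fanIn, int layer) {
    long perFile = instantsPerFile(instantsPerL0File, fanIn, layer);
    return (totalInstants + perFile - 1) / perFile;
  }

  /** Assumed naming pattern: minInstant_maxInstant_layer.parquet, as in the diagram. */
  static String fileName(String minInstant, String maxInstant, int layer) {
    return minInstant + "_" + maxInstant + "_" + layer + ".parquet";
  }

  public static void main(String[] args) {
    System.out.println(instantsPerFile(10, 10, 1));   // instants per L1 file
    System.out.println(filesNeeded(3000, 10, 10, 2)); // L2 files for 3000 instants
    System.out.println(fileName("t111", "t120", 0));
  }
}
```

With these assumptions an L1 file covers 100 instants and an L2 file covers 1000, which matches the comment's claim that 3000 instants fit in 3 parquet files in L2.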
[GitHub] [hudi] stream2000 commented on a diff in pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores
stream2000 commented on code in PR #9395: URL: https://github.com/apache/hudi/pull/9395#discussion_r1288224103 ## hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/common/HoodieFlinkEngineContext.java: ## @@ -102,12 +102,12 @@ public RuntimeContext getRuntimeContext() { @Override public List map(List data, SerializableFunction func, int parallelism) { -return data.stream().parallel().map(throwingMapWrapper(func)).collect(Collectors.toList()); +return stream(data, parallelism).map(throwingMapWrapper(func)).collect(Collectors.toList()); } Review Comment: Hi @SteNicholas, correct me if I'm wrong. Do we just run the map function sequentially when the parallelism is larger than the default parallelism? And why is there an `OutOfMemoryError` risk when we use `stream().parallel()`, which only submits some future tasks to the default thread pool and may not cost a lot of memory?
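For context on the question above: `data.stream().parallel()` ignores the `parallelism` argument and runs on the JDK's common `ForkJoinPool`. The body of the PR's `stream(data, parallelism)` helper is not shown in this thread, so the following is only one possible way to honor a requested parallelism without exceeding the CPU core count. The class `BoundedParallelMap` is hypothetical. Running a parallel stream inside a dedicated `ForkJoinPool` is a widely used idiom, but it relies on implementation behavior of the stream library rather than a documented guarantee.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch; not the helper added by the PR.
final class BoundedParallelMap {

  static <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism) {
    int cores = Runtime.getRuntime().availableProcessors();
    // Assumption: cap the requested parallelism at the core count, and never go below 1.
    int effective = Math.max(1, Math.min(parallelism, cores));
    if (effective == 1) {
      // Nothing to gain from a pool: run sequentially on the calling thread.
      return data.stream().map(func).collect(Collectors.toList());
    }
    ForkJoinPool pool = new ForkJoinPool(effective);
    try {
      Callable<List<O>> task = () -> data.parallelStream().map(func).collect(Collectors.toList());
      return pool.submit(task).get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while mapping", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Mapping failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(map(Arrays.asList(1, 2, 3, 4), x -> x * 2, 2));
  }
}
```

Because the source list is ordered, `collect(Collectors.toList())` keeps the input order even when the work is split across threads.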
[GitHub] [hudi] stream2000 commented on pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores
stream2000 commented on PR #9395: URL: https://github.com/apache/hudi/pull/9395#issuecomment-1670984818 @SteNicholas Hi, sorry for the CI failure that was introduced. Now we can rebase onto the latest master and test.
[GitHub] [hudi] leesf merged pull request #9401: [MINOR] Fix consistent hashing bucket index it failure
leesf merged PR #9401: URL: https://github.com/apache/hudi/pull/9401
[hudi] branch master updated: [MINOR] Fix consistent hashing bucket index FT failure (#9401)
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 9b22583dbe0 [MINOR] Fix consistent hashing bucket index FT failure (#9401)
9b22583dbe0 is described below

commit 9b22583dbe089df1c0014ee88f250a3e516667ce
Author: StreamingFlames <18889897...@163.com>
AuthorDate: Wed Aug 9 17:26:57 2023 +0800

    [MINOR] Fix consistent hashing bucket index FT failure (#9401)
---
 .../org/apache/hudi/client/functional/TestConsistentBucketIndex.java | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestConsistentBucketIndex.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestConsistentBucketIndex.java
index 01b05f07642..b23259c1264 100644
--- a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestConsistentBucketIndex.java
+++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestConsistentBucketIndex.java
@@ -228,8 +228,8 @@ public class TestConsistentBucketIndex extends HoodieSparkClientTestHarness {
     Assertions.assertEquals(numFilesCreated, Arrays.stream(dataGen.getPartitionPaths()).mapToInt(p -> Objects.requireNonNull(listStatus(p, true)).length).sum());
-    // BulkInsert again.
-    writeData(writeRecords, "002", WriteOperationType.BULK_INSERT,true);
+    // Upsert Data
+    writeData(writeRecords, "002", WriteOperationType.UPSERT,true);
     // The total number of file group should be the same, but each file group will have a log file.
     Assertions.assertEquals(numFilesCreated, Arrays.stream(dataGen.getPartitionPaths()).mapToInt(p -> Objects.requireNonNull(listStatus(p, true)).length).sum());
[GitHub] [hudi] leesf commented on pull request #9401: [MINOR] Fix consistent hashing bucket index it failure
leesf commented on PR #9401: URL: https://github.com/apache/hudi/pull/9401#issuecomment-1670980192 +1 as the FT spark-client passed
[GitHub] [hudi] danny0405 commented on a diff in pull request #9209: [HUDI-6539] New LSM tree style archived timeline
danny0405 commented on code in PR #9209: URL: https://github.com/apache/hudi/pull/9209#discussion_r1288178605 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/ArchivedTimelineWriter.java: ## @@ -0,0 +1,382 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.client.utils; + +import org.apache.hudi.avro.model.HoodieArchivedInstant; +import org.apache.hudi.common.engine.HoodieEngineContext; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.HoodieArchivedManifest; +import org.apache.hudi.common.model.HoodieAvroIndexedRecord; +import org.apache.hudi.common.model.HoodieFileFormat; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.VisibleForTesting; +import org.apache.hudi.common.util.collection.ClosableIterator; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.exception.HoodieCommitException; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.io.storage.HoodieAvroParquetReader; +import org.apache.hudi.io.storage.HoodieFileReaderFactory; +import org.apache.hudi.io.storage.HoodieFileWriter; +import org.apache.hudi.io.storage.HoodieFileWriterFactory; +import org.apache.hudi.table.HoodieTable; +import org.apache.hudi.table.marker.WriteMarkers; +import org.apache.hudi.table.marker.WriteMarkersFactory; + +import org.apache.avro.Schema; +import org.apache.avro.generic.IndexedRecord; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.nio.charset.StandardCharsets; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Comparator; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.stream.Collectors; + +/** + * An archived timeline writer which organizes the files as an LSM tree. 
+ */ +public class ArchivedTimelineWriter { + private static final Logger LOG = LoggerFactory.getLogger(ArchivedTimelineWriter.class); + + private final HoodieWriteConfig config; + private final HoodieTable table; + private final HoodieTableMetaClient metaClient; + + private HoodieWriteConfig writeConfig; + + private ArchivedTimelineWriter(HoodieWriteConfig config, HoodieTable table) { +this.config = config; +this.table = table; +this.metaClient = table.getMetaClient(); + } + + public static ArchivedTimelineWriter getInstance(HoodieWriteConfig config, HoodieTable table) { +return new ArchivedTimelineWriter(config, table); + } + + public void write(HoodieEngineContext context, List instants) throws HoodieCommitException { +Path filePath = new Path(metaClient.getArchivePath(), +newFileName(instants.get(0).getInstantTime(), instants.get(instants.size() - 1).getInstantTime(), HoodieArchivedTimeline.FILE_LAYER_ZERO)); +try (HoodieFileWriter writer = openWriter(filePath)) { + Schema wrapperSchema = HoodieArchivedInstant.getClassSchema(); + LOG.info("Archiving schema " + wrapperSchema.toString()); + for (ActiveInstant triple : instants) { +try { + deleteAnyLeftOverMarkers(context, triple); + // in local FS and HDFS, there could be empty completed instants due to crash. + final HoodieArchivedInstant metaEntry = MetadataConversionUtils.createArchivedInstant(triple, metaClient); + writer.write(metaEntry.getInstantTime(), new HoodieAvroIndexedRecord(metaEntry), wrapperSchema); +} catch (Exception e) { + LOG.error("Failed to archive instant: " + triple.getInstantTime(), e); + if (this.config.isFailOnTimelineArchivingEnabled()) { +throw e; + } +} + } + updateManifest(filePath.getName()); +} catch (Exception e) { + throw new HoodieCommitException("Failed to archive commits", e); +} + } + + public void updateManifest(String
[jira] [Comment Edited] (HUDI-3425) Clean up spill path created by Hudi during uneventful shutdown
[ https://issues.apache.org/jira/browse/HUDI-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752332#comment-17752332 ] Xinglong Wang edited comment on HUDI-3425 at 8/9/23 8:57 AM: - {{I have encountered the same problem. I am using Flink on Yarn. When the job executes compaction but encounters an abnormal situation (for example, container is running beyond physical memory limits or other exceptions) and performs a full-restart, if `HoodieMergedLogRecordScanner` is still scanning log files at this time, and `ExternalSpillableMap#close()` is not executed to clean up, resulting in the accumulation of spillable map files in the /tmp directory, and eventually the disk is exhausted.}} {{Now I set `hoodie.memory.spillable.map.path` to the `$PWD/spillable-map/` directory when Yarn container launches, environment variable `PWD` is exported in `launch_container.sh`, so that the spillable map files will be cleaned up when the container is closed.}} was (Author: JIRAUSER295509): I have encountered the same problem. I am using Flink on Yarn. When the job executes compaction but encounters an abnormal situation (for example, container is running beyond physical memory limits or other exceptions) and performs a full-restart, if `HoodieMergedLogRecordScanner` is still scanning log files at this time, and `ExternalSpillableMap#close()` is not executed to clean up, resulting in the accumulation of spillable map files in the /tmp directory, and eventually the disk is exhausted. Now I set `hoodie.memory.spillable.map.path` to the `$PWD/spillable-map/` directory when Yarn container launches, environment variable `PWD` is exported in `launch_container.sh`, so that the spillable map files will be cleaned up when the container is closed. 
> Clean up spill path created by Hudi during uneventful shutdown > -- > > Key: HUDI-3425 > URL: https://issues.apache.org/jira/browse/HUDI-3425 > Project: Apache Hudi > Issue Type: Improvement > Components: compaction >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Critical > Fix For: 0.12.1 > > > h1. Hudi spill path not getting cleared when containers getting killed > abruptly. > > When yarn kills the containers abruptly for any reason while hudi stage is in > progress then the spill path created by hudi on the disk is not cleaned and > as a result of which the nodes on the cluster start running out of space. We > need to clear the spill path manually to free out disk space. > > Ref issue: https://github.com/apache/hudi/issues/4771 -- This message was sent by Atlassian Jira (v8.20.10#820010)
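The workaround described in the comment above can be sketched as follows. This is only an illustration of resolving the spill directory from the container's `PWD` and placing it under the `hoodie.memory.spillable.map.path` key named in the comment. The class `SpillPathSketch` is hypothetical, the fallback to `java.io.tmpdir` when `PWD` is unset is an assumption made here, and how the resulting properties reach the Flink job depends on the deployment.

```java
import java.nio.file.Paths;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch of the workaround; not part of Hudi.
final class SpillPathSketch {

  // Config key quoted in the Jira comment.
  static final String SPILLABLE_MAP_PATH_KEY = "hoodie.memory.spillable.map.path";

  /** Resolves <PWD>/spillable-map; falls back to java.io.tmpdir (an assumption) if PWD is unset. */
  static String resolveSpillDir(Map<String, String> env) {
    String pwd = env.get("PWD");
    String base = (pwd == null || pwd.isEmpty()) ? System.getProperty("java.io.tmpdir") : pwd;
    return Paths.get(base, "spillable-map").toString();
  }

  /** Builds a properties object carrying the spill path for the writer configuration. */
  static Properties withSpillPath(Map<String, String> env) {
    Properties props = new Properties();
    props.setProperty(SPILLABLE_MAP_PATH_KEY, resolveSpillDir(env));
    return props;
  }

  public static void main(String[] args) {
    System.out.println(withSpillPath(System.getenv()));
  }
}
```

Placing the spill directory under the container's working directory means YARN's own cleanup of that directory also removes leftover spillable map files, which is the effect the commenter reports.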
[GitHub] [hudi] aib628 commented on issue #8848: [SUPPORT] Hive Sync tool fails to sync Hoodi table written using Flink 1.16 to HMS
aib628 commented on issue #8848: URL: https://github.com/apache/hudi/issues/8848#issuecomment-1670938476 @danny0405 Yeah, I'm using Hadoop 3.1.0 + Hive 3.1.2, packaged from source and deployed with the Docker image 'apachehudi/hudi-hadoop_3.1.0-hive_3.1.2:latest'.
[GitHub] [hudi] danny0405 commented on a diff in pull request #9209: [HUDI-6539] New LSM tree style archived timeline
danny0405 commented on code in PR #9209: URL: https://github.com/apache/hudi/pull/9209#discussion_r1288171581 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/ArchivedTimelineWriter.java: ## @@ -0,0 +1,382 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.client.utils; + +import org.apache.hudi.avro.model.HoodieArchivedInstant; +import org.apache.hudi.common.engine.HoodieEngineContext; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.HoodieArchivedManifest; +import org.apache.hudi.common.model.HoodieAvroIndexedRecord; +import org.apache.hudi.common.model.HoodieFileFormat; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.VisibleForTesting; +import org.apache.hudi.common.util.collection.ClosableIterator; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.exception.HoodieCommitException; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.io.storage.HoodieAvroParquetReader; +import org.apache.hudi.io.storage.HoodieFileReaderFactory; +import org.apache.hudi.io.storage.HoodieFileWriter; +import org.apache.hudi.io.storage.HoodieFileWriterFactory; +import org.apache.hudi.table.HoodieTable; +import org.apache.hudi.table.marker.WriteMarkers; +import org.apache.hudi.table.marker.WriteMarkersFactory; + +import org.apache.avro.Schema; +import org.apache.avro.generic.IndexedRecord; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.nio.charset.StandardCharsets; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Comparator; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.stream.Collectors; + +/** + * An archived timeline writer which organizes the files as an LSM tree. 
+ */ +public class ArchivedTimelineWriter { + private static final Logger LOG = LoggerFactory.getLogger(ArchivedTimelineWriter.class); + + private final HoodieWriteConfig config; + private final HoodieTable table; + private final HoodieTableMetaClient metaClient; + + private HoodieWriteConfig writeConfig; + + private ArchivedTimelineWriter(HoodieWriteConfig config, HoodieTable table) { +this.config = config; +this.table = table; +this.metaClient = table.getMetaClient(); + } + + public static ArchivedTimelineWriter getInstance(HoodieWriteConfig config, HoodieTable table) { +return new ArchivedTimelineWriter(config, table); + } + + public void write(HoodieEngineContext context, List instants) throws HoodieCommitException { +Path filePath = new Path(metaClient.getArchivePath(), +newFileName(instants.get(0).getInstantTime(), instants.get(instants.size() - 1).getInstantTime(), HoodieArchivedTimeline.FILE_LAYER_ZERO)); +try (HoodieFileWriter writer = openWriter(filePath)) { + Schema wrapperSchema = HoodieArchivedInstant.getClassSchema(); + LOG.info("Archiving schema " + wrapperSchema.toString()); + for (ActiveInstant triple : instants) { +try { + deleteAnyLeftOverMarkers(context, triple); + // in local FS and HDFS, there could be empty completed instants due to crash. + final HoodieArchivedInstant metaEntry = MetadataConversionUtils.createArchivedInstant(triple, metaClient); + writer.write(metaEntry.getInstantTime(), new HoodieAvroIndexedRecord(metaEntry), wrapperSchema); +} catch (Exception e) { + LOG.error("Failed to archive instant: " + triple.getInstantTime(), e); + if (this.config.isFailOnTimelineArchivingEnabled()) { +throw e; + } +} + } + updateManifest(filePath.getName()); +} catch (Exception e) { + throw new HoodieCommitException("Failed to archive commits", e); +} + } + + public void updateManifest(String
[GitHub] [hudi] danny0405 commented on a diff in pull request #9209: [HUDI-6539] New LSM tree style archived timeline
danny0405 commented on code in PR #9209: URL: https://github.com/apache/hudi/pull/9209#discussion_r1288170683 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/ArchivedTimelineWriter.java: ## @@ -0,0 +1,382 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.client.utils; + +import org.apache.hudi.avro.model.HoodieArchivedInstant; +import org.apache.hudi.common.engine.HoodieEngineContext; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.HoodieArchivedManifest; +import org.apache.hudi.common.model.HoodieAvroIndexedRecord; +import org.apache.hudi.common.model.HoodieFileFormat; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.VisibleForTesting; +import org.apache.hudi.common.util.collection.ClosableIterator; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.exception.HoodieCommitException; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.io.storage.HoodieAvroParquetReader; +import org.apache.hudi.io.storage.HoodieFileReaderFactory; +import org.apache.hudi.io.storage.HoodieFileWriter; +import org.apache.hudi.io.storage.HoodieFileWriterFactory; +import org.apache.hudi.table.HoodieTable; +import org.apache.hudi.table.marker.WriteMarkers; +import org.apache.hudi.table.marker.WriteMarkersFactory; + +import org.apache.avro.Schema; +import org.apache.avro.generic.IndexedRecord; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.nio.charset.StandardCharsets; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Comparator; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.stream.Collectors; + +/** + * An archived timeline writer which organizes the files as an LSM tree. 
+ */ +public class ArchivedTimelineWriter { + private static final Logger LOG = LoggerFactory.getLogger(ArchivedTimelineWriter.class); + + private final HoodieWriteConfig config; + private final HoodieTable table; + private final HoodieTableMetaClient metaClient; + + private HoodieWriteConfig writeConfig; + + private ArchivedTimelineWriter(HoodieWriteConfig config, HoodieTable table) { +this.config = config; +this.table = table; +this.metaClient = table.getMetaClient(); + } + + public static ArchivedTimelineWriter getInstance(HoodieWriteConfig config, HoodieTable table) { +return new ArchivedTimelineWriter(config, table); + } + + public void write(HoodieEngineContext context, List instants) throws HoodieCommitException { +Path filePath = new Path(metaClient.getArchivePath(), +newFileName(instants.get(0).getInstantTime(), instants.get(instants.size() - 1).getInstantTime(), HoodieArchivedTimeline.FILE_LAYER_ZERO)); +try (HoodieFileWriter writer = openWriter(filePath)) { + Schema wrapperSchema = HoodieArchivedInstant.getClassSchema(); + LOG.info("Archiving schema " + wrapperSchema.toString()); + for (ActiveInstant triple : instants) { +try { + deleteAnyLeftOverMarkers(context, triple); + // in local FS and HDFS, there could be empty completed instants due to crash. + final HoodieArchivedInstant metaEntry = MetadataConversionUtils.createArchivedInstant(triple, metaClient); + writer.write(metaEntry.getInstantTime(), new HoodieAvroIndexedRecord(metaEntry), wrapperSchema); +} catch (Exception e) { + LOG.error("Failed to archive instant: " + triple.getInstantTime(), e); + if (this.config.isFailOnTimelineArchivingEnabled()) { +throw e; + } +} + } + updateManifest(filePath.getName()); +} catch (Exception e) { + throw new HoodieCommitException("Failed to archive commits", e); +} + } + + public void updateManifest(String
[jira] [Commented] (HUDI-3425) Clean up spill path created by Hudi during uneventful shutdown
[ https://issues.apache.org/jira/browse/HUDI-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752332#comment-17752332 ] Xinglong Wang commented on HUDI-3425: - I have encountered the same problem. I am using Flink on Yarn. When the job executes compaction but encounters an abnormal situation (for example, container is running beyond physical memory limits or other exceptions) and performs a full-restart, if `HoodieMergedLogRecordScanner` is still scanning log files at this time, and `ExternalSpillableMap#close()` is not executed to clean up, resulting in the accumulation of spillable map files in the /tmp directory, and eventually the disk is exhausted. Now I set `hoodie.memory.spillable.map.path` to the `$PWD/spillable-map/` directory when Yarn container launches, environment variable `PWD` is exported in `launch_container.sh`, so that the spillable map files will be cleaned up when the container is closed. > Clean up spill path created by Hudi during uneventful shutdown > -- > > Key: HUDI-3425 > URL: https://issues.apache.org/jira/browse/HUDI-3425 > Project: Apache Hudi > Issue Type: Improvement > Components: compaction >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Critical > Fix For: 0.12.1 > > > h1. Hudi spill path not getting cleared when containers getting killed > abruptly. > > When yarn kills the containers abruptly for any reason while hudi stage is in > progress then the spill path created by hudi on the disk is not cleaned and > as a result of which the nodes on the cluster start running out of space. We > need to clear the spill path manually to free out disk space. > > Ref issue: https://github.com/apache/hudi/issues/4771 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] danny0405 commented on a diff in pull request #9209: [HUDI-6539] New LSM tree style archived timeline
danny0405 commented on code in PR #9209: URL: https://github.com/apache/hudi/pull/9209#discussion_r1288167338 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/ArchivedTimelineWriter.java: ## @@ -0,0 +1,382 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.client.utils; + +import org.apache.hudi.avro.model.HoodieArchivedInstant; +import org.apache.hudi.common.engine.HoodieEngineContext; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.HoodieArchivedManifest; +import org.apache.hudi.common.model.HoodieAvroIndexedRecord; +import org.apache.hudi.common.model.HoodieFileFormat; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.VisibleForTesting; +import org.apache.hudi.common.util.collection.ClosableIterator; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.exception.HoodieCommitException; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.io.storage.HoodieAvroParquetReader; +import org.apache.hudi.io.storage.HoodieFileReaderFactory; +import org.apache.hudi.io.storage.HoodieFileWriter; +import org.apache.hudi.io.storage.HoodieFileWriterFactory; +import org.apache.hudi.table.HoodieTable; +import org.apache.hudi.table.marker.WriteMarkers; +import org.apache.hudi.table.marker.WriteMarkersFactory; + +import org.apache.avro.Schema; +import org.apache.avro.generic.IndexedRecord; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.nio.charset.StandardCharsets; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Comparator; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.stream.Collectors; + +/** + * An archived timeline writer which organizes the files as an LSM tree. 
+ */ +public class ArchivedTimelineWriter { + private static final Logger LOG = LoggerFactory.getLogger(ArchivedTimelineWriter.class); + + private final HoodieWriteConfig config; + private final HoodieTable table; + private final HoodieTableMetaClient metaClient; + + private HoodieWriteConfig writeConfig; + + private ArchivedTimelineWriter(HoodieWriteConfig config, HoodieTable table) { +this.config = config; +this.table = table; +this.metaClient = table.getMetaClient(); + } + + public static ArchivedTimelineWriter getInstance(HoodieWriteConfig config, HoodieTable table) { +return new ArchivedTimelineWriter(config, table); + } + + public void write(HoodieEngineContext context, List instants) throws HoodieCommitException { +Path filePath = new Path(metaClient.getArchivePath(), +newFileName(instants.get(0).getInstantTime(), instants.get(instants.size() - 1).getInstantTime(), HoodieArchivedTimeline.FILE_LAYER_ZERO)); +try (HoodieFileWriter writer = openWriter(filePath)) { + Schema wrapperSchema = HoodieArchivedInstant.getClassSchema(); + LOG.info("Archiving schema " + wrapperSchema.toString()); + for (ActiveInstant triple : instants) { +try { + deleteAnyLeftOverMarkers(context, triple); + // in local FS and HDFS, there could be empty completed instants due to crash. + final HoodieArchivedInstant metaEntry = MetadataConversionUtils.createArchivedInstant(triple, metaClient); + writer.write(metaEntry.getInstantTime(), new HoodieAvroIndexedRecord(metaEntry), wrapperSchema); +} catch (Exception e) { + LOG.error("Failed to archive instant: " + triple.getInstantTime(), e); + if (this.config.isFailOnTimelineArchivingEnabled()) { +throw e; + } +} + } + updateManifest(filePath.getName()); +} catch (Exception e) { + throw new HoodieCommitException("Failed to archive commits", e); +} + } + + public void updateManifest(String
[GitHub] [hudi] danny0405 commented on a diff in pull request #9209: [HUDI-6539] New LSM tree style archived timeline
danny0405 commented on code in PR #9209:
URL: https://github.com/apache/hudi/pull/9209#discussion_r1288155900

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/ArchivedTimelineWriter.java: ##
@@ -0,0 +1,382 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.client.utils;
+
+import org.apache.hudi.avro.model.HoodieArchivedInstant;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieArchivedManifest;
+import org.apache.hudi.common.model.HoodieAvroIndexedRecord;
+import org.apache.hudi.common.model.HoodieFileFormat;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieCommitException;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.io.storage.HoodieAvroParquetReader;
+import org.apache.hudi.io.storage.HoodieFileReaderFactory;
+import org.apache.hudi.io.storage.HoodieFileWriter;
+import org.apache.hudi.io.storage.HoodieFileWriterFactory;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.marker.WriteMarkers;
+import org.apache.hudi.table.marker.WriteMarkersFactory;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * An archived timeline writer which organizes the files as an LSM tree.
+ */
+public class ArchivedTimelineWriter {
+  private static final Logger LOG = LoggerFactory.getLogger(ArchivedTimelineWriter.class);
+
+  private final HoodieWriteConfig config;
+  private final HoodieTable table;
+  private final HoodieTableMetaClient metaClient;
+
+  private HoodieWriteConfig writeConfig;
+
+  private ArchivedTimelineWriter(HoodieWriteConfig config, HoodieTable table) {
+    this.config = config;
+    this.table = table;
+    this.metaClient = table.getMetaClient();
+  }
+
+  public static ArchivedTimelineWriter getInstance(HoodieWriteConfig config, HoodieTable table) {
+    return new ArchivedTimelineWriter(config, table);
+  }
+
+  public void write(HoodieEngineContext context, List<ActiveInstant> instants) throws HoodieCommitException {
+    Path filePath = new Path(metaClient.getArchivePath(),
+        newFileName(instants.get(0).getInstantTime(), instants.get(instants.size() - 1).getInstantTime(), HoodieArchivedTimeline.FILE_LAYER_ZERO));
+    try (HoodieFileWriter writer = openWriter(filePath)) {
+      Schema wrapperSchema = HoodieArchivedInstant.getClassSchema();
+      LOG.info("Archiving schema " + wrapperSchema.toString());
+      for (ActiveInstant triple : instants) {
+        try {
+          deleteAnyLeftOverMarkers(context, triple);
+          // in local FS and HDFS, there could be empty completed instants due to crash.
+          final HoodieArchivedInstant metaEntry = MetadataConversionUtils.createArchivedInstant(triple, metaClient);
+          writer.write(metaEntry.getInstantTime(), new HoodieAvroIndexedRecord(metaEntry), wrapperSchema);
+        } catch (Exception e) {
+          LOG.error("Failed to archive instant: " + triple.getInstantTime(), e);
+          if (this.config.isFailOnTimelineArchivingEnabled()) {
+            throw e;
+          }
+        }
+      }
+      updateManifest(filePath.getName());
+    } catch (Exception e) {
+      throw new HoodieCommitException("Failed to archive commits", e);
+    }
+  }
+
+  public void updateManifest(String
[GitHub] [hudi] bhasudha commented on pull request #9406: [DOCS] Update Metadata table and metadata indexing related pages
bhasudha commented on PR #9406: URL: https://github.com/apache/hudi/pull/9406#issuecomment-1670918525 @codope Just FYI: this PR can be reviewed now, but it must be merged only after this [PR](https://github.com/apache/hudi/pull/9372) is merged, because its page links depend on that PR.