[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9408:
URL: https://github.com/apache/hudi/pull/9408#issuecomment-1672596213

   
   ## CI report:
   
   * 89c387c5edc9044786899bc1288e35121df600f9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19235)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] zlinsc commented on issue #9319: [SUPPORT] how to use HiveSyncConfig instead of hive configs in DataSourceWriteOptions object

2023-08-09 Thread via GitHub


zlinsc commented on issue #9319:
URL: https://github.com/apache/hudi/issues/9319#issuecomment-1672579627

   > @zlinsc You can use META_SYNC_DATABASE_NAME and META_SYNC_TABLE_NAME from HoodieSyncConfig.
   
   Will HoodieSyncConfig replace all of these configs in the future? I sometimes find it troublesome to locate the correct variable and have to search through other git repositories. I hope Hudi will unify them all.
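   
   As a minimal sketch of the suggestion above (assuming an existing DataFrame `df`; the database, table, and path names here are made up for illustration), the sync target can be set through HoodieSyncConfig options on the write:
   
   ```scala
   // Minimal sketch: point meta sync at an explicit database/table via HoodieSyncConfig.
   // Assumptions: an existing DataFrame `df`; "my_db", "my_table", and the save path are hypothetical.
   import org.apache.hudi.sync.common.HoodieSyncConfig
   
   df.write.format("hudi")
     .option("hoodie.table.name", "my_table")
     .option(HoodieSyncConfig.META_SYNC_ENABLED.key, "true")
     .option(HoodieSyncConfig.META_SYNC_DATABASE_NAME.key, "my_db")
     .option(HoodieSyncConfig.META_SYNC_TABLE_NAME.key, "my_table")
     .mode("append")
     .save("/tmp/hudi/my_table")
   ```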





[GitHub] [hudi] hudi-bot commented on pull request #9412: [HUDI-6676] Add command for CreateHoodieTableLike

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9412:
URL: https://github.com/apache/hudi/pull/9412#issuecomment-1672559947

   
   ## CI report:
   
   * 24c43e61e9a304224df2ca5e2001551974348671 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19237)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9412: [HUDI-6676] Add command for CreateHoodieTableLike

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9412:
URL: https://github.com/apache/hudi/pull/9412#issuecomment-1672553728

   
   ## CI report:
   
   * 24c43e61e9a304224df2ca5e2001551974348671 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #9412: [HUDI-6676] Add command for CreateHoodieTableLike

2023-08-09 Thread via GitHub


danny0405 commented on code in PR #9412:
URL: https://github.com/apache/hudi/pull/9412#discussion_r1289540657


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/CreateHoodieTableLikeCommand.scala:
##
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command
+
+import org.apache.hudi.SparkAdapterSupport
+import org.apache.hudi.common.model.HoodieTableType
+import org.apache.hudi.common.util.ConfigUtils
+import org.apache.spark.sql.{AnalysisException, Row, SparkSession}
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType, HoodieCatalogTable}
+import org.apache.spark.sql.catalyst.util.CharVarcharUtils
+import org.apache.spark.sql.hudi.HoodieOptionConfig
+
+import scala.util.control.NonFatal
+
+case class CreateHoodieTableLikeCommand(targetTable: TableIdentifier,
+                                        sourceTable: TableIdentifier,
+                                        fileFormat: CatalogStorageFormat,
+                                        properties: Map[String, String] = Map.empty,
+                                        ignoreIfExists: Boolean)
+  extends HoodieLeafRunnableCommand with SparkAdapterSupport {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    val catalog = sparkSession.sessionState.catalog
+
+    val tableIsExists = catalog.tableExists(targetTable)
+    if (tableIsExists) {
+      if (ignoreIfExists) {
+        // scalastyle:off
+        return Seq.empty[Row]
+        // scalastyle:on
+      } else {
+        throw new IllegalArgumentException(s"Table $targetTable already exists.")
+      }
+    }
+
+    val sourceTableDesc = catalog.getTempViewOrPermanentTableMetadata(sourceTable)
+
+    val newStorage = if (fileFormat.inputFormat.isDefined) {
+      fileFormat
+    } else {
+      sourceTableDesc.storage.copy(locationUri = fileFormat.locationUri)
+    }
+
+    // If the location is specified, we create an external table internally.
+    // Otherwise create a managed table.
+    val tblType = if (newStorage.locationUri.isEmpty) {
+      CatalogTableType.MANAGED
+    } else {
+      CatalogTableType.EXTERNAL
+    }
+
+    val targetTableProperties = if (sparkAdapter.isHoodieTable(sourceTableDesc)) {
+      HoodieOptionConfig.extractHoodieOptions(sourceTableDesc.properties) ++ properties
+    } else {
+      properties
+    }
+
+    val newTableSchema = CharVarcharUtils.getRawSchema(sourceTableDesc.schema)
+    val newTableDesc = CatalogTable(
+      identifier = targetTable,
+      tableType = tblType,
+      storage = newStorage,
+      schema = newTableSchema,
+      provider = Some("hudi"),
+      partitionColumnNames = sourceTableDesc.partitionColumnNames,
+      bucketSpec = sourceTableDesc.bucketSpec,
+      properties = targetTableProperties,
+      tracksPartitionsInCatalog = sourceTableDesc.tracksPartitionsInCatalog)
+
+    val hoodieCatalogTable = HoodieCatalogTable(sparkSession, newTableDesc)
+    // check if there are conflict between table configs defined in hoodie table and properties defined in catalog.
+    CreateHoodieTableCommand.validateTblProperties(hoodieCatalogTable)
+
+    val queryAsProp = hoodieCatalogTable.catalogProperties.get(ConfigUtils.IS_QUERY_AS_RO_TABLE)
+    if (queryAsProp.isEmpty) {

Review Comment:
   Does the user specify this option through SQL options?
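   
   For context, a hedged sketch of how the new command and a table property might be exercised from Spark SQL (the table names are hypothetical, and whether `hoodie.query.as.ro.table` can be passed this way is exactly the question above):
   
   ```scala
   // Hypothetical usage sketch, not taken from the PR: create a Hudi table from an
   // existing one and pass a property through TBLPROPERTIES.
   spark.sql(
     """CREATE TABLE target_tbl LIKE source_tbl
       |USING hudi
       |TBLPROPERTIES ('hoodie.query.as.ro.table' = 'true')
       |""".stripMargin)
   ```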






[GitHub] [hudi] hudi-bot commented on pull request #9407: asyncService log prompt incomplete

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9407:
URL: https://github.com/apache/hudi/pull/9407#issuecomment-1672526657

   
   ## CI report:
   
   * ce0c6dd5877e222dd64ce5ac6434d81168c08727 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19228)
   * a4d6304bd5173e950413dc5d65a3d04be1144303 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19236)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-6676) Add command for CreateHoodieTableLike

2023-08-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6676:
-
Labels: pull-request-available  (was: )

> Add command for CreateHoodieTableLike
> -
>
> Key: HUDI-6676
> URL: https://issues.apache.org/jira/browse/HUDI-6676
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
>  Labels: pull-request-available
>
> 1. Create table from non-hudi table
> 2. Create table from hudi table (the properties related to Hudi in the source
> Hudi table will be carried over)





[GitHub] [hudi] boneanxs opened a new pull request, #9412: [HUDI-6676] Add command for CreateHoodieTableLike

2023-08-09 Thread via GitHub


boneanxs opened a new pull request, #9412:
URL: https://github.com/apache/hudi/pull/9412

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   1. Create table from non-hudi table
   2. Create table from hudi table (the properties related to Hudi in the source Hudi table will be carried over)
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   None
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   None
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] wecharyu commented on a diff in pull request #9408: [HUDI-6671] Support 'alter table add partition' sql

2023-08-09 Thread via GitHub


wecharyu commented on code in PR #9408:
URL: https://github.com/apache/hudi/pull/9408#discussion_r1289521158


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala:
##
@@ -330,23 +330,36 @@ object HoodieSqlCommonUtils extends SparkAdapterSupport {
     val allPartitionPaths = hoodieCatalogTable.getPartitionPaths
     val enableHiveStylePartitioning = isHiveStyledPartitioning(allPartitionPaths, table)
     val enableEncodeUrl = isUrlEncodeEnabled(allPartitionPaths, table)
-    val partitionsToDrop = normalizedSpecs.map { spec =>
-      hoodieCatalogTable.partitionFields.map { partitionColumn =>
-        val encodedPartitionValue = if (enableEncodeUrl) {
-          PartitionPathEncodeUtils.escapePathName(spec(partitionColumn))
-        } else {
-          spec(partitionColumn)
-        }
-        if (enableHiveStylePartitioning) {
-          partitionColumn + "=" + encodedPartitionValue
-        } else {
Review Comment:
   It's not removed; it just reuses the common implementation of `makePartitionPath`, which handles hive-style partitioning and URL encoding.
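   
   For reference, a standalone sketch of that consolidated logic, reconstructed from the removed lines in the diff above (the real `makePartitionPath` in Hudi may differ in signature; this form is illustrative only):
   
   ```scala
   import org.apache.hudi.common.util.PartitionPathEncodeUtils
   
   def makePartitionPath(partitionColumn: String, partitionValue: String,
                         hiveStylePartitioning: Boolean, urlEncode: Boolean): String = {
     // Optionally URL-encode the value, then prefix "col=" for hive-style layouts.
     val encoded =
       if (urlEncode) PartitionPathEncodeUtils.escapePathName(partitionValue)
       else partitionValue
     if (hiveStylePartitioning) s"$partitionColumn=$encoded" else encoded
   }
   ```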






[GitHub] [hudi] hudi-bot commented on pull request #9407: asyncService log prompt incomplete

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9407:
URL: https://github.com/apache/hudi/pull/9407#issuecomment-1672520993

   
   ## CI report:
   
   * ce0c6dd5877e222dd64ce5ac6434d81168c08727 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19228)
 
   * a4d6304bd5173e950413dc5d65a3d04be1144303 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9411: [HUDI-6674] Add rollback info from metadata table in timeline commands

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9411:
URL: https://github.com/apache/hudi/pull/9411#issuecomment-1672514880

   
   ## CI report:
   
   * 6a8aa88016ab8c2b2cab779f45ac2ecd409f3742 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19234) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19233)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9408:
URL: https://github.com/apache/hudi/pull/9408#issuecomment-1672514851

   
   ## CI report:
   
   * 65e9f9828da86e4558b1830493ead64366e69fae Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19229)
   * 89c387c5edc9044786899bc1288e35121df600f9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19235)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-6676) Add command for CreateHoodieTableLike

2023-08-09 Thread Hui An (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui An updated HUDI-6676:
-
Description: 
1. Create table from non-hudi table
2. Create table from hudi table (the properties related to Hudi in the source Hudi table will be carried over)

> Add command for CreateHoodieTableLike
> -
>
> Key: HUDI-6676
> URL: https://issues.apache.org/jira/browse/HUDI-6676
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
>
> 1. Create table from non-hudi table
> 2. Create table from hudi table (the properties related to Hudi in the source
> Hudi table will be carried over)





[GitHub] [hudi] danny0405 commented on a diff in pull request #9407: asyncService log prompt incomplete

2023-08-09 Thread via GitHub


danny0405 commented on code in PR #9407:
URL: https://github.com/apache/hudi/pull/9407#discussion_r1289510346


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/HoodieAsyncService.java:
##
@@ -196,11 +196,11 @@ public void waitTillPendingAsyncServiceInstantsReducesTo(int numPending) throws
   }
 
   /**
-   * Enqueues new pending clustering instant.
+   * Enqueues new pending compaction/clustering instant.

Review Comment:
   ```suggestion
   * Enqueues new pending table service instant.
   ```






[GitHub] [hudi] danny0405 commented on a diff in pull request #9407: asyncService log prompt incomplete

2023-08-09 Thread via GitHub


danny0405 commented on code in PR #9407:
URL: https://github.com/apache/hudi/pull/9407#discussion_r1289510239


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/HoodieAsyncService.java:
##
@@ -196,11 +196,11 @@ public void waitTillPendingAsyncServiceInstantsReducesTo(int numPending) throws
   }
 
   /**
-   * Enqueues new pending clustering instant.
+   * Enqueues new pending compaction/clustering instant.
    * @param instant {@link HoodieInstant} to enqueue.
    */
   public void enqueuePendingAsyncServiceInstant(HoodieInstant instant) {
-    LOG.info("Enqueuing new pending clustering instant: " + instant.getTimestamp());
+    LOG.info("Enqueuing new pending compaction/clustering instant: " + instant.getTimestamp());

Review Comment:
   ```suggestion
       LOG.info("Enqueuing new pending table service instant: " + instant.getTimestamp());
   ```






[jira] [Created] (HUDI-6676) Add command for CreateHoodieTableLike

2023-08-09 Thread Hui An (Jira)
Hui An created HUDI-6676:


 Summary: Add command for CreateHoodieTableLike
 Key: HUDI-6676
 URL: https://issues.apache.org/jira/browse/HUDI-6676
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Hui An
Assignee: Hui An








[GitHub] [hudi] danny0405 commented on a diff in pull request #9403: Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource

2023-08-09 Thread via GitHub


danny0405 commented on code in PR #9403:
URL: https://github.com/apache/hudi/pull/9403#discussion_r1289505376


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/KafkaOffsetPostProcessor.java:
##
@@ -54,21 +56,23 @@ public static boolean shouldAddOffsets(TypedProperties props) {
   public static final String KAFKA_SOURCE_OFFSET_COLUMN = "_hoodie_kafka_source_offset";
   public static final String KAFKA_SOURCE_PARTITION_COLUMN = "_hoodie_kafka_source_partition";
   public static final String KAFKA_SOURCE_TIMESTAMP_COLUMN = "_hoodie_kafka_source_timestamp";
+  public static final String KAFKA_SOURCE_KEY_COLUMN = "_hoodie_kafka_source_key";
 
   public KafkaOffsetPostProcessor(TypedProperties props, JavaSparkContext jssc) {
     super(props, jssc);
   }
 
   @Override
   public Schema processSchema(Schema schema) {
-    // this method adds kafka offset fields namely source offset, partition and timestamp to the schema of the batch.
+    // this method adds kafka offset fields namely source offset, partition, timestamp and kafka message key to the schema of the batch.
     try {
      List<Schema.Field> fieldList = schema.getFields();
      List<Schema.Field> newFieldList = fieldList.stream()
          .map(f -> new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal())).collect(Collectors.toList());
      newFieldList.add(new Schema.Field(KAFKA_SOURCE_OFFSET_COLUMN, Schema.create(Schema.Type.LONG), "offset column", 0));
      newFieldList.add(new Schema.Field(KAFKA_SOURCE_PARTITION_COLUMN, Schema.create(Schema.Type.INT), "partition column", 0));
      newFieldList.add(new Schema.Field(KAFKA_SOURCE_TIMESTAMP_COLUMN, Schema.create(Schema.Type.LONG), "timestamp column", 0));
+     newFieldList.add(new Schema.Field(KAFKA_SOURCE_KEY_COLUMN, createNullableSchema(Schema.Type.STRING), "kafka key column", JsonProperties.NULL_VALUE));
      Schema newSchema = Schema.createRecord(schema.getName() + "_processed", schema.getDoc(), schema.getNamespace(), false, newFieldList);

Review Comment:
   Is the key always a string type? Could it be bytes in Kafka?
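   
   A sketch of the alternative being hinted at, using the stock Avro SchemaBuilder API (this is an illustration, not what the PR does; the PR declares a nullable string):
   
   ```scala
   import org.apache.avro.{Schema, SchemaBuilder}
   
   // A nullable string key, matching the PR's current choice...
   val nullableStringKey: Schema = SchemaBuilder.unionOf().nullType().and().stringType().endUnion()
   // ...versus a nullable bytes key, which would preserve arbitrary Kafka keys losslessly.
   val nullableBytesKey: Schema = SchemaBuilder.unionOf().nullType().and().bytesType().endUnion()
   ```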






[GitHub] [hudi] danny0405 commented on a diff in pull request #9403: Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource

2023-08-09 Thread via GitHub


danny0405 commented on code in PR #9403:
URL: https://github.com/apache/hudi/pull/9403#discussion_r1289505153


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/AvroConvertor.java:
##
@@ -175,9 +176,12 @@ public GenericRecord withKafkaFieldsAppended(ConsumerRecord consumerRecord) {
     for (Schema.Field field : record.getSchema().getFields()) {
       recordBuilder.set(field, record.get(field.name()));
     }
+
+    String kafkaKey = String.valueOf(consumerRecord.key());
     recordBuilder.set(KAFKA_SOURCE_OFFSET_COLUMN, consumerRecord.offset());
     recordBuilder.set(KAFKA_SOURCE_PARTITION_COLUMN, consumerRecord.partition());
     recordBuilder.set(KAFKA_SOURCE_TIMESTAMP_COLUMN, consumerRecord.timestamp());
+    recordBuilder.set(KAFKA_SOURCE_KEY_COLUMN, kafkaKey);

Review Comment:
   ```suggestion
       recordBuilder.set(KAFKA_SOURCE_KEY_COLUMN, String.valueOf(consumerRecord.key()));
   ```






[GitHub] [hudi] danny0405 commented on a diff in pull request #9408: [HUDI-6671] Support 'alter table add partition' sql

2023-08-09 Thread via GitHub


danny0405 commented on code in PR #9408:
URL: https://github.com/apache/hudi/pull/9408#discussion_r1289497305


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala:
##
@@ -330,23 +330,36 @@ object HoodieSqlCommonUtils extends SparkAdapterSupport {
     val allPartitionPaths = hoodieCatalogTable.getPartitionPaths
     val enableHiveStylePartitioning = isHiveStyledPartitioning(allPartitionPaths, table)
     val enableEncodeUrl = isUrlEncodeEnabled(allPartitionPaths, table)
-    val partitionsToDrop = normalizedSpecs.map { spec =>
-      hoodieCatalogTable.partitionFields.map { partitionColumn =>
-        val encodedPartitionValue = if (enableEncodeUrl) {
-          PartitionPathEncodeUtils.escapePathName(spec(partitionColumn))
-        } else {
-          spec(partitionColumn)
-        }
-        if (enableHiveStylePartitioning) {
-          partitionColumn + "=" + encodedPartitionValue
-        } else {
Review Comment:
   Why remove the handling of hive-style partitioning?






[GitHub] [hudi] Zouxxyy commented on pull request #4974: [HUDI-3494] Consider triggering condition of MOR compaction during archival

2023-08-09 Thread via GitHub


Zouxxyy commented on PR #4974:
URL: https://github.com/apache/hudi/pull/4974#issuecomment-1672490060

   - The default triggering condition is the number of delta commits, configured by hoodie.compact.inline.max.delta.commits. If this setting is larger than the archival config hoodie.keep.max.commits, there are not enough delta commits in the active timeline and the compaction will never happen.
   
   Why not just throw an exception when `hoodie.compact.inline.max.delta.commits` > `hoodie.keep.max.commits`?
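   
   A hypothetical guard illustrating that check (the config keys and their defaults, 5 and 30, are real Hudi options, but this validation is a sketch, not existing Hudi code):
   
   ```scala
   import java.util.Properties
   
   def validateCompactionVsArchival(props: Properties): Unit = {
     // Read the two configs, falling back to their documented defaults.
     val maxDeltaCommits = props.getProperty("hoodie.compact.inline.max.delta.commits", "5").toInt
     val keepMaxCommits = props.getProperty("hoodie.keep.max.commits", "30").toInt
     require(maxDeltaCommits <= keepMaxCommits,
       s"hoodie.compact.inline.max.delta.commits ($maxDeltaCommits) must not exceed " +
         s"hoodie.keep.max.commits ($keepMaxCommits), or inline compaction may never run")
   }
   ```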





[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9408:
URL: https://github.com/apache/hudi/pull/9408#issuecomment-1672487101

   
   ## CI report:
   
   * 65e9f9828da86e4558b1830493ead64366e69fae Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19229)
 
   * 89c387c5edc9044786899bc1288e35121df600f9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-6675) Clean action will delete the whole table

2023-08-09 Thread sanqingleo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sanqingleo updated HUDI-6675:
-
Summary: Clean action will delete the whole table  (was: InsertOverwrite 
will delete the whole table)

> Clean action will delete the whole table
> 
>
> Key: HUDI-6675
> URL: https://issues.apache.org/jira/browse/HUDI-6675
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Affects Versions: 0.11.1, 0.13.0
> Environment: hudi 0.11 and 0.13.
> spark 3.4
>Reporter: sanqingleo
>Priority: Major
> Attachments: image-2023-08-10-10-35-02-798.png, 
> image-2023-08-10-10-37-05-339.png
>
>
> h1. Abstract
> When I use the insert_overwrite feature, in both Spark SQL and the API, it cleans
> the whole table when the table is not partitioned,
> and then throws this exception:
> !image-2023-08-10-10-37-05-339.png!
> h1. Version
>  # hudi 0.11 and 0.13
>  # spark 3.4
> h1. Bug Position
> org.apache.hudi.table.action.clean.CleanActionExecutor#deleteFileAndGetResult
> !image-2023-08-10-10-35-02-798.png!
> h1. How to Reproduce
> Run the job 4 times; the fourth run will trigger the clean action.
> 0.11: both SQL and API
> 0.13: API only
>  
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
> import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
> object InsertOverwriteTest {
>   def main(array: Array[String]): Unit = {
> val spark = SparkSession.builder()
>   .appName("TestInsertOverwrite")
>   .master("local[4]")
>   .config("spark.sql.extensions", 
> "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
>   .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>   .config("spark.sql.catalog.spark_catalog" 
> ,"org.apache.spark.sql.hudi.catalog.HoodieCatalog")
>   .getOrCreate()
> spark.conf.set("hoodie.index.type", "BUCKET")
> spark.conf.set("hoodie.storage.layout.type", "BUCKET")
> spark.conf.set("HADOOP_USER_NAME", "parallels")
> System.setProperty("HADOOP_USER_NAME", "parallels")
> var seq = List(
>   Row("uuid_01", "27", "2022-09-23", "par_01"),
>   Row("uuid_02", "21", "2022-09-23", "par_02"),
>   Row("uuid_03", "23", "2022-09-23", "par_04"),
>   Row("uuid_04", "24", "2022-09-23", "par_02"),
>   Row("uuid_05", "26", "2022-09-23", "par_01"),
>   Row("uuid_06", "20", "2022-09-23", "par_03"),
> )
> var rdd = spark.sparkContext.parallelize(seq)
> var structType: StructType = StructType(Array(
>   StructField("uuid", DataTypes.StringType, nullable = true),
>   StructField("age", DataTypes.StringType, nullable = true),
>   StructField("ts", DataTypes.StringType, nullable = true),
>   StructField("par", DataTypes.StringType, nullable = true)
> ))
> var df1 = spark.createDataFrame(rdd, structType)
>   .createOrReplaceTempView("compact_test_num")
> var df: DataFrame = spark.sql(" select uuid, age, ts, par from 
> compact_test_num limit 10")
> df.write.format("org.apache.hudi")
>   .option(RECORDKEY_FIELD.key, "uuid")
>   .option(PRECOMBINE_FIELD.key, "ts")
> //  .option(PARTITIONPATH_FIELD.key(), "par")
>   .option("hoodie.table.keygenerator.class", 
> "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
>   .option(KEYGENERATOR_CLASS_NAME.key, 
> "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
> //  .option(KEYGENERATOR_CLASS_NAME.key, 
> "org.apache.hudi.keygen.ComplexKeyGenerator")
>   .option(OPERATION.key, INSERT_OVERWRITE_OPERATION_OPT_VAL)
>   .option(TABLE_TYPE.key, COW_TABLE_TYPE_OPT_VAL)
>   .option("hoodie.metadata.enable", "false")
>   .option("hoodie.index.type", "BUCKET")
>   .option("hoodie.bucket.index.hash.field", "uuid")
>   .option("hoodie.bucket.index.num.buckets", "2")
>   .option("hoodie.storage.layout.type", "BUCKET")
>   .option("hoodie.storage.layout.partitioner.class", 
> "org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner")
>   .option("hoodie.table.name", "cow_20230801_012")
>   .option("hoodie.upsert.shuffle.parallelism", "2")
>   .option("hoodie.insert.shuffle.parallelism", "2")
>   .option("hoodie.delete.shuffle.parallelism", "2")
>   .option("hoodie.clean.max.commits", "2")
>   .option("hoodie.cleaner.commits.retained", "2")
>   .option("hoodie.datasource.write.hive_style_partitioning", "true")
>   .mode(SaveMode.Append)
>   .save("hdfs://bigdata01:9000/hudi_test/cow_20230801_012")
>   }
> }
>  {code}





[jira] [Reopened] (HUDI-6675) InsertOverwrite will delete the whole table

2023-08-09 Thread sanqingleo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sanqingleo reopened HUDI-6675:
--

> InsertOverwrite will delete the whole table
> ---
>
> Key: HUDI-6675
> URL: https://issues.apache.org/jira/browse/HUDI-6675
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Affects Versions: 0.11.1, 0.13.0
> Environment: hudi 0.11 and 0.13.
> spark 3.4
>Reporter: sanqingleo
>Priority: Major
> Attachments: image-2023-08-10-10-35-02-798.png, 
> image-2023-08-10-10-37-05-339.png
>
>
> h1. Abstract
> When I use the insert_overwrite feature, in both Spark SQL and the API, it cleans
> the whole table when the table is not partitioned,
> and then throws this exception:
> !image-2023-08-10-10-37-05-339.png!
> h1. Version
>  # hudi 0.11 and 0.13
>  # spark 3.4
> h1. Bug Position
> org.apache.hudi.table.action.clean.CleanActionExecutor#deleteFileAndGetResult
> !image-2023-08-10-10-35-02-798.png!
> h1. How to Reproduce
> Run the job 4 times; the fourth run will trigger the clean action.
> 0.11: both SQL and API
> 0.13: API only
>  
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
> import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
> object InsertOverwriteTest {
>   def main(array: Array[String]): Unit = {
> val spark = SparkSession.builder()
>   .appName("TestInsertOverwrite")
>   .master("local[4]")
>   .config("spark.sql.extensions", 
> "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
>   .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>   .config("spark.sql.catalog.spark_catalog" 
> ,"org.apache.spark.sql.hudi.catalog.HoodieCatalog")
>   .getOrCreate()
> spark.conf.set("hoodie.index.type", "BUCKET")
> spark.conf.set("hoodie.storage.layout.type", "BUCKET")
> spark.conf.set("HADOOP_USER_NAME", "parallels")
> System.setProperty("HADOOP_USER_NAME", "parallels")
> var seq = List(
>   Row("uuid_01", "27", "2022-09-23", "par_01"),
>   Row("uuid_02", "21", "2022-09-23", "par_02"),
>   Row("uuid_03", "23", "2022-09-23", "par_04"),
>   Row("uuid_04", "24", "2022-09-23", "par_02"),
>   Row("uuid_05", "26", "2022-09-23", "par_01"),
>   Row("uuid_06", "20", "2022-09-23", "par_03"),
> )
> var rdd = spark.sparkContext.parallelize(seq)
> var structType: StructType = StructType(Array(
>   StructField("uuid", DataTypes.StringType, nullable = true),
>   StructField("age", DataTypes.StringType, nullable = true),
>   StructField("ts", DataTypes.StringType, nullable = true),
>   StructField("par", DataTypes.StringType, nullable = true)
> ))
> var df1 = spark.createDataFrame(rdd, structType)
>   .createOrReplaceTempView("compact_test_num")
> var df: DataFrame = spark.sql(" select uuid, age, ts, par from 
> compact_test_num limit 10")
> df.write.format("org.apache.hudi")
>   .option(RECORDKEY_FIELD.key, "uuid")
>   .option(PRECOMBINE_FIELD.key, "ts")
> //  .option(PARTITIONPATH_FIELD.key(), "par")
>   .option("hoodie.table.keygenerator.class", 
> "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
>   .option(KEYGENERATOR_CLASS_NAME.key, 
> "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
> //  .option(KEYGENERATOR_CLASS_NAME.key, 
> "org.apache.hudi.keygen.ComplexKeyGenerator")
>   .option(OPERATION.key, INSERT_OVERWRITE_OPERATION_OPT_VAL)
>   .option(TABLE_TYPE.key, COW_TABLE_TYPE_OPT_VAL)
>   .option("hoodie.metadata.enable", "false")
>   .option("hoodie.index.type", "BUCKET")
>   .option("hoodie.bucket.index.hash.field", "uuid")
>   .option("hoodie.bucket.index.num.buckets", "2")
>   .option("hoodie.storage.layout.type", "BUCKET")
>   .option("hoodie.storage.layout.partitioner.class", 
> "org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner")
>   .option("hoodie.table.name", "cow_20230801_012")
>   .option("hoodie.upsert.shuffle.parallelism", "2")
>   .option("hoodie.insert.shuffle.parallelism", "2")
>   .option("hoodie.delete.shuffle.parallelism", "2")
>   .option("hoodie.clean.max.commits", "2")
>   .option("hoodie.cleaner.commits.retained", "2")
>   .option("hoodie.datasource.write.hive_style_partitioning", "true")
>   .mode(SaveMode.Append)
>   .save("hdfs://bigdata01:9000/hudi_test/cow_20230801_012")
>   }
> }
>  {code}





[jira] [Resolved] (HUDI-6675) InsertOverwrite will delete the whole table

2023-08-09 Thread sanqingleo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sanqingleo resolved HUDI-6675.
--

> InsertOverwrite will delete the whole table
> ---
>
> Key: HUDI-6675
> URL: https://issues.apache.org/jira/browse/HUDI-6675
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Affects Versions: 0.11.1, 0.13.0
> Environment: hudi 0.11 and 0.13.
> spark 3.4
>Reporter: sanqingleo
>Priority: Major
> Attachments: image-2023-08-10-10-35-02-798.png, 
> image-2023-08-10-10-37-05-339.png
>
>
> h1. Abstract
> When I use the insert_overwrite feature, in both Spark SQL and the API, it cleans
> the whole table when the table is not partitioned,
> and then throws this exception:
> !image-2023-08-10-10-37-05-339.png!
> h1. Version
>  # hudi 0.11 and 0.13
>  # spark 3.4
> h1. Bug Position
> org.apache.hudi.table.action.clean.CleanActionExecutor#deleteFileAndGetResult
> !image-2023-08-10-10-35-02-798.png!
> h1. How to Reproduce
> Run the job 4 times; the fourth run will trigger the clean action.
> 0.11: both SQL and API
> 0.13: API only
>  
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
> import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
> object InsertOverwriteTest {
>   def main(array: Array[String]): Unit = {
> val spark = SparkSession.builder()
>   .appName("TestInsertOverwrite")
>   .master("local[4]")
>   .config("spark.sql.extensions", 
> "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
>   .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>   .config("spark.sql.catalog.spark_catalog" 
> ,"org.apache.spark.sql.hudi.catalog.HoodieCatalog")
>   .getOrCreate()
> spark.conf.set("hoodie.index.type", "BUCKET")
> spark.conf.set("hoodie.storage.layout.type", "BUCKET")
> spark.conf.set("HADOOP_USER_NAME", "parallels")
> System.setProperty("HADOOP_USER_NAME", "parallels")
> var seq = List(
>   Row("uuid_01", "27", "2022-09-23", "par_01"),
>   Row("uuid_02", "21", "2022-09-23", "par_02"),
>   Row("uuid_03", "23", "2022-09-23", "par_04"),
>   Row("uuid_04", "24", "2022-09-23", "par_02"),
>   Row("uuid_05", "26", "2022-09-23", "par_01"),
>   Row("uuid_06", "20", "2022-09-23", "par_03"),
> )
> var rdd = spark.sparkContext.parallelize(seq)
> var structType: StructType = StructType(Array(
>   StructField("uuid", DataTypes.StringType, nullable = true),
>   StructField("age", DataTypes.StringType, nullable = true),
>   StructField("ts", DataTypes.StringType, nullable = true),
>   StructField("par", DataTypes.StringType, nullable = true)
> ))
> var df1 = spark.createDataFrame(rdd, structType)
>   .createOrReplaceTempView("compact_test_num")
> var df: DataFrame = spark.sql(" select uuid, age, ts, par from 
> compact_test_num limit 10")
> df.write.format("org.apache.hudi")
>   .option(RECORDKEY_FIELD.key, "uuid")
>   .option(PRECOMBINE_FIELD.key, "ts")
> //  .option(PARTITIONPATH_FIELD.key(), "par")
>   .option("hoodie.table.keygenerator.class", 
> "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
>   .option(KEYGENERATOR_CLASS_NAME.key, 
> "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
> //  .option(KEYGENERATOR_CLASS_NAME.key, 
> "org.apache.hudi.keygen.ComplexKeyGenerator")
>   .option(OPERATION.key, INSERT_OVERWRITE_OPERATION_OPT_VAL)
>   .option(TABLE_TYPE.key, COW_TABLE_TYPE_OPT_VAL)
>   .option("hoodie.metadata.enable", "false")
>   .option("hoodie.index.type", "BUCKET")
>   .option("hoodie.bucket.index.hash.field", "uuid")
>   .option("hoodie.bucket.index.num.buckets", "2")
>   .option("hoodie.storage.layout.type", "BUCKET")
>   .option("hoodie.storage.layout.partitioner.class", 
> "org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner")
>   .option("hoodie.table.name", "cow_20230801_012")
>   .option("hoodie.upsert.shuffle.parallelism", "2")
>   .option("hoodie.insert.shuffle.parallelism", "2")
>   .option("hoodie.delete.shuffle.parallelism", "2")
>   .option("hoodie.clean.max.commits", "2")
>   .option("hoodie.cleaner.commits.retained", "2")
>   .option("hoodie.datasource.write.hive_style_partitioning", "true")
>   .mode(SaveMode.Append)
>   .save("hdfs://bigdata01:9000/hudi_test/cow_20230801_012")
>   }
> }
>  {code}





[jira] [Created] (HUDI-6675) InsertOverwrite will delete the whole table

2023-08-09 Thread sanqingleo (Jira)
sanqingleo created HUDI-6675:


 Summary: InsertOverwrite will delete the whole table
 Key: HUDI-6675
 URL: https://issues.apache.org/jira/browse/HUDI-6675
 Project: Apache Hudi
  Issue Type: Bug
  Components: cleaning
Affects Versions: 0.13.0, 0.11.1
 Environment: hudi 0.11 and 0.13.
spark 3.4

Reporter: sanqingleo
 Attachments: image-2023-08-10-10-35-02-798.png, 
image-2023-08-10-10-37-05-339.png

h1. Abstract

When I use the insert_overwrite feature, in both Spark SQL and the API, it cleans
the whole table when the table is not partitioned,

and then throws this exception:

!image-2023-08-10-10-37-05-339.png!
h1. Version
 # hudi 0.11 and 0.13
 # spark 3.4

h1. Bug Position

org.apache.hudi.table.action.clean.CleanActionExecutor#deleteFileAndGetResult
!image-2023-08-10-10-35-02-798.png!
h1. How to Reproduce

Run the job 4 times; the fourth run will trigger the clean action.

0.11: both SQL and API

0.13: API only

 
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}

object InsertOverwriteTest {
  def main(array: Array[String]): Unit = {
val spark = SparkSession.builder()
  .appName("TestInsertOverwrite")
  .master("local[4]")
  .config("spark.sql.extensions", 
"org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.catalog.spark_catalog" 
,"org.apache.spark.sql.hudi.catalog.HoodieCatalog")
  .getOrCreate()

spark.conf.set("hoodie.index.type", "BUCKET")
spark.conf.set("hoodie.storage.layout.type", "BUCKET")
spark.conf.set("HADOOP_USER_NAME", "parallels")
System.setProperty("HADOOP_USER_NAME", "parallels")

var seq = List(
  Row("uuid_01", "27", "2022-09-23", "par_01"),
  Row("uuid_02", "21", "2022-09-23", "par_02"),
  Row("uuid_03", "23", "2022-09-23", "par_04"),
  Row("uuid_04", "24", "2022-09-23", "par_02"),
  Row("uuid_05", "26", "2022-09-23", "par_01"),
  Row("uuid_06", "20", "2022-09-23", "par_03"),
)

var rdd = spark.sparkContext.parallelize(seq)
var structType: StructType = StructType(Array(
  StructField("uuid", DataTypes.StringType, nullable = true),
  StructField("age", DataTypes.StringType, nullable = true),
  StructField("ts", DataTypes.StringType, nullable = true),
  StructField("par", DataTypes.StringType, nullable = true)
))

var df1 = spark.createDataFrame(rdd, structType)
  .createOrReplaceTempView("compact_test_num")

var df: DataFrame = spark.sql(" select uuid, age, ts, par from 
compact_test_num limit 10")

df.write.format("org.apache.hudi")
  .option(RECORDKEY_FIELD.key, "uuid")
  .option(PRECOMBINE_FIELD.key, "ts")
//  .option(PARTITIONPATH_FIELD.key(), "par")
  .option("hoodie.table.keygenerator.class", 
"org.apache.hudi.keygen.NonpartitionedKeyGenerator")
  .option(KEYGENERATOR_CLASS_NAME.key, 
"org.apache.hudi.keygen.NonpartitionedKeyGenerator")
//  .option(KEYGENERATOR_CLASS_NAME.key, 
"org.apache.hudi.keygen.ComplexKeyGenerator")
  .option(OPERATION.key, INSERT_OVERWRITE_OPERATION_OPT_VAL)
  .option(TABLE_TYPE.key, COW_TABLE_TYPE_OPT_VAL)
  .option("hoodie.metadata.enable", "false")
  .option("hoodie.index.type", "BUCKET")
  .option("hoodie.bucket.index.hash.field", "uuid")
  .option("hoodie.bucket.index.num.buckets", "2")
  .option("hoodie.storage.layout.type", "BUCKET")
  .option("hoodie.storage.layout.partitioner.class", 
"org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner")
  .option("hoodie.table.name", "cow_20230801_012")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option("hoodie.insert.shuffle.parallelism", "2")
  .option("hoodie.delete.shuffle.parallelism", "2")
  .option("hoodie.clean.max.commits", "2")
  .option("hoodie.cleaner.commits.retained", "2")
  .option("hoodie.datasource.write.hive_style_partitioning", "true")
  .mode(SaveMode.Append)
  .save("hdfs://bigdata01:9000/hudi_test/cow_20230801_012")
  }

}
 {code}





[GitHub] [hudi] someguyLi commented on issue #9363: [SUPPORT] Streaming query loss delete data

2023-08-09 Thread via GitHub


someguyLi commented on issue #9363:
URL: https://github.com/apache/hudi/issues/9363#issuecomment-1672439581

   > The Hudi table is used like a message queue, so TTL is a general solution for keeping the records alive. There is no good solution for this: Kafka throws an exception or allows the consumer to fall back to the latest/oldest offset for recovery, but neither of these works very well for a changelog, because any lost change would produce incorrect results.
   
   Thanks for your support, I will find another way.





[GitHub] [hudi] hudi-bot commented on pull request #9411: [HUDI-6674] Add rollback info from metadata table in timeline commands

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9411:
URL: https://github.com/apache/hudi/pull/9411#issuecomment-1672423674

   
   ## CI report:
   
   * 6a8aa88016ab8c2b2cab779f45ac2ecd409f3742 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19234) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19233)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8327: [HUDI-5361] Propagate all hoodie configs from spark sqlconf, but don't overwrite values already set

2023-08-09 Thread via GitHub


hudi-bot commented on PR #8327:
URL: https://github.com/apache/hudi/pull/8327#issuecomment-1672385859

   
   ## CI report:
   
   * 94e4c2e74c6170ceee8c303f7237bd10f2cd334f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19232)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9410: [HUDI-6673] Fix Incremental Query Syntax - Spark SQL Core Flow Test

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9410:
URL: https://github.com/apache/hudi/pull/9410#issuecomment-1672381897

   
   ## CI report:
   
   * a3bd3418eccb373f200139996d34b8cc71913a62 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19231)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9409:
URL: https://github.com/apache/hudi/pull/9409#issuecomment-1672296128

   
   ## CI report:
   
   * d567d80ea610ed8eca248901d310bd40ae4bf8e5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19230)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9411: [HUDI-6674] Add rollback info from metadata table in timeline commands

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9411:
URL: https://github.com/apache/hudi/pull/9411#issuecomment-1672282598

   
   ## CI report:
   
   * 6a8aa88016ab8c2b2cab779f45ac2ecd409f3742 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19234)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8327: [HUDI-5361] Propagate all hoodie configs from spark sqlconf, but don't overwrite values already set

2023-08-09 Thread via GitHub


hudi-bot commented on PR #8327:
URL: https://github.com/apache/hudi/pull/8327#issuecomment-1672232135

   
   ## CI report:
   
   * b3388a3bb559227d2415e747681326f6109b4cc2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15998)
   * 94e4c2e74c6170ceee8c303f7237bd10f2cd334f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19232)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] neeruks commented on issue #5348: [SUPPORT]org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20220418194506064

2023-08-09 Thread via GitHub


neeruks commented on issue #5348:
URL: https://github.com/apache/hudi/issues/5348#issuecomment-1672230698

   I am also getting the same error. I am using Glue to read the CSV file and 
write it into a Hudi table.
   
   py4j.protocol.Py4JJavaError: An error occurred while calling o326.save.
   : org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for 
commit time 20230809204110303





[GitHub] [hudi] neeruks commented on issue #2970: [SUPPORT] Failed to upsert for commit time

2023-08-09 Thread via GitHub


neeruks commented on issue #2970:
URL: https://github.com/apache/hudi/issues/2970#issuecomment-1672228563

   I am also getting the same error. I am using Glue to read the CSV file and 
write it into a Hudi table.
   
   py4j.protocol.Py4JJavaError: An error occurred while calling o326.save.
   : org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for 
commit time 20230809204110303
   
   
   





[GitHub] [hudi] hudi-bot commented on pull request #9410: [HUDI-6673] Fix Incremental Query Syntax - Spark SQL Core Flow Test

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9410:
URL: https://github.com/apache/hudi/pull/9410#issuecomment-1672223948

   
   ## CI report:
   
   * a3bd3418eccb373f200139996d34b8cc71913a62 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19231)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9411: [HUDI-6674] Add rollback info from metadata table in timeline commands

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9411:
URL: https://github.com/apache/hudi/pull/9411#issuecomment-1672224229

   
   ## CI report:
   
   * 6a8aa88016ab8c2b2cab779f45ac2ecd409f3742 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8327: [HUDI-5361] Propagate all hoodie configs from spark sqlconf, but don't overwrite values already set

2023-08-09 Thread via GitHub


hudi-bot commented on PR #8327:
URL: https://github.com/apache/hudi/pull/8327#issuecomment-1672221796

   
   ## CI report:
   
   * b3388a3bb559227d2415e747681326f6109b4cc2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15998)
 
   * 94e4c2e74c6170ceee8c303f7237bd10f2cd334f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9410: [HUDI-6673] Fix Incremental Query Syntax - Spark SQL Core Flow Test

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9410:
URL: https://github.com/apache/hudi/pull/9410#issuecomment-1672215254

   
   ## CI report:
   
   * a3bd3418eccb373f200139996d34b8cc71913a62 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-6674) Add rollback info from metadata table in timeline commands

2023-08-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6674:
-
Labels: pull-request-available  (was: )

> Add rollback info from metadata table in timeline commands
> --
>
> Key: HUDI-6674
> URL: https://issues.apache.org/jira/browse/HUDI-6674
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] yihua opened a new pull request, #9411: [HUDI-6674] Add rollback info from metadata table in timeline commands

2023-08-09 Thread via GitHub


yihua opened a new pull request, #9411:
URL: https://github.com/apache/hudi/pull/9411

   ### Change Logs
   
   This PR adds the rollback information from the metadata table to the output of the timeline commands in Hudi CLI, given that the metadata table also encounters more rollbacks now. To keep the table concise, the rollback information is added to the "Action" (for the data table) or "MT Action" (for the metadata table) column, instead of having an independent column showing the information.
   
   Here's the new output:
   ```
   hudi:hoodie_table->timeline show active --limit 200 --show-time-seconds --show-rollback-info
   ╔═════╤═══════════════════╤═══════════════════╤═══════════╤════════════════╤════════════════╤════════════════╗
   ║ No. │ Instant           │ Action            │ State     │ Requested      │ Inflight       │ Completed      ║
   ║     │                   │                   │           │ Time           │ Time           │ Time           ║
   ╠═════╪═══════════════════╪═══════════════════╪═══════════╪════════════════╪════════════════╪════════════════╣
   ...
   ╟─────┼───────────────────┼───────────────────┼───────────┼────────────────┼────────────────┼────────────────╢
   ║ 11  │ 20230807154601569 │ rollback          │ COMPLETED │ 08-07 08:46:03 │ 08-07 08:46:03 │ 08-07 08:47:58 ║
   ║     │                   │ Rolls back        │           │                │                │                ║
   ║     │                   │ 20230807154346625 │           │                │                │                ║
   ╟─────┼───────────────────┼───────────────────┼───────────┼────────────────┼────────────────┼────────────────╢
   ║ 12  │ 20230807154947753 │ rollback          │ COMPLETED │ 08-07 08:49:49 │ 08-07 08:49:49 │ 08-07 08:51:46 ║
   ║     │                   │ Rolls back        │           │                │                │                ║
   ║     │                   │ 20230807154720087 │           │                │                │                ║
   ╟─────┼───────────────────┼───────────────────┼───────────┼────────────────┼────────────────┼────────────────╢
   ║ 13  │ 20230807155105131 │ commit            │ COMPLETED │ 08-07 08:51:47 │ 08-07 08:54:29 │ 08-07 08:55:42 ║
   ╟─────┼───────────────────┼───────────────────┼───────────┼────────────────┼────────────────┼────────────────╢
   ...
   hudi:hoodie_table->timeline show active --with-metadata-table --limit 200 --show-time-seconds
   ╔═════╤═══════════════════╤═══════════════════╤═══════════╤════════════════╤════════════════╤════════════════╤═══════════════════╤═══════════╤════════════════╤════════════════╤════════════════╗
   ║ No. │ Instant           │ Action            │ State     │ Requested      │ Inflight       │ Completed      │ MT                │ MT        │ MT             │ MT             │ MT             ║
   ║     │                   │                   │           │ Time           │ Time           │ Time           │ Action            │ State     │ Requested      │ Inflight       │ Completed      ║
   ║     │                   │                   │           │                │                │                │                   │           │ Time           │ Time           │ Time           ║
   ╠═════╪═══════════════════╪═══════════════════╪═══════════╪════════════════╪════════════════╪════════════════╪═══════════════════╪═══════════╪════════════════╪════════════════╪════════════════╣
   ...
   ╟─────┼───────────────────┼───────────────────┼───────────┼────────────────┼────────────────┼────────────────┼───────────────────┼───────────┼────────────────┼────────────────┼────────────────╢
   ║ 66  │ 20230807155157772 │ -                 │ -         │ -              │ -              │ -              │ rollback          │ COMPLETED │ 08-07 08:51:59 │ 08-07 08:52:00 │ 08-07 08:52:01 ║
   ║     │                   │                   │           │                │                │                │ Rolls back        │           │                │                │                ║
   ║     │                   │                   │           │                │                │                │ 20230807154406919 │           │                │                │                ║
   ╟─────┼───────────────────┼───────────────────┼───────────┼────────────────┼────────────────┼────────────────┼───────────────────┼───────────┼────────────────┼────────────────┼────────────────╢
   ║ 67  │ 20230807155547486 │ commit            │ INFLIGHT  │ 08-07 08:56:06 │ 08-07 08:58:27 │ -              │ -                 │ -         │ -              │ -              │ -              ║
   ║     │                   │ Rolled back by    │           │                │                │                │                   │           │                │                │                ║
   ║     │                   │ 20230807160141230 │

[jira] [Updated] (HUDI-6674) Add rollback info from metadata table in timeline commands

2023-08-09 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6674:

Fix Version/s: 0.14.0

> Add rollback info from metadata table in timeline commands
> --
>
> Key: HUDI-6674
> URL: https://issues.apache.org/jira/browse/HUDI-6674
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6674) Add rollback info from metadata table in timeline commands

2023-08-09 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-6674:
---

 Summary: Add rollback info from metadata table in timeline commands
 Key: HUDI-6674
 URL: https://issues.apache.org/jira/browse/HUDI-6674
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6674) Add rollback info from metadata table in timeline commands

2023-08-09 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-6674:
---

Assignee: Ethan Guo

> Add rollback info from metadata table in timeline commands
> --
>
> Key: HUDI-6674
> URL: https://issues.apache.org/jira/browse/HUDI-6674
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6673) Spark SQL core flow test incremental query syntax is wrong

2023-08-09 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-6673:
--
Status: In Progress  (was: Open)

> Spark SQL core flow test incremental query syntax is wrong
> --
>
> Key: HUDI-6673
> URL: https://issues.apache.org/jira/browse/HUDI-6673
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql, tests-ci
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> missing the incremental format argument



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6673) Spark SQL core flow test incremental query syntax is wrong

2023-08-09 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-6673:
--
Status: Patch Available  (was: In Progress)

> Spark SQL core flow test incremental query syntax is wrong
> --
>
> Key: HUDI-6673
> URL: https://issues.apache.org/jira/browse/HUDI-6673
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql, tests-ci
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> missing the incremental format argument



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] jonvex opened a new pull request, #9410: [HUDI-6673] Fix Incremental Query Syntax - Spark SQL Core Flow Test

2023-08-09 Thread via GitHub


jonvex opened a new pull request, #9410:
URL: https://github.com/apache/hudi/pull/9410

   ### Change Logs
   
   Test runs now
   
   ### Impact
   
   Testing for release
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6673) Spark SQL core flow test incremental query syntax is wrong

2023-08-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6673:
-
Labels: pull-request-available  (was: )

> Spark SQL core flow test incremental query syntax is wrong
> --
>
> Key: HUDI-6673
> URL: https://issues.apache.org/jira/browse/HUDI-6673
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql, tests-ci
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> missing the incremental format argument



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6673) Spark SQL core flow test incremental query syntax is wrong

2023-08-09 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-6673:
-

 Summary: Spark SQL core flow test incremental query syntax is wrong
 Key: HUDI-6673
 URL: https://issues.apache.org/jira/browse/HUDI-6673
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark-sql, tests-ci
Reporter: Jonathan Vexler
Assignee: Jonathan Vexler


missing the incremental format argument
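
For reference, a minimal sketch of a Spark incremental read with the format argument supplied explicitly (the table path and begin instant time below are hypothetical):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalReadExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-incremental-read")
        .master("local[*]")
        .getOrCreate();

    // An incremental query needs the "hudi" format stated explicitly,
    // plus the incremental query type and a begin instant time.
    Dataset<Row> inc = spark.read()
        .format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20230807154601569")
        .load("/tmp/hoodie_table"); // hypothetical table path

    inc.show(false);
  }
}
```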



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9409:
URL: https://github.com/apache/hudi/pull/9409#issuecomment-1672081146

   
   ## CI report:
   
   * d567d80ea610ed8eca248901d310bd40ae4bf8e5 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19230)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9408:
URL: https://github.com/apache/hudi/pull/9408#issuecomment-1672081089

   
   ## CI report:
   
   * 65e9f9828da86e4558b1830493ead64366e69fae Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19229)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9409:
URL: https://github.com/apache/hudi/pull/9409#issuecomment-1672070399

   
   ## CI report:
   
   * d567d80ea610ed8eca248901d310bd40ae4bf8e5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9407: asyncService log prompt incomplete

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9407:
URL: https://github.com/apache/hudi/pull/9407#issuecomment-1672058852

   
   ## CI report:
   
   * ce0c6dd5877e222dd64ce5ac6434d81168c08727 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19228)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] emkornfield commented on issue #9355: [SUPPORT] Problem while reading from BQ tables which are synced on Hudi table

2023-08-09 Thread via GitHub


emkornfield commented on issue #9355:
URL: https://github.com/apache/hudi/issues/9355#issuecomment-1672050574

   This sounds like the likely cause.  The solution that uses a view for 
compatibility with Hudi is inherently flawed.  Using the newly contributed 
[manifest 
file](https://cloud.google.com/bigquery/docs/query-open-table-format-using-manifest-files)
 approach is going to be more robust along several dimensions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-6663) Investigate Bootstrap Performance

2023-08-09 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler reassigned HUDI-6663:
-

Assignee: Jonathan Vexler

> Investigate Bootstrap Performance
> -
>
> Key: HUDI-6663
> URL: https://issues.apache.org/jira/browse/HUDI-6663
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> Bootstrap performance seems slow even though reader schemas look correct



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6663) Investigate Bootstrap Performance

2023-08-09 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-6663:
--
Status: In Progress  (was: Open)

> Investigate Bootstrap Performance
> -
>
> Key: HUDI-6663
> URL: https://issues.apache.org/jira/browse/HUDI-6663
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> Bootstrap performance seems slow even though reader schemas look correct



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6663) Investigate Bootstrap Performance

2023-08-09 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-6663:
--
Status: Patch Available  (was: In Progress)

> Investigate Bootstrap Performance
> -
>
> Key: HUDI-6663
> URL: https://issues.apache.org/jira/browse/HUDI-6663
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> Bootstrap performance seems slow even though reader schemas look correct



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] jonvex opened a new pull request, #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices

2023-08-09 Thread via GitHub


jonvex opened a new pull request, #9409:
URL: https://github.com/apache/hudi/pull/9409

   ### Change Logs
   
   Remove the broadcast when sending the file slices.
   
   ### Impact
   
   On a 1 TB TPC-DS bootstrap run (queries 1-14), the performance gap between the new file format and fast bootstrap went from 2.23x to 1.01x. Similar performance gains are expected for MOR tables where many file slices have log files.
   
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6663) Investigate Bootstrap Performance

2023-08-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6663:
-
Labels: pull-request-available  (was: )

> Investigate Bootstrap Performance
> -
>
> Key: HUDI-6663
> URL: https://issues.apache.org/jira/browse/HUDI-6663
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> Bootstrap performance seems slow even though reader schemas look correct



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9408:
URL: https://github.com/apache/hudi/pull/9408#issuecomment-1671887323

   
   ## CI report:
   
   * 65e9f9828da86e4558b1830493ead64366e69fae Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19229)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on issue #9355: [SUPPORT] Problem while reading from BQ tables which are synced on Hudi table

2023-08-09 Thread via GitHub


the-other-tim-brown commented on issue #9355:
URL: https://github.com/apache/hudi/issues/9355#issuecomment-1671851547

   @ranjanankur I'm taking a look at this and tracking with the JIRA ticket 
here as well https://issues.apache.org/jira/browse/HUDI-6672 
   
   I've reached out to Google Cloud to confirm that this is an issue with updating the manifest while a query is running. The solution I'm working on will version these manifests so we do not modify the file while a query is in flight.
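
   For illustration, one common shape such versioning can take (a hedged sketch, not the actual PR; the names and layout here are invented): write each manifest under a fresh versioned name and atomically swap a small pointer file, so readers mid-query keep reading the old, immutable version.

   ```java
   import java.nio.charset.StandardCharsets;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.nio.file.StandardCopyOption;

   public class VersionedManifestSketch {
     // Publish a new manifest version without mutating the file readers may hold open.
     public static void publish(Path dir, long version, String manifestBody) throws Exception {
       Path manifest = dir.resolve("manifest_" + version);
       Files.write(manifest, manifestBody.getBytes(StandardCharsets.UTF_8));

       // Atomically repoint "_latest" at the new manifest file.
       Path pointerTmp = dir.resolve("_latest.tmp");
       Files.write(pointerTmp, manifest.getFileName().toString().getBytes(StandardCharsets.UTF_8));
       Files.move(pointerTmp, dir.resolve("_latest"),
           StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
     }
   }
   ```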


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6672) BigQuery Sync updates while queries running cause failures

2023-08-09 Thread Timothy Brown (Jira)
Timothy Brown created HUDI-6672:
---

 Summary: BigQuery Sync updates while queries running cause failures
 Key: HUDI-6672
 URL: https://issues.apache.org/jira/browse/HUDI-6672
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Timothy Brown


Issue was reported by the user here: 
[https://github.com/apache/hudi/issues/9355]

 

It looks like we are updating the underlying manifest file while a query is
executing, causing issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6672) BigQuery Sync updates while queries running cause failures

2023-08-09 Thread Timothy Brown (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Brown reassigned HUDI-6672:
---

Assignee: Timothy Brown

> BigQuery Sync updates while queries running cause failures
> --
>
> Key: HUDI-6672
> URL: https://issues.apache.org/jira/browse/HUDI-6672
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>
> Issue was reported by the user here: 
> [https://github.com/apache/hudi/issues/9355]
>  
> It looks like we are updating the underlying manifest file while there is a 
> query executing causing issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9408:
URL: https://github.com/apache/hudi/pull/9408#issuecomment-1671830253

   
   ## CI report:
   
   * 65e9f9828da86e4558b1830493ead64366e69fae UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9407: asyncService log prompt incomplete

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9407:
URL: https://github.com/apache/hudi/pull/9407#issuecomment-1671830173

   
   ## CI report:
   
   * ce0c6dd5877e222dd64ce5ac6434d81168c08727 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19228)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9407: asyncService log prompt incomplete

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9407:
URL: https://github.com/apache/hudi/pull/9407#issuecomment-1671817231

   
   ## CI report:
   
   * ce0c6dd5877e222dd64ce5ac6434d81168c08727 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6671) Support 'alter table add partition' sql

2023-08-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6671:
-
Labels: pull-request-available  (was: )

> Support 'alter table add partition' sql
> ---
>
> Key: HUDI-6671
> URL: https://issues.apache.org/jira/browse/HUDI-6671
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hudi-utilities
>Reporter: Wechar
>Priority: Major
>  Labels: pull-request-available
>
> Hoodie does not support 'add partition' SQL now, so we cannot retrieve partitions
> added by an 'add partition' command.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] wecharyu opened a new pull request, #9408: [HUDI-6671] Support 'alter table add partition' sql

2023-08-09 Thread via GitHub


wecharyu opened a new pull request, #9408:
URL: https://github.com/apache/hudi/pull/9408

   ### Change Logs
   Hoodie does not support 'add partition' SQL now, so we cannot retrieve partitions added by an 'add partition' command.
   In this patch, we implement 'add partition' on the Hoodie side (a usage sketch follows the list below):
   1. add new command `AlterHoodieTableAddPartitionCommand`
   2. add new unit test `TestAlterTableAddPartition`
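
   A minimal sketch of the SQL shape this command enables (the table name and partition spec are hypothetical):

   ```java
   import org.apache.spark.sql.SparkSession;

   public class AddPartitionExample {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .appName("hudi-add-partition")
           .master("local[*]")
           .getOrCreate();

       // Hypothetical Hudi table `hudi_tbl` partitioned by `dt`; with this patch
       // the statement is handled by AlterHoodieTableAddPartitionCommand.
       spark.sql("ALTER TABLE hudi_tbl ADD IF NOT EXISTS PARTITION (dt = '2023-08-09')");
     }
   }
   ```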
   
   
   ### Impact
   No
   
   ### Risk level (write none, low medium or high below)
   
   Low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6671) Support 'alter table add partition' sql

2023-08-09 Thread Wechar (Jira)
Wechar created HUDI-6671:


 Summary: Support 'alter table add partition' sql
 Key: HUDI-6671
 URL: https://issues.apache.org/jira/browse/HUDI-6671
 Project: Apache Hudi
  Issue Type: Bug
  Components: hudi-utilities
Reporter: Wechar


Hoodie does not support 'add partition' SQL now, so we cannot retrieve partitions
added by an 'add partition' command.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] empcl opened a new pull request, #9407: asyncService log prompt incomplete

2023-08-09 Thread via GitHub


empcl opened a new pull request, #9407:
URL: https://github.com/apache/hudi/pull/9407

   ### Change Logs
   
   asyncService log prompt incomplete
   
   ### Impact
   
   asyncService log prompt incomplete
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9395:
URL: https://github.com/apache/hudi/pull/9395#issuecomment-1671580470

   
   ## CI report:
   
   * f20fe8b171dc78a61639c1eabd7c5e5b4bbac201 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19227)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] andreacfm commented on issue #9354: [SUPPORT] HoodieDeltaStreamer fails to load org.apache.spark.sql.execution.datasources.Spark33NestedSchemaPruning

2023-08-09 Thread via GitHub


andreacfm commented on issue #9354:
URL: https://github.com/apache/hudi/issues/9354#issuecomment-1671565665

   @ad1happy2go when trying to compile for spark 3.3 I get this error:
   
   ```
   [ERROR] COMPILATION ERROR :
   [INFO] -
   [ERROR] cannot access 
org.apache.hadoop.shaded.org.apache.avro.reflect.Stringable
 class file for org.apache.hadoop.shaded.org.apache.avro.reflect.Stringable 
not found
   [ERROR] 
/Users/andrea/code/repos/hudi/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:[240,9]
 no suitable method found for 
collect(java.util.stream.Collector,capture#1
 of 
?,java.util.Map>>)
   method 
java.util.stream.Stream.collect(java.util.function.Supplier,java.util.function.BiConsumer,java.util.function.BiConsumer)
 is not applicable
 (cannot infer type-variable(s) R
   (actual and formal argument lists differ in length))
   method java.util.stream.Stream.collect(java.util.stream.Collector) is not 
applicable
 (cannot infer type-variable(s) R,A
   (argument mismatch; 
java.util.stream.Collector,capture#1
 of 
?,java.util.Map>>
 cannot be converted to java.util.stream.Collector))
   ```
   Command is
   ```
   mvn clean package -DskipTests -Dspark3.3
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] stream2000 commented on a diff in pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores

2023-08-09 Thread via GitHub


stream2000 commented on code in PR #9395:
URL: https://github.com/apache/hudi/pull/9395#discussion_r1288537503


##
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/common/HoodieFlinkEngineContext.java:
##
@@ -102,12 +102,12 @@ public RuntimeContext getRuntimeContext() {
 
   @Override
   public  List map(List data, SerializableFunction func, int 
parallelism) {
-return 
data.stream().parallel().map(throwingMapWrapper(func)).collect(Collectors.toList());
+return stream(data, 
parallelism).map(throwingMapWrapper(func)).collect(Collectors.toList());
   }

Review Comment:
   > @stream2000, the parallelism of `stream().parallel()` is only 
`Runtime.getRuntime().availableProcessors()`
   
   So actually a parallelism of `availableProcessors` will cause OOM? If I configure the parallelism just as `Runtime.getRuntime().availableProcessors()`, we still get OOM, right? Correct me if I'm wrong~
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9209: [HUDI-6539] New LSM tree style archived timeline

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9209:
URL: https://github.com/apache/hudi/pull/9209#issuecomment-1671364778

   
   ## CI report:
   
   * 8f2dc4ec3e26f1908ae5d15f194bf70ca7dab27e UNKNOWN
   * 803df61d0d04f7e7403d1177325a365e9bbafab5 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19226)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on a diff in pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores

2023-08-09 Thread via GitHub


SteNicholas commented on code in PR #9395:
URL: https://github.com/apache/hudi/pull/9395#discussion_r1288493882


##
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/common/HoodieFlinkEngineContext.java:
##
@@ -102,12 +102,12 @@ public RuntimeContext getRuntimeContext() {
 
   @Override
   public  List map(List data, SerializableFunction func, int 
parallelism) {
-return 
data.stream().parallel().map(throwingMapWrapper(func)).collect(Collectors.toList());
+return stream(data, 
parallelism).map(throwingMapWrapper(func)).collect(Collectors.toList());
   }

Review Comment:
   @stream2000, the parallelism of `stream().parallel()` is capped at `Runtime.getRuntime().availableProcessors()`; so when the requested parallelism is 1000, using `stream().parallel()` will still cause `OutOfMemoryError`.
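
   For readers following the thread: a default parallel stream always runs on the common ForkJoinPool, which is sized to the CPU core count. Below is a minimal sketch of one way a bounded `stream(data, parallelism)` helper could behave; this is an illustration only, not Hudi's actual implementation, and `BoundedParallelMap` is a hypothetical name.

   ```java
   import java.util.List;
   import java.util.concurrent.ForkJoinPool;
   import java.util.function.Function;
   import java.util.stream.Collectors;

   public final class BoundedParallelMap {
     // Maps `data` with `func` using at most min(parallelism, CPU cores) threads.
     // Running the parallel stream inside a dedicated ForkJoinPool caps the
     // concurrency explicitly instead of relying on the common pool.
     public static <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism) {
       int bounded = Math.min(parallelism, Runtime.getRuntime().availableProcessors());
       if (bounded <= 1) {
         return data.stream().map(func).collect(Collectors.toList());
       }
       ForkJoinPool pool = new ForkJoinPool(bounded);
       try {
         return pool.submit(() ->
             data.parallelStream().map(func).collect(Collectors.toList())).join();
       } finally {
         pool.shutdown();
       }
     }
   }
   ```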



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on a diff in pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores

2023-08-09 Thread via GitHub


SteNicholas commented on code in PR #9395:
URL: https://github.com/apache/hudi/pull/9395#discussion_r1288493882


##
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/common/HoodieFlinkEngineContext.java:
##
@@ -102,12 +102,12 @@ public RuntimeContext getRuntimeContext() {
 
   @Override
   public  List map(List data, SerializableFunction func, int 
parallelism) {
-return 
data.stream().parallel().map(throwingMapWrapper(func)).collect(Collectors.toList());
+return stream(data, 
parallelism).map(throwingMapWrapper(func)).collect(Collectors.toList());
   }

Review Comment:
   @stream2000, the parallelism of `stream().parallel()` is only 
`Runtime.getRuntime().availableProcessors()`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9395:
URL: https://github.com/apache/hudi/pull/9395#issuecomment-1671265476

   
   ## CI report:
   
   * a60f7f89b5377119bf8bef6c7ddfd0dc821de1fc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19213)
 
   * f20fe8b171dc78a61639c1eabd7c5e5b4bbac201 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19227)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9395:
URL: https://github.com/apache/hudi/pull/9395#issuecomment-1671200936

   
   ## CI report:
   
   * a60f7f89b5377119bf8bef6c7ddfd0dc821de1fc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19213)
 
   * f20fe8b171dc78a61639c1eabd7c5e5b4bbac201 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] stream2000 commented on a diff in pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores

2023-08-09 Thread via GitHub


stream2000 commented on code in PR #9395:
URL: https://github.com/apache/hudi/pull/9395#discussion_r1288371818


##
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/common/HoodieFlinkEngineContext.java:
##
@@ -102,12 +102,12 @@ public RuntimeContext getRuntimeContext() {
 
   @Override
   public  List map(List data, SerializableFunction func, int 
parallelism) {
-return 
data.stream().parallel().map(throwingMapWrapper(func)).collect(Collectors.toList());
+return stream(data, 
parallelism).map(throwingMapWrapper(func)).collect(Collectors.toList());
   }

Review Comment:
   When the parallelism is 1000, will we run it with parallelism 1000 or with `Runtime.getRuntime().availableProcessors()`? If we just use `Runtime.getRuntime().availableProcessors()` as the parallelism, will it cause `OutOfMemoryError`, since that is not actually a large parallelism?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9403: Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9403:
URL: https://github.com/apache/hudi/pull/9403#issuecomment-1671187660

   
   ## CI report:
   
   * 55da0942b542c664e49c7ab9ca9698dfbf67968e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19224)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores

2023-08-09 Thread via GitHub


SteNicholas commented on PR #9395:
URL: https://github.com/apache/hudi/pull/9395#issuecomment-1671137882

   @stream2000, thanks for the fix. I have rebased the latest master branch. cc @danny0405.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on a diff in pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores

2023-08-09 Thread via GitHub


SteNicholas commented on code in PR #9395:
URL: https://github.com/apache/hudi/pull/9395#discussion_r1288326850


##
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/common/HoodieFlinkEngineContext.java:
##
@@ -102,12 +102,12 @@ public RuntimeContext getRuntimeContext() {
 
   @Override
   public  List map(List data, SerializableFunction func, int 
parallelism) {
-return 
data.stream().parallel().map(throwingMapWrapper(func)).collect(Collectors.toList());
+return stream(data, 
parallelism).map(throwingMapWrapper(func)).collect(Collectors.toList());
   }

Review Comment:
   @stream2000, I don't think this applies only to running the map function sequentially. To reduce the OOM risk caused by parallelization, all functions should be handled in this way. For example, when the parallelism is 1000, `CleanPlanActionExecutor` uses the map function and may cause `OutOfMemoryError` for many file groups in `stream().parallel()`, with a stack trace like the one in the description above. BTW, the parallel stream only improves performance a little, so this change doesn't give up much performance improvement.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9209: [HUDI-6539] New LSM tree style archived timeline

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9209:
URL: https://github.com/apache/hudi/pull/9209#issuecomment-1671129502

   
   ## CI report:
   
   * 8f2dc4ec3e26f1908ae5d15f194bf70ca7dab27e UNKNOWN
   * 57c1b843608a9b63d143ead5dd5168613bb13969 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19027)
 
   * 803df61d0d04f7e7403d1177325a365e9bbafab5 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19226)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9209: [HUDI-6539] New LSM tree style archived timeline

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9209:
URL: https://github.com/apache/hudi/pull/9209#issuecomment-1671117746

   
   ## CI report:
   
   * 8f2dc4ec3e26f1908ae5d15f194bf70ca7dab27e UNKNOWN
   * 57c1b843608a9b63d143ead5dd5168613bb13969 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19027)
 
   * 803df61d0d04f7e7403d1177325a365e9bbafab5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9405: [HUDI-6670] Fix timeline check in metadata table validator

2023-08-09 Thread via GitHub


hudi-bot commented on PR #9405:
URL: https://github.com/apache/hudi/pull/9405#issuecomment-1671103541

   
   ## CI report:
   
   * fc027c28476d50737566c3b714a4d58c38c39ff9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19222)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9209: [HUDI-6539] New LSM tree style archived timeline

2023-08-09 Thread via GitHub


danny0405 commented on code in PR #9209:
URL: https://github.com/apache/hudi/pull/9209#discussion_r1288283343


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java:
##
@@ -18,75 +18,127 @@
 
 package org.apache.hudi.common.table.timeline;
 
-import org.apache.hudi.avro.HoodieAvroUtils;
-import org.apache.hudi.avro.model.HoodieArchivedMetaEntry;
-import org.apache.hudi.avro.model.HoodieMergeArchiveFilePlan;
-import org.apache.hudi.common.fs.HoodieWrapperFileSystem;
-import org.apache.hudi.common.model.HoodieLogFile;
-import org.apache.hudi.common.model.HoodiePartitionMetadata;
-import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.avro.model.HoodieArchivedInstant;
+import org.apache.hudi.common.model.HoodieArchivedManifest;
 import org.apache.hudi.common.model.HoodieRecord.HoodieRecordType;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
-import org.apache.hudi.common.table.log.HoodieLogFormat;
-import org.apache.hudi.common.table.log.block.HoodieAvroDataBlock;
-import org.apache.hudi.common.table.log.block.HoodieLogBlock;
-import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.ArchivedInstantReadSchemas;
 import org.apache.hudi.common.util.CollectionUtils;
 import org.apache.hudi.common.util.FileIOUtils;
 import org.apache.hudi.common.util.Option;
-import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.io.storage.HoodieAvroParquetReader;
+import org.apache.hudi.io.storage.HoodieFileReaderFactory;
 
+import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericRecord;
 import org.apache.avro.generic.IndexedRecord;
 import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.PathFilter;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
-import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
 
 import java.io.IOException;
 import java.io.Serializable;
+import java.nio.ByteBuffer;
 import java.nio.charset.StandardCharsets;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Collections;
-import java.util.Comparator;
-import java.util.HashMap;
-import java.util.HashSet;
 import java.util.List;
 import java.util.Map;
 import java.util.Set;
-import java.util.Spliterator;
-import java.util.Spliterators;
+import java.util.concurrent.ConcurrentHashMap;
 import java.util.function.Function;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
-import java.util.stream.StreamSupport;
+import java.util.stream.Collectors;
 
 /**
- * Represents the Archived Timeline for the Hoodie table. Instants for the 
last 12 hours (configurable) is in the
- * ActiveTimeline and the rest are in ArchivedTimeline.
- * 
- * 
- * Instants are read from the archive file during initialization and never 
refreshed. To refresh, clients need to call
- * reload()
- * 
- * 
- * This class can be serialized and de-serialized and on de-serialization the 
FileSystem is re-initialized.
+ * Represents the Archived Timeline for the Hoodie table.
+ *
+ * <p>After several instants are accumulated as a batch on the active timeline, they are archived as a parquet file into the archived timeline.
+ * In general the archived timeline is composed of parquet files in an LSM-style file layout. Each new operation on the archived timeline generates
+ * a new snapshot version. Theoretically, there could be multiple snapshot versions on the archived timeline.
+ *
+ * <p>The Archived Timeline Layout
+ *
+ * <pre>
+ *   t111, t112 ... t120 ... ->
+ *     \               /
+ *      \             /
+ *            |
+ *            V
+ *   t111_t120_0.parquet, t101_t110_0.parquet, ... t11_t20_0.parquet      L0
+ *         \                                          /
+ *          \                                        /
+ *                          |
+ *                          V
+ *                  t11_t100_1.parquet                                    L1
+ *
+ *   manifest_1, manifest_2, ... manifest_12
+ *       |
+ *       V
+ *   _version_
+ * </pre>
+ *
+ * <p>The LSM Tree Compaction
+ * Uses the universal compaction strategy: when N (by default 10) parquet files exist in the current layer, they are merged and flushed
+ * as a compacted file into the next layer. There is no limit on the number of layers; assuming 10 instants per file in L0, there could be
+ * 100 instants per file in L1, so 3000 instants could be represented as 3 parquet files in L2, which is pretty fast with concurrent reads.
+ *
+ * The benchmark shows that reading 1000 instants costs about 10 ms.

Review Comment:
   done
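
   For readers skimming the thread, a minimal sketch of the universal compaction trigger described in the javadoc quoted above; the class and helper names are made up for the illustration and do not mirror Hudi's actual writer.

   ```java
   import java.util.ArrayList;
   import java.util.List;

   final class LsmCompactionSketch {
     static final int MERGE_THRESHOLD = 10; // N parquet files per layer

     /**
      * layers.get(i) holds the parquet file names of layer i, oldest first,
      * named minInstant_maxInstant_layer.parquet as in the javadoc above.
      */
     static void maybeCompact(List<List<String>> layers, int layer) {
       List<String> current = layers.get(layer);
       if (current.size() < MERGE_THRESHOLD) {
         return; // not enough files in this layer yet
       }
       if (layers.size() == layer + 1) {
         layers.add(new ArrayList<>()); // no limit on the number of layers
       }
       // Merge the N files of this layer into one compacted file in layer + 1.
       String minInstant = current.get(0).split("_")[0];
       String maxInstant = current.get(current.size() - 1).split("_")[1];
       layers.get(layer + 1).add(minInstant + "_" + maxInstant + "_" + (layer + 1) + ".parquet");
       current.clear();
       maybeCompact(layers, layer + 1); // compaction may cascade upward
     }
   }
   ```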



-- 
This is 

[GitHub] [hudi] stream2000 commented on a diff in pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores

2023-08-09 Thread via GitHub


stream2000 commented on code in PR #9395:
URL: https://github.com/apache/hudi/pull/9395#discussion_r1288224103


##
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/common/HoodieFlinkEngineContext.java:
##
@@ -102,12 +102,12 @@ public RuntimeContext getRuntimeContext() {
 
   @Override
   public  List map(List data, SerializableFunction func, int 
parallelism) {
-return 
data.stream().parallel().map(throwingMapWrapper(func)).collect(Collectors.toList());
+return stream(data, 
parallelism).map(throwingMapWrapper(func)).collect(Collectors.toList());
   }

Review Comment:
   Hi @SteNicholas, correct me if I'm wrong. Do we just run the map function sequentially when the parallelism is larger than the default parallelism? And why is there an `OutOfMemoryError` risk when we use `stream().parallel()`, which submits future tasks to the default thread pool and should not cost a lot of memory?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] stream2000 commented on pull request #9395: [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores

2023-08-09 Thread via GitHub


stream2000 commented on PR #9395:
URL: https://github.com/apache/hudi/pull/9395#issuecomment-1670984818

   @SteNicholas Hi, sorry for the failing CI I introduced. Now we can rebase the latest master and test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] leesf merged pull request #9401: [MINOR] Fix consistent hashing bucket index it failure

2023-08-09 Thread via GitHub


leesf merged PR #9401:
URL: https://github.com/apache/hudi/pull/9401


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [MINOR] Fix consistent hashing bucket index FT failure (#9401)

2023-08-09 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 9b22583dbe0 [MINOR] Fix consistent hashing bucket index FT failure 
(#9401)
9b22583dbe0 is described below

commit 9b22583dbe089df1c0014ee88f250a3e516667ce
Author: StreamingFlames <18889897...@163.com>
AuthorDate: Wed Aug 9 17:26:57 2023 +0800

[MINOR] Fix consistent hashing bucket index FT failure (#9401)
---
 .../org/apache/hudi/client/functional/TestConsistentBucketIndex.java  | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestConsistentBucketIndex.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestConsistentBucketIndex.java
index 01b05f07642..b23259c1264 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestConsistentBucketIndex.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestConsistentBucketIndex.java
@@ -228,8 +228,8 @@ public class TestConsistentBucketIndex extends 
HoodieSparkClientTestHarness {
 Assertions.assertEquals(numFilesCreated,
 Arrays.stream(dataGen.getPartitionPaths()).mapToInt(p -> 
Objects.requireNonNull(listStatus(p, true)).length).sum());
 
-// BulkInsert again.
-writeData(writeRecords, "002", WriteOperationType.BULK_INSERT,true);
+// Upsert Data
+writeData(writeRecords, "002", WriteOperationType.UPSERT,true);
 // The total number of file group should be the same, but each file group 
will have a log file.
 Assertions.assertEquals(numFilesCreated,
 Arrays.stream(dataGen.getPartitionPaths()).mapToInt(p -> 
Objects.requireNonNull(listStatus(p, true)).length).sum());



[GitHub] [hudi] leesf commented on pull request #9401: [MINOR] Fix consistent hashing bucket index it failure

2023-08-09 Thread via GitHub


leesf commented on PR #9401:
URL: https://github.com/apache/hudi/pull/9401#issuecomment-1670980192

   +1 as the FT spark-client passed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9209: [HUDI-6539] New LSM tree style archived timeline

2023-08-09 Thread via GitHub


danny0405 commented on code in PR #9209:
URL: https://github.com/apache/hudi/pull/9209#discussion_r1288178605


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/ArchivedTimelineWriter.java:
##
@@ -0,0 +1,382 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.client.utils;
+
+import org.apache.hudi.avro.model.HoodieArchivedInstant;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieArchivedManifest;
+import org.apache.hudi.common.model.HoodieAvroIndexedRecord;
+import org.apache.hudi.common.model.HoodieFileFormat;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieCommitException;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.io.storage.HoodieAvroParquetReader;
+import org.apache.hudi.io.storage.HoodieFileReaderFactory;
+import org.apache.hudi.io.storage.HoodieFileWriter;
+import org.apache.hudi.io.storage.HoodieFileWriterFactory;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.marker.WriteMarkers;
+import org.apache.hudi.table.marker.WriteMarkersFactory;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * An archived timeline writer which organizes the files as an LSM tree.
+ */
+public class ArchivedTimelineWriter {
+  private static final Logger LOG = 
LoggerFactory.getLogger(ArchivedTimelineWriter.class);
+
+  private final HoodieWriteConfig config;
+  private final HoodieTable table;
+  private final HoodieTableMetaClient metaClient;
+
+  private HoodieWriteConfig writeConfig;
+
+  private ArchivedTimelineWriter(HoodieWriteConfig config, HoodieTable table) {
+this.config = config;
+this.table = table;
+this.metaClient = table.getMetaClient();
+  }
+
+  public static ArchivedTimelineWriter getInstance(HoodieWriteConfig config, 
HoodieTable table) {
+return new ArchivedTimelineWriter(config, table);
+  }
+
+  public void write(HoodieEngineContext context, List<ActiveInstant> instants) throws HoodieCommitException {
+Path filePath = new Path(metaClient.getArchivePath(),
+newFileName(instants.get(0).getInstantTime(), 
instants.get(instants.size() - 1).getInstantTime(), 
HoodieArchivedTimeline.FILE_LAYER_ZERO));
+try (HoodieFileWriter writer = openWriter(filePath)) {
+  Schema wrapperSchema = HoodieArchivedInstant.getClassSchema();
+  LOG.info("Archiving schema " + wrapperSchema.toString());
+  for (ActiveInstant triple : instants) {
+try {
+  deleteAnyLeftOverMarkers(context, triple);
+  // in local FS and HDFS, there could be empty completed instants due 
to crash.
+  final HoodieArchivedInstant metaEntry = 
MetadataConversionUtils.createArchivedInstant(triple, metaClient);
+  writer.write(metaEntry.getInstantTime(), new 
HoodieAvroIndexedRecord(metaEntry), wrapperSchema);
+} catch (Exception e) {
+  LOG.error("Failed to archive instant: " + triple.getInstantTime(), 
e);
+  if (this.config.isFailOnTimelineArchivingEnabled()) {
+throw e;
+  }
+}
+  }
+  updateManifest(filePath.getName());
+} catch (Exception e) {
+  throw new HoodieCommitException("Failed to archive commits", e);
+}
+  }
+
+  public void updateManifest(String 
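
For orientation, a hedged sketch of how this writer is driven; ArchivedTimelineWriter, HoodieWriteConfig, HoodieTable, and HoodieEngineContext come from the excerpt above, while the caller shown here and the origin of the ActiveInstant list are assumptions, not the PR's actual call site:

import java.util.List;

import org.apache.hudi.common.engine.HoodieEngineContext;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.hudi.table.HoodieTable;

class ArchiverSketch {
  // instantsToArchive: completed instants selected for removal from the
  // active timeline, already wrapped as ActiveInstant triples.
  static void archive(HoodieWriteConfig config, HoodieTable table,
                      HoodieEngineContext context, List<ActiveInstant> instantsToArchive) {
    ArchivedTimelineWriter writer = ArchivedTimelineWriter.getInstance(config, table);
    // write() flushes one layer-zero parquet named after the first and last
    // instant times, then registers it via updateManifest() (see the excerpt).
    writer.write(context, instantsToArchive);
  }
}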

[jira] [Comment Edited] (HUDI-3425) Clean up spill path created by Hudi during uneventful shutdown

2023-08-09 Thread Xinglong Wang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752332#comment-17752332
 ] 

Xinglong Wang edited comment on HUDI-3425 at 8/9/23 8:57 AM:
-

{{I have encountered the same problem. I am using Flink on YARN. When the job 
executes compaction but hits an abnormal situation (for example, the container 
running beyond physical memory limits) and performs a full restart, 
`HoodieMergedLogRecordScanner` may still be scanning log files at that moment, 
so `ExternalSpillableMap#close()` is never executed to clean up. Spillable map 
files then accumulate in the /tmp directory until the disk is exhausted.}}
{{Now I set `hoodie.memory.spillable.map.path` to the `$PWD/spillable-map/` 
directory when the YARN container launches (the environment variable `PWD` is 
exported in `launch_container.sh`), so the spillable map files are cleaned up 
when the container is closed.}}


was (Author: JIRAUSER295509):
I have encountered the same problem. I am using Flink on Yarn. When the job 
executes compaction but encounters an abnormal situation (for example, 
container is running beyond physical memory limits or other exceptions) and 
performs a full-restart, if `HoodieMergedLogRecordScanner` is still scanning 
log files at this time, and `ExternalSpillableMap#close()` is not executed to 
clean up, resulting in the accumulation of spillable map files in the /tmp 
directory, and eventually the disk is exhausted.
Now I set `hoodie.memory.spillable.map.path` to the `$PWD/spillable-map/` 
directory when Yarn container launches, environment variable `PWD` is exported 
in `launch_container.sh`, so that the spillable map files will be cleaned up 
when the container is closed.
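
A minimal sketch of that workaround, assuming the job assembles its Hudi options in Java; the class and method names here are illustrative, and only the `hoodie.memory.spillable.map.path` key comes from the comment above:

import java.util.HashMap;
import java.util.Map;

public final class SpillPathOptions {
  // Route Hudi's spillable map files into the YARN container's working
  // directory; YARN deletes that directory when the container exits, so a
  // crashed compaction can no longer leak spill files into /tmp.
  public static Map<String, String> withContainerLocalSpillPath(Map<String, String> options) {
    Map<String, String> result = new HashMap<>(options);
    // PWD is exported by launch_container.sh inside a YARN container;
    // keep Hudi's default spill location when it is absent.
    String workDir = System.getenv("PWD");
    if (workDir != null) {
      result.put("hoodie.memory.spillable.map.path", workDir + "/spillable-map/");
    }
    return result;
  }
}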

> Clean up spill path created by Hudi during uneventful shutdown
> --
>
> Key: HUDI-3425
> URL: https://issues.apache.org/jira/browse/HUDI-3425
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.12.1
>
>
> h1. Hudi spill path not getting cleared when containers getting killed 
> abruptly. 
>  
> When yarn kills the containers abruptly for any reason while hudi stage is in 
> progress then the spill path created by hudi on the disk is not cleaned and 
> as a result of which the nodes on the cluster start running out of space. We 
> need to clear the spill path manually to free out disk space.
>  
> Ref issue: https://github.com/apache/hudi/issues/4771



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] aib628 commented on issue #8848: [SUPPORT] Hive Sync tool fails to sync Hoodi table written using Flink 1.16 to HMS

2023-08-09 Thread via GitHub


aib628 commented on issue #8848:
URL: https://github.com/apache/hudi/issues/8848#issuecomment-1670938476

   @danny0405 
   Yeah, I'm using Hadoop 3.1.0 + Hive 3.1.2 built from source, and deploying 
it with the Docker image 'apachehudi/hudi-hadoop_3.1.0-hive_3.1.2:latest'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9209: [HUDI-6539] New LSM tree style archived timeline

2023-08-09 Thread via GitHub


danny0405 commented on code in PR #9209:
URL: https://github.com/apache/hudi/pull/9209#discussion_r1288171581


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/ArchivedTimelineWriter.java:
##
@@ -0,0 +1,382 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.client.utils;
+
+import org.apache.hudi.avro.model.HoodieArchivedInstant;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieArchivedManifest;
+import org.apache.hudi.common.model.HoodieAvroIndexedRecord;
+import org.apache.hudi.common.model.HoodieFileFormat;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieCommitException;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.io.storage.HoodieAvroParquetReader;
+import org.apache.hudi.io.storage.HoodieFileReaderFactory;
+import org.apache.hudi.io.storage.HoodieFileWriter;
+import org.apache.hudi.io.storage.HoodieFileWriterFactory;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.marker.WriteMarkers;
+import org.apache.hudi.table.marker.WriteMarkersFactory;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * An archived timeline writer which organizes the files as an LSM tree.
+ */
+public class ArchivedTimelineWriter {
+  private static final Logger LOG = 
LoggerFactory.getLogger(ArchivedTimelineWriter.class);
+
+  private final HoodieWriteConfig config;
+  private final HoodieTable table;
+  private final HoodieTableMetaClient metaClient;
+
+  private HoodieWriteConfig writeConfig;
+
+  private ArchivedTimelineWriter(HoodieWriteConfig config, HoodieTable table) {
+this.config = config;
+this.table = table;
+this.metaClient = table.getMetaClient();
+  }
+
+  public static ArchivedTimelineWriter getInstance(HoodieWriteConfig config, 
HoodieTable table) {
+return new ArchivedTimelineWriter(config, table);
+  }
+
+  public void write(HoodieEngineContext context, List<ActiveInstant> instants) throws HoodieCommitException {
+Path filePath = new Path(metaClient.getArchivePath(),
+newFileName(instants.get(0).getInstantTime(), 
instants.get(instants.size() - 1).getInstantTime(), 
HoodieArchivedTimeline.FILE_LAYER_ZERO));
+try (HoodieFileWriter writer = openWriter(filePath)) {
+  Schema wrapperSchema = HoodieArchivedInstant.getClassSchema();
+  LOG.info("Archiving schema " + wrapperSchema.toString());
+  for (ActiveInstant triple : instants) {
+try {
+  deleteAnyLeftOverMarkers(context, triple);
+  // in local FS and HDFS, there could be empty completed instants due 
to crash.
+  final HoodieArchivedInstant metaEntry = 
MetadataConversionUtils.createArchivedInstant(triple, metaClient);
+  writer.write(metaEntry.getInstantTime(), new 
HoodieAvroIndexedRecord(metaEntry), wrapperSchema);
+} catch (Exception e) {
+  LOG.error("Failed to archive instant: " + triple.getInstantTime(), 
e);
+  if (this.config.isFailOnTimelineArchivingEnabled()) {
+throw e;
+  }
+}
+  }
+  updateManifest(filePath.getName());
+} catch (Exception e) {
+  throw new HoodieCommitException("Failed to archive commits", e);
+}
+  }
+
+  public void updateManifest(String 

[GitHub] [hudi] danny0405 commented on a diff in pull request #9209: [HUDI-6539] New LSM tree style archived timeline

2023-08-09 Thread via GitHub


danny0405 commented on code in PR #9209:
URL: https://github.com/apache/hudi/pull/9209#discussion_r1288170683


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/ArchivedTimelineWriter.java:
##
@@ -0,0 +1,382 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.client.utils;
+
+import org.apache.hudi.avro.model.HoodieArchivedInstant;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieArchivedManifest;
+import org.apache.hudi.common.model.HoodieAvroIndexedRecord;
+import org.apache.hudi.common.model.HoodieFileFormat;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieCommitException;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.io.storage.HoodieAvroParquetReader;
+import org.apache.hudi.io.storage.HoodieFileReaderFactory;
+import org.apache.hudi.io.storage.HoodieFileWriter;
+import org.apache.hudi.io.storage.HoodieFileWriterFactory;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.marker.WriteMarkers;
+import org.apache.hudi.table.marker.WriteMarkersFactory;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * An archived timeline writer which organizes the files as an LSM tree.
+ */
+public class ArchivedTimelineWriter {
+  private static final Logger LOG = 
LoggerFactory.getLogger(ArchivedTimelineWriter.class);
+
+  private final HoodieWriteConfig config;
+  private final HoodieTable table;
+  private final HoodieTableMetaClient metaClient;
+
+  private HoodieWriteConfig writeConfig;
+
+  private ArchivedTimelineWriter(HoodieWriteConfig config, HoodieTable table) {
+this.config = config;
+this.table = table;
+this.metaClient = table.getMetaClient();
+  }
+
+  public static ArchivedTimelineWriter getInstance(HoodieWriteConfig config, 
HoodieTable table) {
+return new ArchivedTimelineWriter(config, table);
+  }
+
+  public void write(HoodieEngineContext context, List<ActiveInstant> instants) throws HoodieCommitException {
+Path filePath = new Path(metaClient.getArchivePath(),
+newFileName(instants.get(0).getInstantTime(), 
instants.get(instants.size() - 1).getInstantTime(), 
HoodieArchivedTimeline.FILE_LAYER_ZERO));
+try (HoodieFileWriter writer = openWriter(filePath)) {
+  Schema wrapperSchema = HoodieArchivedInstant.getClassSchema();
+  LOG.info("Archiving schema " + wrapperSchema.toString());
+  for (ActiveInstant triple : instants) {
+try {
+  deleteAnyLeftOverMarkers(context, triple);
+  // in local FS and HDFS, there could be empty completed instants due 
to crash.
+  final HoodieArchivedInstant metaEntry = 
MetadataConversionUtils.createArchivedInstant(triple, metaClient);
+  writer.write(metaEntry.getInstantTime(), new 
HoodieAvroIndexedRecord(metaEntry), wrapperSchema);
+} catch (Exception e) {
+  LOG.error("Failed to archive instant: " + triple.getInstantTime(), 
e);
+  if (this.config.isFailOnTimelineArchivingEnabled()) {
+throw e;
+  }
+}
+  }
+  updateManifest(filePath.getName());
+} catch (Exception e) {
+  throw new HoodieCommitException("Failed to archive commits", e);
+}
+  }
+
+  public void updateManifest(String 

[jira] [Commented] (HUDI-3425) Clean up spill path created by Hudi during uneventful shutdown

2023-08-09 Thread Xinglong Wang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752332#comment-17752332
 ] 

Xinglong Wang commented on HUDI-3425:
-

I have encountered the same problem. I am using Flink on YARN. When the job 
executes compaction but hits an abnormal situation (for example, the container 
running beyond physical memory limits) and performs a full restart, 
`HoodieMergedLogRecordScanner` may still be scanning log files at that moment, 
so `ExternalSpillableMap#close()` is never executed to clean up. Spillable map 
files then accumulate in the /tmp directory until the disk is exhausted.
Now I set `hoodie.memory.spillable.map.path` to the `$PWD/spillable-map/` 
directory when the YARN container launches (the environment variable `PWD` is 
exported in `launch_container.sh`), so the spillable map files are cleaned up 
when the container is closed.

> Clean up spill path created by Hudi during uneventful shutdown
> --
>
> Key: HUDI-3425
> URL: https://issues.apache.org/jira/browse/HUDI-3425
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.12.1
>
>
> h1. Hudi spill path not getting cleared when containers getting killed 
> abruptly. 
>  
> When yarn kills the containers abruptly for any reason while hudi stage is in 
> progress then the spill path created by hudi on the disk is not cleaned and 
> as a result of which the nodes on the cluster start running out of space. We 
> need to clear the spill path manually to free out disk space.
>  
> Ref issue: https://github.com/apache/hudi/issues/4771



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 commented on a diff in pull request #9209: [HUDI-6539] New LSM tree style archived timeline

2023-08-09 Thread via GitHub


danny0405 commented on code in PR #9209:
URL: https://github.com/apache/hudi/pull/9209#discussion_r1288167338


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/ArchivedTimelineWriter.java:
##
@@ -0,0 +1,382 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.client.utils;
+
+import org.apache.hudi.avro.model.HoodieArchivedInstant;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieArchivedManifest;
+import org.apache.hudi.common.model.HoodieAvroIndexedRecord;
+import org.apache.hudi.common.model.HoodieFileFormat;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieCommitException;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.io.storage.HoodieAvroParquetReader;
+import org.apache.hudi.io.storage.HoodieFileReaderFactory;
+import org.apache.hudi.io.storage.HoodieFileWriter;
+import org.apache.hudi.io.storage.HoodieFileWriterFactory;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.marker.WriteMarkers;
+import org.apache.hudi.table.marker.WriteMarkersFactory;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * An archived timeline writer which organizes the files as an LSM tree.
+ */
+public class ArchivedTimelineWriter {
+  private static final Logger LOG = 
LoggerFactory.getLogger(ArchivedTimelineWriter.class);
+
+  private final HoodieWriteConfig config;
+  private final HoodieTable table;
+  private final HoodieTableMetaClient metaClient;
+
+  private HoodieWriteConfig writeConfig;
+
+  private ArchivedTimelineWriter(HoodieWriteConfig config, HoodieTable table) {
+this.config = config;
+this.table = table;
+this.metaClient = table.getMetaClient();
+  }
+
+  public static ArchivedTimelineWriter getInstance(HoodieWriteConfig config, 
HoodieTable table) {
+return new ArchivedTimelineWriter(config, table);
+  }
+
+  public void write(HoodieEngineContext context, List<ActiveInstant> instants) throws HoodieCommitException {
+Path filePath = new Path(metaClient.getArchivePath(),
+newFileName(instants.get(0).getInstantTime(), 
instants.get(instants.size() - 1).getInstantTime(), 
HoodieArchivedTimeline.FILE_LAYER_ZERO));
+try (HoodieFileWriter writer = openWriter(filePath)) {
+  Schema wrapperSchema = HoodieArchivedInstant.getClassSchema();
+  LOG.info("Archiving schema " + wrapperSchema.toString());
+  for (ActiveInstant triple : instants) {
+try {
+  deleteAnyLeftOverMarkers(context, triple);
+  // in local FS and HDFS, there could be empty completed instants due 
to crash.
+  final HoodieArchivedInstant metaEntry = 
MetadataConversionUtils.createArchivedInstant(triple, metaClient);
+  writer.write(metaEntry.getInstantTime(), new 
HoodieAvroIndexedRecord(metaEntry), wrapperSchema);
+} catch (Exception e) {
+  LOG.error("Failed to archive instant: " + triple.getInstantTime(), 
e);
+  if (this.config.isFailOnTimelineArchivingEnabled()) {
+throw e;
+  }
+}
+  }
+  updateManifest(filePath.getName());
+} catch (Exception e) {
+  throw new HoodieCommitException("Failed to archive commits", e);
+}
+  }
+
+  public void updateManifest(String 

[GitHub] [hudi] danny0405 commented on a diff in pull request #9209: [HUDI-6539] New LSM tree style archived timeline

2023-08-09 Thread via GitHub


danny0405 commented on code in PR #9209:
URL: https://github.com/apache/hudi/pull/9209#discussion_r1288155900


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/ArchivedTimelineWriter.java:
##
@@ -0,0 +1,382 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.client.utils;
+
+import org.apache.hudi.avro.model.HoodieArchivedInstant;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieArchivedManifest;
+import org.apache.hudi.common.model.HoodieAvroIndexedRecord;
+import org.apache.hudi.common.model.HoodieFileFormat;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieCommitException;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.io.storage.HoodieAvroParquetReader;
+import org.apache.hudi.io.storage.HoodieFileReaderFactory;
+import org.apache.hudi.io.storage.HoodieFileWriter;
+import org.apache.hudi.io.storage.HoodieFileWriterFactory;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.marker.WriteMarkers;
+import org.apache.hudi.table.marker.WriteMarkersFactory;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * An archived timeline writer which organizes the files as an LSM tree.
+ */
+public class ArchivedTimelineWriter {
+  private static final Logger LOG = 
LoggerFactory.getLogger(ArchivedTimelineWriter.class);
+
+  private final HoodieWriteConfig config;
+  private final HoodieTable table;
+  private final HoodieTableMetaClient metaClient;
+
+  private HoodieWriteConfig writeConfig;
+
+  private ArchivedTimelineWriter(HoodieWriteConfig config, HoodieTable table) {
+this.config = config;
+this.table = table;
+this.metaClient = table.getMetaClient();
+  }
+
+  public static ArchivedTimelineWriter getInstance(HoodieWriteConfig config, 
HoodieTable table) {
+return new ArchivedTimelineWriter(config, table);
+  }
+
+  public void write(HoodieEngineContext context, List<ActiveInstant> instants) throws HoodieCommitException {
+Path filePath = new Path(metaClient.getArchivePath(),
+newFileName(instants.get(0).getInstantTime(), 
instants.get(instants.size() - 1).getInstantTime(), 
HoodieArchivedTimeline.FILE_LAYER_ZERO));
+try (HoodieFileWriter writer = openWriter(filePath)) {
+  Schema wrapperSchema = HoodieArchivedInstant.getClassSchema();
+  LOG.info("Archiving schema " + wrapperSchema.toString());
+  for (ActiveInstant triple : instants) {
+try {
+  deleteAnyLeftOverMarkers(context, triple);
+  // in local FS and HDFS, there could be empty completed instants due 
to crash.
+  final HoodieArchivedInstant metaEntry = 
MetadataConversionUtils.createArchivedInstant(triple, metaClient);
+  writer.write(metaEntry.getInstantTime(), new 
HoodieAvroIndexedRecord(metaEntry), wrapperSchema);
+} catch (Exception e) {
+  LOG.error("Failed to archive instant: " + triple.getInstantTime(), 
e);
+  if (this.config.isFailOnTimelineArchivingEnabled()) {
+throw e;
+  }
+}
+  }
+  updateManifest(filePath.getName());
+} catch (Exception e) {
+  throw new HoodieCommitException("Failed to archive commits", e);
+}
+  }
+
+  public void updateManifest(String 

[GitHub] [hudi] bhasudha commented on pull request #9406: [DOCS] Update Metadata table and metadata indexing related pages

2023-08-09 Thread via GitHub


bhasudha commented on PR #9406:
URL: https://github.com/apache/hudi/pull/9406#issuecomment-1670918525

   @codope  Just FYI: this PR can be reviewed now, but it must be merged after 
[PR #9372](https://github.com/apache/hudi/pull/9372) is merged, since it 
depends on page links introduced there. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


