[GitHub] [hudi] bvaradar commented on issue #2346: [SUPPORT]The rt view query returns a wrong result with predicate push down.
bvaradar commented on issue #2346: URL: https://github.com/apache/hudi/issues/2346#issuecomment-768102437 Closing this GH as we have a jira to track it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] bvaradar closed issue #2346: [SUPPORT]The rt view query returns a wrong result with predicate push down.
bvaradar closed issue #2346: URL: https://github.com/apache/hudi/issues/2346
[GitHub] [hudi] zhedoubushishi commented on pull request #2485: [HUDI-1109] Support Spark Structured Streaming read from Hudi table
zhedoubushishi commented on pull request #2485: URL: https://github.com/apache/hudi/pull/2485#issuecomment-768083458 Can you check if this change is compatible with Spark 3.0.0?
[GitHub] [hudi] codecov-io commented on pull request #2495: [HUDI-1553] Configuration and metrics for the TimelineService.
codecov-io commented on pull request #2495: URL: https://github.com/apache/hudi/pull/2495#issuecomment-768052489

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2495?src=pr=h1) Report

> Merging [#2495](https://codecov.io/gh/apache/hudi/pull/2495?src=pr=desc) (3397ed3) into [master](https://codecov.io/gh/apache/hudi/commit/c8ee40f8ae34607072a27d4e7ccb21fc4df13ca1?el=desc) (c8ee40f) will **increase** coverage by `11.10%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2495/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2495?src=pr=tree)

```diff
@@             Coverage Diff              @@
##           master    #2495       +/-   ##
===========================================
+ Coverage   50.18%   61.29%   +11.10%
+ Complexity   3051      318     -2733
===========================================
  Files         419       53      -366
  Lines       18931     1930    -17001
  Branches     1948      230     -1718
===========================================
- Hits         9501     1183     -8318
+ Misses       8656      623     -8033
+ Partials      774      124      -650
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `61.29% <ø> (-8.19%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2495?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...ies/exception/HoodieSnapshotExporterException.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2V4Y2VwdGlvbi9Ib29kaWVTbmFwc2hvdEV4cG9ydGVyRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==) | `5.17% <0.00%> (-83.63%)` | `0.00% <0.00%> (-28.00%)` | |
| [...hudi/utilities/schema/JdbcbasedSchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9KZGJjYmFzZWRTY2hlbWFQcm92aWRlci5qYXZh) | `0.00% <0.00%> (-72.23%)` | `0.00% <0.00%> (-2.00%)` | |
| [...he/hudi/utilities/transform/AWSDmsTransformer.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3RyYW5zZm9ybS9BV1NEbXNUcmFuc2Zvcm1lci5qYXZh) | `0.00% <0.00%> (-66.67%)` | `0.00% <0.00%> (-2.00%)` | |
| [...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=) | `40.46% <0.00%> (-23.70%)` | `27.00% <0.00%> (-6.00%)` | |
| [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `70.50% <0.00%> (-0.36%)` | `50.00% <0.00%> (-1.00%)` | |
| [.../hadoop/utils/HoodieRealtimeRecordReaderUtils.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3V0aWxzL0hvb2RpZVJlYWx0aW1lUmVjb3JkUmVhZGVyVXRpbHMuamF2YQ==) | | | |
| [.../hudi/hadoop/realtime/HoodieRealtimeFileSplit.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL0hvb2RpZVJlYWx0aW1lRmlsZVNwbGl0LmphdmE=) | | | |
| [...rg/apache/hudi/common/fs/inline/InLineFSUtils.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9JbkxpbmVGU1V0aWxzLmphdmE=) | | | |
| [...e/hudi/common/util/collection/ImmutableTriple.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvY29sbGVjdGlvbi9JbW11dGFibGVUcmlwbGUuamF2YQ==) | | | |
| ... and [360
[GitHub] [hudi] danny0405 commented on pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer
danny0405 commented on pull request #2430: URL: https://github.com/apache/hudi/pull/2430#issuecomment-768022341

> @danny0405 sorry for the delay on review, I was super busy this week. The bloom index was merged to master, can we add the bloom index option to this PR as well?

I'm not planning to use the BloomFilter index in the new pipeline. Instead, a BloomFilter-backed state index comes in the following PR, which is better suited to streaming writes.
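The reply above mentions a "BloomFilter-backed state index" for streaming writes. As a rough illustration of the data structure involved (a toy sketch over `java.util.BitSet`, NOT Hudi's implementation; the class name, sizing, and hashing below are invented for the example), a Bloom filter answers "might this record key exist?" with possible false positives but no false negatives:

```java
import java.util.BitSet;

// Toy Bloom filter sketch -- not Hudi's index; names and parameters invented.
public class BloomSketch {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  public BloomSketch(int numBits, int numHashes) {
    this.bits = new BitSet(numBits);
    this.numBits = numBits;
    this.numHashes = numHashes;
  }

  // Cheap double hashing for the sketch; real indexes use stronger hashes.
  private int slot(String key, int i) {
    int h = key.hashCode() ^ (i * 0x9E3779B9);
    return Math.floorMod(h, numBits);
  }

  public void add(String recordKey) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(slot(recordKey, i));
    }
  }

  // May report false positives, never false negatives:
  // a false answer means the key was definitely never added.
  public boolean mightContain(String recordKey) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(slot(recordKey, i))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    BloomSketch bf = new BloomSketch(1024, 3);
    bf.add("hoodie_key_1");
    System.out.println(bf.mightContain("hoodie_key_1")); // true
  }
}
```

The appeal for streaming writes is that a per-state-backend filter can cheaply rule out most "record does not exist yet" lookups before touching storage.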
[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer
danny0405 commented on a change in pull request #2430: URL: https://github.com/apache/hudi/pull/2430#discussion_r565021738

## File path: hudi-flink/src/main/java/org/apache/hudi/operator/StreamWriteOperatorCoordinator.java

## @@ -0,0 +1,413 @@

```diff
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.operator;
+
+import org.apache.hudi.client.FlinkTaskContextSupplier;
+import org.apache.hudi.client.HoodieFlinkWriteClient;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.common.HoodieFlinkEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.operator.event.BatchWriteSuccessEvent;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.flink.annotation.VisibleForTesting;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.core.memory.DataInputViewStreamWrapper;
+import org.apache.flink.core.memory.DataOutputViewStreamWrapper;
+import org.apache.flink.runtime.jobgraph.OperatorID;
+import org.apache.flink.runtime.operators.coordination.OperatorCoordinator;
+import org.apache.flink.runtime.operators.coordination.OperatorEvent;
+import org.apache.flink.util.Preconditions;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.jetbrains.annotations.Nullable;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.CompletionException;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+/**
+ * {@link OperatorCoordinator} for {@link StreamWriteFunction}.
+ *
+ * This coordinator starts a new instant when a new checkpoint starts. It commits the instant when all the
+ * operator tasks write the buffer successfully for a round of checkpoint.
+ *
+ * If there is no data for a round of checkpointing, it rolls back the metadata.
+ *
+ * @see StreamWriteFunction for the work flow and semantics
+ */
+public class StreamWriteOperatorCoordinator
+    implements OperatorCoordinator {
+  private static final Logger LOG = LoggerFactory.getLogger(StreamWriteOperatorCoordinator.class);
+
+  /**
+   * Config options.
+   */
+  private final Configuration conf;
+
+  /**
+   * Write client.
+   */
+  private transient HoodieFlinkWriteClient writeClient;
+
+  private long inFlightCheckpoint = -1;
+
+  /**
+   * Current REQUESTED instant, for validation.
+   */
+  private String inFlightInstant = "";
+
+  /**
+   * Event buffer for one round of checkpointing. When all the elements are non-null and have the same
+   * write instant, then the instant succeed and we can commit it.
+   */
+  private transient BatchWriteSuccessEvent[] eventBuffer;
+
+  /**
+   * Task number of the operator.
+   */
+  private final int parallelism;
+
+  /**
+   * Constructs a StreamingSinkOperatorCoordinator.
+   *
+   * @param conf        The config options
+   * @param parallelism The operator task number
+   */
+  public StreamWriteOperatorCoordinator(
+      Configuration conf,
+      int parallelism) {
+    this.conf = conf;
+    this.parallelism = parallelism;
+  }
+
+  @Override
+  public void start() throws Exception {
+    // initialize event buffer
+    reset();
+    // writeClient
+    initWriteClient();
+    // init table, create it if not exists.
+    initTable();
+  }
+
+  @Override
+  public void close() {
+    if (writeClient != null) {
+      writeClient.close();
+    }
+    this.eventBuffer = null;
+  }
+
+  @Override
+  public void checkpointCoordinator(long checkpointId, CompletableFuture<byte[]> result) {
+    try {
+      final String errMsg = "A new checkpoint starts while the last checkpoint buffer" +
```
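The coordinator javadoc quoted above says the instant is committed only once every operator task has successfully written its buffer for the checkpoint, with the event buffer holding one slot per subtask. A minimal stand-alone sketch of that readiness rule (class and method names are invented for illustration; this is not the actual Hudi code, which checks `BatchWriteSuccessEvent` objects rather than plain strings):

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical sketch of the commit-readiness check described in the
// coordinator javadoc: commit only when every subtask slot holds a success
// event for the same in-flight instant; a null slot means "not reported yet".
public class CommitReadinessSketch {
  public static boolean readyToCommit(String[] eventInstants, String inFlightInstant) {
    return Arrays.stream(eventInstants)
        .allMatch(instant -> Objects.equals(instant, inFlightInstant));
  }

  public static void main(String[] args) {
    String[] buffer = {"20210126", null}; // parallelism 2, one subtask pending
    System.out.println(readyToCommit(buffer, "20210126")); // false
    buffer[1] = "20210126";               // second subtask reports success
    System.out.println(readyToCommit(buffer, "20210126")); // true
  }
}
```

This is why the buffer is reset at the start of each checkpoint: stale events from a previous instant must never satisfy the check for the current one.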
[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer
danny0405 commented on a change in pull request #2430: URL: https://github.com/apache/hudi/pull/2430#discussion_r565020444

## File path: hudi-flink/src/main/java/org/apache/hudi/operator/StreamWriteOperatorCoordinator.java

## @@ -0,0 +1,413 @@ (same quoted hunk as the previous comment, up to the `checkpointCoordinator` error message)
[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer
danny0405 commented on a change in pull request #2430: URL: https://github.com/apache/hudi/pull/2430#discussion_r565019061

## File path: hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java

## @@ -0,0 +1,248 @@

```diff
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.operator;
+
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.streamer.FlinkStreamerConfig;
+import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.keygen.SimpleAvroKeyGenerator;
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.flink.configuration.ConfigOption;
+import org.apache.flink.configuration.ConfigOptions;
+import org.apache.flink.configuration.Configuration;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Hoodie Flink config options.
+ *
+ * It has the options for Hoodie table read and write. It also defines some utilities.
+ */
+public class FlinkOptions {
+  private FlinkOptions() {
+  }
+
+  //
+  //  Base Options
+  //
+  public static final ConfigOption<String> PATH = ConfigOptions
+      .key("path")
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Base path for the target hoodie table."
+          + "\nThe path would be created if it does not exist,\n"
+          + "otherwise a Hoodie table expects to be initialized successfully");
+
+  //
+  //  Read Options
+  //
+  public static final ConfigOption<String> READ_SCHEMA_FILE_PATH = ConfigOptions
+      .key("read.schema.file.path")
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Avro schema file path, the parsed schema is used for deserializing");
+
+  //
+  //  Write Options
+  //
+  public static final ConfigOption<String> TABLE_NAME = ConfigOptions
+      .key(HoodieWriteConfig.TABLE_NAME)
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Table name to register to Hive metastore");
+
+  public static final ConfigOption<String> TABLE_TYPE = ConfigOptions
+      .key("write.table.type")
+      .stringType()
+      .defaultValue("COPY_ON_WRITE")
+      .withDescription("Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READ");
+
+  public static final ConfigOption<String> OPERATION = ConfigOptions
+      .key("write.operation")
+      .stringType()
+      .defaultValue("upsert")
+      .withDescription("The write operation, that this write should do");
+
+  public static final ConfigOption<String> PRECOMBINE_FIELD = ConfigOptions
+      .key("write.precombine.field")
+      .stringType()
+      .defaultValue("ts")
+      .withDescription("Field used in preCombining before actual write. When two records have the same\n"
+          + "key value, we will pick the one with the largest value for the precombine field,\n"
+          + "determined by Object.compareTo(..)");
+
+  public static final ConfigOption<String> PAYLOAD_CLASS = ConfigOptions
+      .key("write.payload.class")
+      .stringType()
+      .defaultValue(OverwriteWithLatestAvroPayload.class.getName())
+      .withDescription("Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting.\n"
+          + "This will render any value set for the option in-effective");
+
+  /**
+   * Flag to indicate whether to drop duplicates upon insert.
+   * By default insert will accept duplicates, to gain extra performance.
+   */
+  public static final ConfigOption<Boolean> INSERT_DROP_DUPS = ConfigOptions
+      .key("write.insert.drop.duplicates")
+      .booleanType()
+      .defaultValue(false)
+      .withDescription("Flag to indicate whether to drop duplicates upon insert.\n"
+          + "By default insert will accept duplicates, to
```
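The `write.precombine.field` description above specifies the merge rule: among records sharing a key, keep the one whose precombine field compares largest. A hedged stand-alone sketch of that semantics (the `Record` class and its `ts` field are invented for the example; this is not Hudi's payload class):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of precombine semantics: for records with equal keys,
// the record with the larger precombine value ("ts" here) wins.
public class PrecombineSketch {
  public static final class Record {
    public final String key;
    public final long ts;     // plays the role of the precombine field
    public final String value;

    public Record(String key, long ts, String value) {
      this.key = key;
      this.ts = ts;
      this.value = value;
    }
  }

  public static Map<String, Record> precombine(Record... records) {
    Map<String, Record> latest = new HashMap<>();
    for (Record r : records) {
      // Keep the existing record on ties, otherwise the larger ts wins,
      // mirroring the compareTo-based ordering in the option description.
      latest.merge(r.key, r, (oldRec, newRec) -> oldRec.ts >= newRec.ts ? oldRec : newRec);
    }
    return latest;
  }

  public static void main(String[] args) {
    Map<String, Record> out = precombine(
        new Record("k1", 1L, "old"),
        new Record("k1", 2L, "new"));
    System.out.println(out.get("k1").value); // prints "new"
  }
}
```

With the default field `ts`, this is what lets an out-of-order stream still converge on the latest version of each row.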
[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer
danny0405 commented on a change in pull request #2430: URL: https://github.com/apache/hudi/pull/2430#discussion_r565017180

## File path: hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java (same quoted hunk as the previous comment, at the `write.operation` option's `.defaultValue("upsert")`)

Review comment: No, see `WriteOperationType#fromValue`.
[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer
danny0405 commented on a change in pull request #2430: URL: https://github.com/apache/hudi/pull/2430#discussion_r565016901

## File path: hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java

```diff
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.operator;
+
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.streamer.FlinkStreamerConfig;
+import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.keygen.SimpleAvroKeyGenerator;
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.flink.configuration.ConfigOption;
+import org.apache.flink.configuration.ConfigOptions;
+import org.apache.flink.configuration.Configuration;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Hoodie Flink config options.
+ *
+ * It has the options for Hoodie table read and write. It also defines some utilities.
+ */
+public class FlinkOptions {
+  private FlinkOptions() {
+  }
+
+  //
+  // Base Options
+  //
+  public static final ConfigOption<String> PATH = ConfigOptions
+      .key("path")
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Base path for the target hoodie table."
+          + "\nThe path would be created if it does not exist,\n"
+          + "otherwise a Hoodie table expects to be initialized successfully");
+
+  //
+  // Read Options
+  //
+  public static final ConfigOption<String> READ_SCHEMA_FILE_PATH = ConfigOptions
+      .key("read.schema.file.path")
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Avro schema file path, the parsed schema is used for deserializing");
+
+  //
+  // Write Options
+  //
+  public static final ConfigOption<String> TABLE_NAME = ConfigOptions
+      .key(HoodieWriteConfig.TABLE_NAME)
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Table name to register to Hive metastore");
+
+  public static final ConfigOption<String> TABLE_TYPE = ConfigOptions
+      .key("write.table.type")
+      .stringType()
+      .defaultValue("COPY_ON_WRITE")
+      .withDescription("Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READ");
```

Review comment: Replace with `,` instead.
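For readers unfamiliar with the Flink `ConfigOptions` fluent builder used throughout the file under review, the pattern can be imitated with a few self-contained classes. This is an illustrative sketch only: the names mirror Flink's API (`key`, `stringType`, `defaultValue`, `withDescription`), but the implementation below is a hypothetical stand-in, not Flink's real classes.

```java
// Minimal, self-contained imitation of Flink's ConfigOptions fluent builder.
// Illustrative only; not Flink's actual implementation.
public class MiniConfigOptions {

    // An immutable option definition: key, optional default, and description.
    static class ConfigOption<T> {
        final String key;
        final T defaultValue;
        final String description;
        ConfigOption(String key, T defaultValue, String description) {
            this.key = key;
            this.defaultValue = defaultValue;
            this.description = description;
        }
    }

    // First builder stage: pick the key, then the value type.
    static class Builder {
        private final String key;
        Builder(String key) { this.key = key; }
        TypedBuilder<String> stringType() { return new TypedBuilder<>(key); }
    }

    // Second stage: choose a default (or none), then finish with a description.
    static class TypedBuilder<T> {
        private final String key;
        private T defaultValue;
        TypedBuilder(String key) { this.key = key; }
        TypedBuilder<T> defaultValue(T value) { this.defaultValue = value; return this; }
        TypedBuilder<T> noDefaultValue() { this.defaultValue = null; return this; }
        ConfigOption<T> withDescription(String desc) {
            return new ConfigOption<>(key, defaultValue, desc);
        }
    }

    static Builder key(String key) { return new Builder(key); }

    // Mirrors the TABLE_TYPE option from the diff above.
    public static final ConfigOption<String> TABLE_TYPE = key("write.table.type")
        .stringType()
        .defaultValue("COPY_ON_WRITE")
        .withDescription("Type of table to write, COPY_ON_WRITE or MERGE_ON_READ");

    public static void main(String[] args) {
        System.out.println(TABLE_TYPE.key + " defaults to " + TABLE_TYPE.defaultValue);
    }
}
```

The two-stage builder is what lets the real Flink API attach a Java type to each option at compile time, so `Configuration.get(option)` can return a typed value rather than a raw string.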
[GitHub] [hudi] garyli1019 commented on pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer
garyli1019 commented on pull request #2430: URL: https://github.com/apache/hudi/pull/2430#issuecomment-767994932

@danny0405 sorry for the delay on the review, I was super busy this week. The bloom index was merged to master; can we add the bloom index option to this PR as well?
[GitHub] [hudi] prashantwason commented on pull request #2496: [HUDI-1554] Introduced buffering for streams in HUDI.
prashantwason commented on pull request #2496: URL: https://github.com/apache/hudi/pull/2496#issuecomment-767988430

@n3nash Please review as this may provide benefits for HDFS workloads.
[jira] [Updated] (HUDI-1554) Introduce buffering for streams in HUDI
[ https://issues.apache.org/jira/browse/HUDI-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-1554:
---------------------------------
    Labels: pull-request-available  (was: )

> Introduce buffering for streams in HUDI
> ---------------------------------------
>
>                 Key: HUDI-1554
>                 URL: https://issues.apache.org/jira/browse/HUDI-1554
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>              Labels: pull-request-available
>
> Input and Output streams created in HUDI through calls to HoodieWrapperFileSystem do not include any buffering unless the underlying file system implements buffering.
> DistributedFileSystem (over HDFS) does not implement any buffering. This leads to a very large number of small-sized IO calls being sent to HDFS while performing HUDI IO operations like reading parquet, writing parquet, reading/writing log files, reading/writing instants, etc.
> This patch introduces buffering at the HoodieWrapperFileSystem level so that all types of reads and writes benefit from buffering.
>
> In my tests at scale on HDFS, writing 1 million records into a parquet file (read from an existing parquet file in the same dataset), I observed the following benefits:
> # about 40% reduction in total time to run the test
> # Total write calls to HDFS reduced from 19.1M -> 328
> # Total read calls reduced from 229M -> 515K

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[GitHub] [hudi] prashantwason opened a new pull request #2496: [HUDI-1554] Introduced buffering for streams in HUDI.
prashantwason opened a new pull request #2496: URL: https://github.com/apache/hudi/pull/2496

## What is the purpose of the pull request

Input and Output streams created in HUDI through calls to HoodieWrapperFileSystem do not include any buffering unless the underlying file system implements buffering. This patch introduces buffering at the HoodieWrapperFileSystem level so that all types of reads and writes benefit from buffering.

## Brief change log

HoodieWrapperFileSystem changed to introduce BufferedStreams.

## Verify this pull request

This pull request is already covered by existing tests.

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
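The effect the PR describes — far fewer, larger IO calls once a buffered stream sits between the reader and the filesystem — can be demonstrated with plain JDK streams. This is an illustrative sketch only: `CountingInputStream` and the byte-at-a-time reader below are hypothetical stand-ins for an unbuffered filesystem stream and a record decoder, not Hudi's actual HoodieWrapperFileSystem code.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferingDemo {

    // Wraps a stream and counts how many read() calls actually reach it,
    // standing in for "calls that reach the filesystem".
    static class CountingInputStream extends InputStream {
        private final InputStream in;
        int readCalls = 0;
        CountingInputStream(InputStream in) { this.in = in; }
        @Override public int read() throws IOException {
            readCalls++;
            return in.read();
        }
        @Override public int read(byte[] b, int off, int len) throws IOException {
            readCalls++;
            return in.read(b, off, len);
        }
    }

    // Drains `size` bytes one byte at a time (as many record decoders do)
    // and returns how many calls reached the underlying stream.
    static int drainAndCount(boolean buffered, int size) {
        CountingInputStream counting =
            new CountingInputStream(new ByteArrayInputStream(new byte[size]));
        InputStream in = buffered ? new BufferedInputStream(counting, 8192) : counting;
        try {
            while (in.read() != -1) {
                // consume byte by byte
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return counting.readCalls;
    }

    public static void main(String[] args) {
        int size = 64 * 1024;
        // Unbuffered: roughly one underlying call per byte read.
        System.out.println("unbuffered calls: " + drainAndCount(false, size));
        // Buffered: roughly one underlying call per 8 KB block fill.
        System.out.println("buffered calls:   " + drainAndCount(true, size));
    }
}
```

The same wrapping idea, applied at a wrapper-filesystem layer, is what turns millions of tiny HDFS reads and writes into a handful of block-sized ones, consistent with the 19.1M -> 328 write-call reduction reported above.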
[jira] [Created] (HUDI-1554) Introduce buffering for streams in HUDI
Prashant Wason created HUDI-1554:
------------------------------------

             Summary: Introduce buffering for streams in HUDI
                 Key: HUDI-1554
                 URL: https://issues.apache.org/jira/browse/HUDI-1554
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Prashant Wason
            Assignee: Prashant Wason

Input and Output streams created in HUDI through calls to HoodieWrapperFileSystem do not include any buffering unless the underlying file system implements buffering.

DistributedFileSystem (over HDFS) does not implement any buffering. This leads to a very large number of small-sized IO calls being sent to HDFS while performing HUDI IO operations like reading parquet, writing parquet, reading/writing log files, reading/writing instants, etc.

This patch introduces buffering at the HoodieWrapperFileSystem level so that all types of reads and writes benefit from buffering.

In my tests at scale on HDFS, writing 1 million records into a parquet file (read from an existing parquet file in the same dataset), I observed the following benefits:
# about 40% reduction in total time to run the test
# Total write calls to HDFS reduced from 19.1M -> 328
# Total read calls reduced from 229M -> 515K
[GitHub] [hudi] vinothchandar commented on pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
vinothchandar commented on pull request #2494: URL: https://github.com/apache/hudi/pull/2494#issuecomment-767962764

> The size of the base file was 3MB so this means that the in-memory HFile block caching was also working.

Trying to understand this part. Was the workload trying to fetch all the keys out of the HFile, or just one?
[GitHub] [hudi] codecov-io commented on pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
codecov-io commented on pull request #2494: URL: https://github.com/apache/hudi/pull/2494#issuecomment-767956391

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2494?src=pr=h1) Report
> Merging [#2494](https://codecov.io/gh/apache/hudi/pull/2494?src=pr=desc) (19894f6) into [master](https://codecov.io/gh/apache/hudi/commit/c8ee40f8ae34607072a27d4e7ccb21fc4df13ca1?el=desc) (c8ee40f) will **decrease** coverage by `40.49%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2494/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2494?src=pr=tree)

```diff
@@             Coverage Diff              @@
##             master   #2494       +/-   ##
============================================
- Coverage     50.18%    9.68%    -40.50%
+ Complexity     3051       48      -3003
============================================
  Files           419       53       -366
  Lines         18931     1930     -17001
  Branches       1948      230      -1718
============================================
- Hits           9501      187      -9314
+ Misses         8656     1730      -6926
+ Partials        774       13       -761
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.68% <ø> (-59.80%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2494?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%>
[jira] [Updated] (HUDI-1553) Add configs for TimelineServer to configure Jetty
[ https://issues.apache.org/jira/browse/HUDI-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-1553:
---------------------------------
    Labels: pull-request-available  (was: )

> Add configs for TimelineServer to configure Jetty
> -------------------------------------------------
>
>                 Key: HUDI-1553
>                 URL: https://issues.apache.org/jira/browse/HUDI-1553
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>              Labels: pull-request-available
>
> TimelineServer uses Javalin, which is based on Jetty.
> By default Jetty:
> * Has 200 threads
> * Compresses output by gzip
> * Handles each request sequentially
>
> On a large-scale HUDI dataset (2000 partitions), when TimelineServer is enabled, the operations slow down due to the following reasons:
> # The driver process usually has a few cores. 200 Jetty threads lead to huge contention when 100s of executors connect to the Server in parallel.
> # To handle a large number of requests in parallel, it's better to handle each HTTP request in an asynchronous manner using Futures, which are supported by Javalin.
> # The compute overhead of gzipping may not be necessary when the executors and driver are in the same rack or within the same datacenter.
[GitHub] [hudi] prashantwason opened a new pull request #2495: [HUDI-1553] Configuration and metrics for the TimelineService.
prashantwason opened a new pull request #2495: URL: https://github.com/apache/hudi/pull/2495

## What is the purpose of the pull request

TimelineServer uses Javalin, which is based on Jetty. By default Jetty:
- Has 200 threads
- Compresses output by gzip
- Handles each request sequentially

On a large-scale HUDI dataset (2000 partitions), when TimelineServer is enabled, the operations slow down due to the following reasons:
- The driver process usually has a few cores. 200 Jetty threads lead to huge contention when 100s of executors connect to the Server in parallel.
- To handle a large number of requests in parallel, it's better to handle each HTTP request in an asynchronous manner using Futures, which are supported by Javalin.
- The compute overhead of gzipping may not be necessary when the executors and driver are in the same rack or within the same datacenter.

## Brief change log

Added settings to control the number of threads created, whether to gzip output, and whether to use asynchronous processing of requests. With all the settings enabled, a driver process with 8 cores is able to handle 1024 executors in parallel on a table with 2000 partitions (CLEAN operation, which lists all partitions). The time per API request was also reduced from 800 msec to 60 msec.

## Verify this pull request

This pull request is already covered by existing tests, such as TimelineServer tests and integration tests.

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
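The asynchronous-handling idea from the PR description — serving many concurrent requests from a small, bounded pool of futures instead of one dedicated thread per in-flight request — can be sketched with JDK primitives. This is not Javalin's or the TimelineService's actual code; the handler and class names below are hypothetical stand-ins.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class AsyncRequestDemo {

    // Stand-in for serving one request (e.g. listing partitions or
    // returning timeline metadata to an executor).
    static String handle(int requestId) {
        return "response-" + requestId;
    }

    // Serve `requests` concurrent requests from a fixed pool of `poolSize`
    // threads: each request becomes a future rather than a dedicated thread.
    static List<String> serveAll(int requests, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<CompletableFuture<String>> futures = IntStream.range(0, requests)
                .mapToObj(i -> CompletableFuture.supplyAsync(() -> handle(i), pool))
                .collect(Collectors.toList());
            return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // 1024 "executors" served by 8 threads, matching the scale in the PR text.
        List<String> responses = serveAll(1024, 8);
        System.out.println(responses.size() + " requests served");
    }
}
```

The design point is the same as in the PR: with few cores on the driver, a small pool plus futures avoids the contention of hundreds of simultaneously scheduled Jetty threads.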
[jira] [Updated] (HUDI-1553) Add configs for TimelineServer to configure Jetty
[ https://issues.apache.org/jira/browse/HUDI-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-1553:
---------------------------------
    Status: In Progress  (was: Open)

> Add configs for TimelineServer to configure Jetty
> -------------------------------------------------
>
>                 Key: HUDI-1553
>                 URL: https://issues.apache.org/jira/browse/HUDI-1553
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>
> TimelineServer uses Javalin, which is based on Jetty.
> By default Jetty:
> * Has 200 threads
> * Compresses output by gzip
> * Handles each request sequentially
>
> On a large-scale HUDI dataset (2000 partitions), when TimelineServer is enabled, the operations slow down due to the following reasons:
> # The driver process usually has a few cores. 200 Jetty threads lead to huge contention when 100s of executors connect to the Server in parallel.
> # To handle a large number of requests in parallel, it's better to handle each HTTP request in an asynchronous manner using Futures, which are supported by Javalin.
> # The compute overhead of gzipping may not be necessary when the executors and driver are in the same rack or within the same datacenter.
[jira] [Created] (HUDI-1553) Add configs for TimelineServer to configure Jetty
Prashant Wason created HUDI-1553:
------------------------------------

             Summary: Add configs for TimelineServer to configure Jetty
                 Key: HUDI-1553
                 URL: https://issues.apache.org/jira/browse/HUDI-1553
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Prashant Wason
            Assignee: Prashant Wason

TimelineServer uses Javalin, which is based on Jetty.

By default Jetty:
* Has 200 threads
* Compresses output by gzip
* Handles each request sequentially

On a large-scale HUDI dataset (2000 partitions), when TimelineServer is enabled, the operations slow down due to the following reasons:
# The driver process usually has a few cores. 200 Jetty threads lead to huge contention when 100s of executors connect to the Server in parallel.
# To handle a large number of requests in parallel, it's better to handle each HTTP request in an asynchronous manner using Futures, which are supported by Javalin.
# The compute overhead of gzipping may not be necessary when the executors and driver are in the same rack or within the same datacenter.
[GitHub] [hudi] codecov-io commented on pull request #2493: [WIP] Change another way to convert Path with Scheme
codecov-io commented on pull request #2493: URL: https://github.com/apache/hudi/pull/2493#issuecomment-767881117

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2493?src=pr=h1) Report
> Merging [#2493](https://codecov.io/gh/apache/hudi/pull/2493?src=pr=desc) (ede32d2) into [master](https://codecov.io/gh/apache/hudi/commit/c8ee40f8ae34607072a27d4e7ccb21fc4df13ca1?el=desc) (c8ee40f) will **decrease** coverage by `40.49%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2493/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2493?src=pr=tree)

```diff
@@             Coverage Diff              @@
##             master   #2493       +/-   ##
============================================
- Coverage     50.18%    9.68%    -40.50%
+ Complexity     3051       48      -3003
============================================
  Files           419       53       -366
  Lines         18931     1930     -17001
  Branches       1948      230      -1718
============================================
- Hits           9501      187      -9314
+ Misses         8656     1730      -6926
+ Partials        774       13       -761
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.68% <ø> (-59.80%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2493?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%>
[jira] [Updated] (HUDI-1552) Improve performance of key lookups from base file (HFile) in Metadata table
[ https://issues.apache.org/jira/browse/HUDI-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-1552:
---------------------------------
    Labels: pull-request-available  (was: )

> Improve performance of key lookups from base file (HFile) in Metadata table
> ---------------------------------------------------------------------------
>
>                 Key: HUDI-1552
>                 URL: https://issues.apache.org/jira/browse/HUDI-1552
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>              Labels: pull-request-available
[GitHub] [hudi] prashantwason opened a new pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
prashantwason opened a new pull request #2494: URL: https://github.com/apache/hudi/pull/2494

## What is the purpose of the pull request

Improves the performance of key lookups from the Metadata Table. In my scale testing with 150 partitions and 100K+ files on HDFS, the time to read a key was reduced (100ms avg -> 10ms) and the total data read from the HFile was reduced (85MB -> 3MB). The size of the base file was 3MB, so this means that the in-memory HFile block caching was also working.

## Brief change log

1. Cache the KeyScanner across lookups so that the HFile index does not have to be read for each lookup.
2. Enable block caching in KeyScanner.
3. Move the lock to a limited scope of the code to reduce lock contention.

## Verify this pull request

This pull request is already covered by existing tests, such as *(please describe tests)*.

mvn test -pl hudi-client/hudi-spark-client -Dtest=TestHoodieBackedMetadata

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
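The caching strategy in the change log — build the expensive scanner once, reuse it across lookups, and narrow the lock to initialization only — follows a familiar pattern. Below is a minimal stand-alone sketch; `Scanner` is a hypothetical stand-in for the HFile reader, not the real KeyScanner API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class CachedScannerDemo {

    // Counts how many times the expensive index construction runs.
    static final AtomicInteger indexBuilds = new AtomicInteger();

    // Stand-in for an HFile-style scanner whose construction reads index blocks.
    static class Scanner {
        private final Map<String, String> index = new HashMap<>();
        Scanner() {
            indexBuilds.incrementAndGet(); // expensive: one pass over index blocks
            index.put("partition-0001", "files=42");
            index.put("partition-0002", "files=17");
        }
        String lookup(String key) {
            return index.get(key);
        }
    }

    private static volatile Scanner cached;

    // Double-checked lazy init: the lock covers only scanner construction,
    // so concurrent lookups against the warm cache never contend on it.
    static String lookup(String key) {
        Scanner s = cached;
        if (s == null) {
            synchronized (CachedScannerDemo.class) {
                if (cached == null) {
                    cached = new Scanner();
                }
                s = cached;
            }
        }
        return s.lookup(key); // lookup proceeds without holding the lock
    }

    public static void main(String[] args) {
        lookup("partition-0001");
        lookup("partition-0002");
        lookup("partition-0001");
        System.out.println("index built " + indexBuilds.get() + " time(s) for 3 lookups");
    }
}
```

Without the cache, each lookup would rebuild the index (re-reading the HFile's index blocks), which is consistent with the 85MB -> 3MB reduction in data read that the PR reports once the scanner is reused.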
[jira] [Created] (HUDI-1552) Improve performance of key lookups from base file (HFile) in Metadata table
Prashant Wason created HUDI-1552:
------------------------------------

             Summary: Improve performance of key lookups from base file (HFile) in Metadata table
                 Key: HUDI-1552
                 URL: https://issues.apache.org/jira/browse/HUDI-1552
             Project: Apache Hudi
          Issue Type: Sub-task
            Reporter: Prashant Wason
            Assignee: Prashant Wason
[GitHub] [hudi] kimberlyamandalu commented on issue #1977: Error running hudi on aws glue
kimberlyamandalu commented on issue #1977: URL: https://github.com/apache/hudi/issues/1977#issuecomment-767812424

Quick share: https://aws.amazon.com/blogs/big-data/writing-to-apache-hudi-tables-using-aws-glue-connector/
[GitHub] [hudi] zhedoubushishi commented on pull request #2485: [HUDI-1109] Support Spark Structured Streaming read from Hudi table
zhedoubushishi commented on pull request #2485: URL: https://github.com/apache/hudi/pull/2485#issuecomment-767810591

> @pengzhiwei2018 thanks for your contribution. Left some comments but I am not quite familiar with Structured streaming. @zhedoubushishi mind taking a pass as well?

Sure, will take a look.
[jira] [Updated] (HUDI-825) Write a small blog on how to use hudi-spark with pyspark
[ https://issues.apache.org/jira/browse/HUDI-825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Govindarajan updated HUDI-825:
-------------------------------------
    Status: Open  (was: New)

> Write a small blog on how to use hudi-spark with pyspark
> --------------------------------------------------------
>
>                 Key: HUDI-825
>                 URL: https://issues.apache.org/jira/browse/HUDI-825
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: Docs
>            Reporter: Nishith Agarwal
>            Assignee: Vinoth Govindarajan
>            Priority: Major
>              Labels: user-support-issues
[jira] [Updated] (HUDI-259) Hadoop 3 support for Hudi writing
[ https://issues.apache.org/jira/browse/HUDI-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-259:
-------------------------------------
    Labels: user-support-issues  (was: bug-bash-0.6.0)

> Hadoop 3 support for Hudi writing
> ---------------------------------
>
>                 Key: HUDI-259
>                 URL: https://issues.apache.org/jira/browse/HUDI-259
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Usability
>            Reporter: Vinoth Chandar
>            Assignee: Wenning Ding
>            Priority: Major
>              Labels: user-support-issues
>             Fix For: 0.8.0
>
> Sample issues:
> [https://github.com/apache/incubator-hudi/issues/735]
> [https://github.com/apache/incubator-hudi/issues/877#issuecomment-528433568]
> [https://github.com/apache/incubator-hudi/issues/898]
> https://github.com/apache/hudi/issues/1776
[jira] [Commented] (HUDI-281) HiveSync failure through Spark when useJdbc is set to false
[ https://issues.apache.org/jira/browse/HUDI-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272398#comment-17272398 ]

sivabalan narayanan commented on HUDI-281:
------------------------------------------

[~uditme]: is this still a valid issue?

> HiveSync failure through Spark when useJdbc is set to false
> -----------------------------------------------------------
>
>                 Key: HUDI-281
>                 URL: https://issues.apache.org/jira/browse/HUDI-281
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Hive Integration, Spark Integration, Usability
>            Reporter: Udit Mehrotra
>            Priority: Major
>              Labels: bug-bash-0.6.0
>
> Table creation with Hive sync through Spark fails when I set *useJdbc* to *false*. Currently I had to modify the code to set *useJdbc* to *false*, as there is no *DataSourceOption* through which I can specify this field when running Hudi code.
> Here is the failure:
> {noformat}
> java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.session.SessionState.start(Lorg/apache/hudi/org/apache/hadoop_hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/session/SessionState;
> at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:527)
> at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:517)
> at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:507)
> at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:272)
> at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:132)
> at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
> at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
> at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
> at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
> at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
> at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229){noformat}
> I was expecting this to fail through Spark, because *hive-exec* is not shaded inside *hudi-spark-bundle*, while *HiveConf* is shaded and relocated. This *SessionState* is coming from the spark-hive jar and obviously it does not accept the relocated *HiveConf*.
> We in *EMR* are running into the same problem when trying to integrate with the Glue Catalog. For this we have to create the Hive metastore client through *Hive.get(conf).getMsc()* instead of how it is being done now, so that alternate implementations of the metastore can get created. However, because hive-exec is not shaded but HiveConf is relocated, we run into the same issues there.
> It would not be recommended to shade *hive-exec* either, because it itself is an uber jar that shades a lot of things, and all of them would end up in the *hudi-spark-bundle* jar. We would not want to go down that route.
[jira] [Updated] (HUDI-281) HiveSync failure through Spark when useJdbc is set to false
[ https://issues.apache.org/jira/browse/HUDI-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-281: - Labels: user-support-issues (was: bug-bash-0.6.0) > HiveSync failure through Spark when useJdbc is set to false > --- > > Key: HUDI-281 > URL: https://issues.apache.org/jira/browse/HUDI-281 > Project: Apache Hudi > Issue Type: Improvement > Components: Hive Integration, Spark Integration, Usability >Reporter: Udit Mehrotra >Priority: Major > Labels: user-support-issues > > Table creation with Hive sync through Spark fails when I set *useJdbc* to > *false*. Currently I had to modify the code to set *useJdbc* to *false* as > there is no *DataSourceOption* through which I can specify this field when > running Hudi code. > Here is the failure: > {noformat} > java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.start(Lorg/apache/hudi/org/apache/hadoop_hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/session/SessionState; > at > org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:527) > at > org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:517) > at > org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:507) > at > org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:272) > at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:132) > at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96) > at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68) > at > org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235) > at > org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) > at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229){noformat} > I was expecting this to fail through Spark, because *hive-exec* is not shaded > inside *hudi-spark-bundle*, while *HiveConf* is shaded and relocated. 
This > *SessionState* is coming from the spark-hive jar, and obviously it does not > accept the relocated *HiveConf*. > We in *EMR* are running into the same problem when trying to integrate with Glue > Catalog. For this we have to create the Hive metastore client through > *Hive.get(conf).getMsc()* instead of how it is being done now, so that > alternate metastore implementations can get created. However, because > hive-exec is not shaded but HiveConf is relocated, we run into the same issues > there. > It would not be recommended to shade *hive-exec* either, because it itself is > an uber jar that shades a lot of things, and all of them would end up in the > *hudi-spark-bundle* jar. We would not want to head down that route. That is why, > we
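The relocated package is visible in the NoSuchMethodError signature above (Lorg/apache/hudi/org/apache/hadoop_hive/conf/HiveConf). As a hedged illustration (not the bundle's actual pom), a maven-shade-plugin relocation of this shape is what produces the mismatch: HiveConf is rewritten to the shaded name, while SessionState from the unshaded hive-exec still expects the original one.

```xml
<!-- Illustrative maven-shade-plugin fragment, not hudi-spark-bundle's exact pom. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <!-- HiveConf gets relocated to the shaded name seen in the stack trace... -->
      <relocation>
        <pattern>org.apache.hadoop.hive.conf.</pattern>
        <shadedPattern>org.apache.hudi.org.apache.hadoop_hive.conf.</shadedPattern>
      </relocation>
      <!-- ...but hive-exec (which provides SessionState) is neither bundled nor
           relocated, so SessionState.start(HiveConf) no longer matches at runtime. -->
    </relocations>
  </configuration>
</plugin>
```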
[jira] [Updated] (HUDI-280) Integrate Hudi to bigtop
[ https://issues.apache.org/jira/browse/HUDI-280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-280: - Labels: user-support-issues (was: ) > Integrate Hudi to bigtop > > > Key: HUDI-280 > URL: https://issues.apache.org/jira/browse/HUDI-280 > Project: Apache Hudi > Issue Type: Improvement > Components: Usability >Reporter: Vinoth Chandar >Assignee: leesf >Priority: Major > Labels: user-support-issues > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-824) Register hudi-spark package with spark packages repo for easier usage of Hudi
[ https://issues.apache.org/jira/browse/HUDI-824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272397#comment-17272397 ] Vinoth Govindarajan commented on HUDI-824: -- [~nagarwal] - All Apache projects are directly available for use with the `--packages` option. I tried it with pyspark and it worked: {code:java} spark-shell \ --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1 \ --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' {code} The same instructions have been updated in the following doc: [https://hudi.apache.org/docs/quick-start-guide.html] No further action is needed; let me know if it's okay to close this issue. > Register hudi-spark package with spark packages repo for easier usage of Hudi > - > > Key: HUDI-824 > URL: https://issues.apache.org/jira/browse/HUDI-824 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Reporter: Nishith Agarwal >Assignee: Vinoth Govindarajan >Priority: Minor > Labels: user-support-issues > > At the moment, to be able to use Hudi with Spark, users have to do the > following: > > {{spark-2.4.4-bin-hadoop2.7/bin/spark-shell \ > --jars `ls > packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` > \ > --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}} > > {{Ideally, we want to be able to use Hudi as follows:}} > > {{spark-2.4.4-bin-hadoop2.7/bin/spark-shell \ --packages > org.apache.hudi:hudi-spark-bundle: \ > --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
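For context, the difference between the two invocations in the issue body comes down to dependency resolution: `--packages` fetches published coordinates from a Maven repository at launch time, while `--jars` requires a locally built bundle jar. A command-line sketch of the released-coordinate form (versions are the ones quoted in the comment above; pick artifacts matching your Spark/Scala build):

```
# Before: build Hudi locally and point --jars at the bundle jar.
# After: resolve the released bundle from Maven Central at launch time.
spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```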
[jira] [Resolved] (HUDI-282) Update documentation to reflect additional option of HiveSync via metastore
[ https://issues.apache.org/jira/browse/HUDI-282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan resolved HUDI-282. -- Fix Version/s: 0.5.2 Resolution: Fixed > Update documentation to reflect additional option of HiveSync via metastore > --- > > Key: HUDI-282 > URL: https://issues.apache.org/jira/browse/HUDI-282 > Project: Apache Hudi > Issue Type: Task > Components: Hive Integration >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Labels: pull-request-available > Fix For: 0.5.2 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-295) Do one-time cleanup of Hudi git history
[ https://issues.apache.org/jira/browse/HUDI-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272396#comment-17272396 ] sivabalan narayanan commented on HUDI-295: -- [~vinoth]: Is this still valid? > Do one-time cleanup of Hudi git history > --- > > Key: HUDI-295 > URL: https://issues.apache.org/jira/browse/HUDI-295 > Project: Apache Hudi > Issue Type: Task > Components: Docs >Reporter: Vinoth Chandar >Priority: Major > > https://lists.apache.org/thread.html/dc6eb516e248088dac1a2b5c9690383dfe2eb3912f76bbe9dd763c2b@ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-303) Avro schema case sensitivity testing
[ https://issues.apache.org/jira/browse/HUDI-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-303: - Labels: user-support-issues (was: bug-bash-0.6.0) > Avro schema case sensitivity testing > > > Key: HUDI-303 > URL: https://issues.apache.org/jira/browse/HUDI-303 > Project: Apache Hudi > Issue Type: Test > Components: Spark Integration >Reporter: Udit Mehrotra >Assignee: liwei >Priority: Minor > Labels: user-support-issues > > As a fallout of [PR 956|https://github.com/apache/incubator-hudi/pull/956] we > would like to understand how Avro behaves with case sensitive column names. > Couple of action items: > * Test with different field names just differing in case. > * *AbstractRealtimeRecordReader* is one of the classes where we are > converting Avro Schema field names to lower case, to be able to verify them > against column names from Hive. We can consider removing the *lowercase* > conversion there if we verify it does not break anything. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
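The lowercase-conversion hazard described above can be shown with a tiny standalone sketch (plain Python, no Hudi involved; the field names are made up): once Avro field names that differ only in case are lowered to match Hive's case-insensitive column names, they collide and can no longer be told apart.

```python
# Illustrative sketch (not Hudi code): why lowercasing Avro field names to
# match Hive's case-insensitive columns is lossy. Field names are hypothetical.
avro_fields = ["userId", "userid", "eventTime"]

# Group original names by their lowercased form, as a lowercase conversion
# like the one in AbstractRealtimeRecordReader effectively does.
lowered = {}
for name in avro_fields:
    lowered.setdefault(name.lower(), []).append(name)

# Any bucket with more than one entry is a collision: two distinct Avro
# fields map onto the same Hive column name.
collisions = {k: v for k, v in lowered.items() if len(v) > 1}
print(collisions)  # -> {'userid': ['userId', 'userid']}
```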
[jira] [Resolved] (HUDI-306) Get Hudi to support AWS Glue Catalog and other Hive Metastore implementations
[ https://issues.apache.org/jira/browse/HUDI-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan resolved HUDI-306. -- Fix Version/s: 0.5.2 Resolution: Fixed > Get Hudi to support AWS Glue Catalog and other Hive Metastore > implementations > > > Key: HUDI-306 > URL: https://issues.apache.org/jira/browse/HUDI-306 > Project: Apache Hudi > Issue Type: Improvement > Components: Hive Integration >Reporter: Udit Mehrotra >Assignee: Udit Mehrotra >Priority: Major > Labels: pull-request-available > Fix For: 0.5.2 > > Time Spent: 10m > Remaining Estimate: 0h > > Hudi currently does not work with AWS Glue Catalog. The issue/exception it > runs into has been reported here as well: > [issue|https://github.com/apache/incubator-hudi/issues/954]. > As mentioned in the issue, the reason for this is: > * Currently Hudi interacts with Hive in two different ways: > ** The table creation statement is submitted directly to Hive via JDBC > [https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L472] > . Thus, Hive will internally create the right metastore client (i.e. Glue if > {{*hive.metastore.client.factory.class*}} is set to > {{*com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory*}} > in hive-site) > ** Whereas partition listing, among other things, is done by directly > calling Hive metastore APIs using the Hive metastore client: > [https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L240] > * In the Hudi code, a specific standard implementation of the metastore client > (not the Glue metastore client) is instantiated: > [https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L109] > . 
> * Ideally, this instantiation of the metastore client should be left to Hive > through > [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L5045] > so that it can consider other metastore client implementations that might be > configured through {{*hive.metastore.client.factory.class*}}. > That is why the table gets created in the Glue metastore, but reads and > partition scans talk to the local Hive metastore, where the table is not found. -- This message was sent by Atlassian Jira (v8.3.4#803005)
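The proposed direction — delegating client creation to Hive so that hive.metastore.client.factory.class is honored — can be outlined as Java-like pseudocode; everything except the Hive.get(conf).getMsc() call mentioned above is hypothetical, and this is not the actual HoodieHiveClient code:

```
// Pseudocode sketch, not actual Hudi code.
// Before: a concrete metastore client is instantiated directly, so the
// factory configured via hive.metastore.client.factory.class is ignored.
IMetaStoreClient msc = new HiveMetaStoreClient(hiveConf);

// After: let Hive construct the client, so an alternate factory such as
// com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
// (set in hive-site.xml) can supply a Glue-backed implementation.
IMetaStoreClient msc = Hive.get(hiveConf).getMSC();
```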
[jira] [Reopened] (HUDI-306) Get Hudi to support AWS Glue Catalog and other Hive Metastore implementations
[ https://issues.apache.org/jira/browse/HUDI-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reopened HUDI-306: -- > Get Hudi to support AWS Glue Catalog and other Hive Metastore > implementations > > > Key: HUDI-306 > URL: https://issues.apache.org/jira/browse/HUDI-306 > Project: Apache Hudi > Issue Type: Improvement > Components: Hive Integration >Reporter: Udit Mehrotra >Assignee: Udit Mehrotra >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > (Issue description identical to the HUDI-306 notice above.) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-306) Get Hudi to support AWS Glue Catalog and other Hive Metastore implementations
[ https://issues.apache.org/jira/browse/HUDI-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-306: - Status: Closed (was: Patch Available) > Get Hudi to support AWS Glue Catalog and other Hive Metastore > implementations > > > Key: HUDI-306 > URL: https://issues.apache.org/jira/browse/HUDI-306 > Project: Apache Hudi > Issue Type: Improvement > Components: Hive Integration >Reporter: Udit Mehrotra >Assignee: Udit Mehrotra >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > (Issue description identical to the HUDI-306 notice above.) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-307) Dataframe written with Date, Timestamp, Decimal is read with the same types
[ https://issues.apache.org/jira/browse/HUDI-307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan resolved HUDI-307. -- Fix Version/s: (was: 0.8.0) 0.7.0 Resolution: Fixed > Dataframe written with Date, Timestamp, Decimal is read with the same types > -- > > Key: HUDI-307 > URL: https://issues.apache.org/jira/browse/HUDI-307 > Project: Apache Hudi > Issue Type: Test > Components: Spark Integration >Reporter: Cosmin Iordache >Assignee: Udit Mehrotra >Priority: Minor > Labels: bug-bash-0.6.0, pull-request-available > Fix For: 0.7.0 > > > Small test for COW table to check the persistence of Date, Timestamp, Decimal > types -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-307) Dataframe written with Date, Timestamp, Decimal is read with the same types
[ https://issues.apache.org/jira/browse/HUDI-307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-307: - Status: In Progress (was: Open) > Dataframe written with Date, Timestamp, Decimal is read with the same types > -- > > Key: HUDI-307 > URL: https://issues.apache.org/jira/browse/HUDI-307 > Project: Apache Hudi > Issue Type: Test > Components: Spark Integration >Reporter: Cosmin Iordache >Assignee: Udit Mehrotra >Priority: Minor > Labels: bug-bash-0.6.0, pull-request-available > Fix For: 0.8.0 > > > Small test for COW table to check the persistence of Date, Timestamp, Decimal > types -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272392#comment-17272392 ] sivabalan narayanan commented on HUDI-310: -- [~vinoth]: Is this still relevant? Do we keep it open? > DynamoDB/Kinesis Change Capture using Delta Streamer > > > Key: HUDI-310 > URL: https://issues.apache.org/jira/browse/HUDI-310 > Project: Apache Hudi > Issue Type: New Feature > Components: DeltaStreamer >Reporter: Vinoth Chandar >Assignee: Suneel Marthi >Priority: Major > > The goal here is to do CDC from DynamoDB and then have it be ingested into S3 > as a Hudi dataset. > A few resources: > # DynamoDB Streams > [https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html] > provides change capture logs in Kinesis. > # Walkthrough > [https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html] > Code [https://github.com/awslabs/dynamodb-streams-kinesis-adapter] > # Spark Streaming has support for reading Kinesis streams > [https://spark.apache.org/docs/2.4.4/streaming-kinesis-integration.html] one > of the many resources showing how to change the Spark Kinesis example code to > consume a DynamoDB stream: > [https://medium.com/@ravi72munde/using-spark-streaming-with-dynamodb-d325b9a73c79] > # In DeltaStreamer, we need to add some form of KinesisSource that returns an > RDD with new data every time `fetchNewData` is called > [https://github.com/apache/incubator-hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/Source.java] > . DeltaStreamer itself does not use Spark Streaming APIs > # Internally, we have Avro, Json, Row sources that extract data in these > formats. > Open questions: > # Should this just be a KinesisSource inside Hudi that needs to be > configured differently, or do we need two sources: DynamoDBKinesisSource (that > does some DynamoDB Stream specific setup/assumptions) and a plain > KinesisSource. 
What's more valuable to do, if we have to pick one? > # For Kafka integration, we just reused the KafkaRDD in Spark Streaming > easily and avoided writing a lot of code by hand. Could we pull the same > thing off for Kinesis? (probably needs digging through Spark code) > # What's the format of the data for DynamoDB streams? > > > We should probably flesh these out before going ahead with implementation? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
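Point 4 above (a KinesisSource returning an RDD per `fetchNewData` call) can be outlined as Java-like pseudocode; everything except the Source/`fetchNewData` contract linked above is hypothetical:

```
// Pseudocode sketch of a possible KinesisSource for DeltaStreamer.
public class KinesisSource extends Source<JavaRDD<GenericRecord>> {
  @Override
  protected InputBatch<JavaRDD<GenericRecord>> fetchNewData(
      Option<String> lastCheckpoint, long sourceLimit) {
    // 1. Decode lastCheckpoint into per-shard sequence numbers.
    // 2. Pull up to sourceLimit records from Kinesis (GetRecords per shard).
    // 3. Parse payloads; for DynamoDB Streams this is the change record
    //    (e.g. the item's new image) rather than a raw application event.
    // 4. Return the parsed RDD plus the advanced per-shard sequence numbers,
    //    serialized as the new checkpoint string.
  }
}
```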
[jira] [Updated] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-310: - Labels: user-support-issues (was: ) > DynamoDB/Kinesis Change Capture using Delta Streamer > > > Key: HUDI-310 > URL: https://issues.apache.org/jira/browse/HUDI-310 > Project: Apache Hudi > Issue Type: New Feature > Components: DeltaStreamer >Reporter: Vinoth Chandar >Assignee: Suneel Marthi >Priority: Major > Labels: user-support-issues > > (Issue description identical to the HUDI-310 notice above.) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-318) Update Migration Guide to Include Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-318: - Labels: doc user-support-issues (was: doc) > Update Migration Guide to Include Delta Streamer > > > Key: HUDI-318 > URL: https://issues.apache.org/jira/browse/HUDI-318 > Project: Apache Hudi > Issue Type: Improvement > Components: Docs >Reporter: Yanjia Gary Li >Priority: Minor > Labels: doc, user-support-issues > > [http://hudi.apache.org/migration_guide.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-824) Register hudi-spark package with spark packages repo for easier usage of Hudi
[ https://issues.apache.org/jira/browse/HUDI-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-824: - Status: Open (was: New) > Register hudi-spark package with spark packages repo for easier usage of Hudi > - > > Key: HUDI-824 > URL: https://issues.apache.org/jira/browse/HUDI-824 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Reporter: Nishith Agarwal >Assignee: Vinoth Govindarajan >Priority: Minor > Labels: user-support-issues > > (Issue description identical to the HUDI-824 notice above.) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-360) Add github stale action workflow for issue management
[ https://issues.apache.org/jira/browse/HUDI-360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-360: - Labels: user-support-issues (was: ) > Add github stale action workflow for issue management > - > > Key: HUDI-360 > URL: https://issues.apache.org/jira/browse/HUDI-360 > Project: Apache Hudi > Issue Type: Improvement > Components: Usability >Reporter: Gurudatt Kulkarni >Assignee: Gurudatt Kulkarni >Priority: Major > Labels: user-support-issues > > Add a GitHub action for closing stale (90 days) issues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-352) The official documentation about project structure misses the hudi-timeline-service module
[ https://issues.apache.org/jira/browse/HUDI-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-352: - Labels: starter user-support-issues (was: starter) > The official documentation about project structure misses the > hudi-timeline-service module > -- > > Key: HUDI-352 > URL: https://issues.apache.org/jira/browse/HUDI-352 > Project: Apache Hudi > Issue Type: Improvement > Components: Docs >Reporter: vinoyang >Priority: Major > Labels: starter, user-support-issues > > The official documentation about project structure[1] misses the > hudi-timeline-service module; we should add it. > [1]: http://hudi.apache.org/contributing.html#code--project-structure -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-395) hudi does not support scheme s3n when writing to S3
[ https://issues.apache.org/jira/browse/HUDI-395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-395: - Labels: user-support-issues (was: bug-bash-0.6.0) > hudi does not support scheme s3n when writing to S3 > --- > > Key: HUDI-395 > URL: https://issues.apache.org/jira/browse/HUDI-395 > Project: Apache Hudi > Issue Type: Bug > Components: newbie, Spark Integration, Usability > Environment: spark-2.4.4-bin-hadoop2.7 >Reporter: rui feng >Assignee: sivabalan narayanan >Priority: Major > Labels: user-support-issues > > When I use Hudi to create a Hudi table and write to S3, I used the Maven > snippet below, which is recommended by [https://hudi.apache.org/s3_hoodie.html] > > org.apache.hudi > hudi-spark-bundle > 0.5.0-incubating > > > org.apache.hadoop > hadoop-aws > 2.7.3 > > > com.amazonaws > aws-java-sdk > 1.10.34 > > and added the below configuration: > sc.hadoopConfiguration.set("fs.defaultFS", "s3://niketest1") > sc.hadoopConfiguration.set("fs.s3.impl", > "org.apache.hadoop.fs.s3native.NativeS3FileSystem") > sc.hadoopConfiguration.set("fs.s3n.impl", > "org.apache.hadoop.fs.s3native.NativeS3FileSystem") > sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "xx") > sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "x") > sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xx") > sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "x") > > My Spark version is spark-2.4.4-bin-hadoop2.7, and when I run the following: > {color:#FF}df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Overwrite).save(hudiTablePath).{color} > val hudiOptions = Map[String,String]( > HoodieWriteConfig.TABLE_NAME -> "hudi12", > DataSourceWriteOptions.OPERATION_OPT_KEY -> > DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, > DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "rider", > DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> > DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL) > val hudiTablePath = 
"s3://niketest1/hudi_test/hudi12" > the exception occurs: > {color:#FF}java.lang.IllegalArgumentException: > BlockAlignedAvroParquetWriter does not support scheme s3n{color} > at > org.apache.hudi.common.io.storage.HoodieWrapperFileSystem.getHoodieScheme(HoodieWrapperFileSystem.java:109) > at > org.apache.hudi.common.io.storage.HoodieWrapperFileSystem.convertToHoodiePath(HoodieWrapperFileSystem.java:85) > at > org.apache.hudi.io.storage.HoodieParquetWriter.(HoodieParquetWriter.java:57) > at > org.apache.hudi.io.storage.HoodieStorageWriterFactory.newParquetStorageWriter(HoodieStorageWriterFactory.java:60) > at > org.apache.hudi.io.storage.HoodieStorageWriterFactory.getStorageWriter(HoodieStorageWriterFactory.java:44) > at org.apache.hudi.io.HoodieCreateHandle.(HoodieCreateHandle.java:70) > at > org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:137) > at > org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:125) > at > org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:38) > at > org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:120) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > > > Can anyone tell me what causes this exception? I tried to use > org.apache.hadoop.fs.s3.S3FileSystem to replace > org.apache.hadoop.fs.s3native.NativeS3FileSystem for the conf "fs.s3.impl", > but another exception occurred, and it seems org.apache.hadoop.fs.s3.S3FileSystem > fits Hadoop 2.6. > > Thanks in advance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-505) Add unified javadoc to the Hudi Website
[ https://issues.apache.org/jira/browse/HUDI-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-505: - Labels: user-support-issues (was: ) > Add unified javadoc to the Hudi Website > --- > > Key: HUDI-505 > URL: https://issues.apache.org/jira/browse/HUDI-505 > Project: Apache Hudi > Issue Type: Sub-task > Components: Docs >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: user-support-issues > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-465) Make Hive Sync via Spark painless
[ https://issues.apache.org/jira/browse/HUDI-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-465: - Labels: help-wanted starter user-support-issues (was: help-wanted starter) > Make Hive Sync via Spark painless > - > > Key: HUDI-465 > URL: https://issues.apache.org/jira/browse/HUDI-465 > Project: Apache Hudi > Issue Type: New Feature > Components: Hive Integration, Spark Integration, Usability >Reporter: Vinoth Chandar >Assignee: liwei >Priority: Major > Labels: help-wanted, starter, user-support-issues > > Currently, we require many configs to be passed in for the Hive sync. This > has to be simplified, and the experience should be close to how regular > spark.write.parquet registers into Hive. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-396) Provide documentation describing how to use the test suite
[ https://issues.apache.org/jira/browse/HUDI-396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-396: - Labels: user-support-issues (was: ) > Provide documentation describing how to use the test suite > -- > > Key: HUDI-396 > URL: https://issues.apache.org/jira/browse/HUDI-396 > Project: Apache Hudi > Issue Type: Sub-task > Components: Docs >Reporter: vinoyang >Assignee: wangxianghu >Priority: Major > Labels: user-support-issues > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-632) Update documentation (docker_demo) to mention both commit and deltacommit files
[ https://issues.apache.org/jira/browse/HUDI-632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan resolved HUDI-632. -- Fix Version/s: 0.7.0 Resolution: Fixed > Update documentation (docker_demo) to mention both commit and deltacommit > files > > > Key: HUDI-632 > URL: https://issues.apache.org/jira/browse/HUDI-632 > Project: Apache Hudi > Issue Type: Improvement > Components: Docs >Reporter: Vikrant Goel >Priority: Minor > Labels: pull-request-available > Fix For: 0.7.0 > > Time Spent: 20m > Remaining Estimate: 0h > > In the demo, we could have commit or deltacommit files created depending on > the type of table. Updating it will help avoid potential confusion. > [https://hudi.incubator.apache.org/docs/docker_demo.html#step-2-incrementally-ingest-data-from-kafka-topic] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-648) Implement error log/table for Datasource/DeltaStreamer/WriteClient/Compaction writes
[ https://issues.apache.org/jira/browse/HUDI-648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-648: - Labels: user-support-issues (was: ) > Implement error log/table for Datasource/DeltaStreamer/WriteClient/Compaction > writes > > > Key: HUDI-648 > URL: https://issues.apache.org/jira/browse/HUDI-648 > Project: Apache Hudi > Issue Type: New Feature > Components: DeltaStreamer, Spark Integration, Writer Core >Reporter: Vinoth Chandar >Priority: Major > Labels: user-support-issues > > We would like a way to hand the erroring records from writing or compaction > back to the users, in a separate table or log. This needs to work generically > across all the different writer paths. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-651) Incremental Query on Hive via Spark SQL does not return expected results
[ https://issues.apache.org/jira/browse/HUDI-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-651: - Labels: pull-request-available user-support-issues (was: pull-request-available) > Incremental Query on Hive via Spark SQL does not return expected results > > > Key: HUDI-651 > URL: https://issues.apache.org/jira/browse/HUDI-651 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Reporter: Vinoth Chandar >Assignee: sivabalan narayanan >Priority: Blocker > Labels: pull-request-available, user-support-issues > Fix For: 0.8.0 > > > Using the docker demo, I added two delta commits to a MOR table and was > hoping to incrementally consume them, like Hive QL. Something is amiss: > {code} > scala> > spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.start.timestamp","20200302210147") > scala> > spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.mode","INCREMENTAL") > scala> spark.sql("select distinct `_hoodie_commit_time` from > stock_ticks_mor_rt").show(100, false) > +---+ > |_hoodie_commit_time| > +---+ > |20200302210010 | > |20200302210147 | > +---+ > scala> sc.setLogLevel("INFO") > scala> spark.sql("select distinct `_hoodie_commit_time` from > stock_ticks_mor_rt").show(100, false) > 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: > spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current > version of codegened fast hashmap does not support this aggregate. > 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: > spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current > version of codegened fast hashmap does not support this aggregate. 
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44 stored as > values in memory (estimated size 292.3 KB, free 365.3 MB) > 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44_piece0 stored > as bytes in memory (estimated size 25.4 KB, free 365.3 MB) > 20/03/02 21:15:37 INFO storage.BlockManagerInfo: Added broadcast_44_piece0 in > memory on adhoc-1:45623 (size: 25.4 KB, free: 366.2 MB) > 20/03/02 21:15:37 INFO spark.SparkContext: Created broadcast 44 from > 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie > metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor > 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading > HoodieTableMetaClient from > hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor > 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: > [hdfs://namenode:8020], Config:[Configuration: core-default.xml, > core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, > yarn-site.xml, hdfs-default.xml, hdfs-site.xml, > org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, > file:/etc/hadoop/hive-site.xml], FileSystem: > [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root > (auth:SIMPLE)]]] > 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from > hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties > 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of > type MERGE_ON_READ(version=1) from > hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor > 20/03/02 21:15:37 INFO mapred.FileInputFormat: Total input paths to process : > 1 > 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 > groups > 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants > [[20200302210010__clean__COMPLETED], > [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], > 
[20200302210147__deltacommit__COMPLETED]] > 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for > partition :2018/08/31, #FileGroups=1 > 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: > NumFiles=1, FileGroupsCreationTime=0, StoreTimeTaken=0 > 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Total paths to > process after hoodie filter 1 > 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie > metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor > 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading > HoodieTableMetaClient from > hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor > 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: > [hdfs://namenode:8020], Config:[Configuration: core-default.xml, > core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, > yarn-site.xml, hdfs-default.xml, hdfs-site.xml, >
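For reference alongside the spark.sparkContext.hadoopConfiguration settings in the report above, the same incremental pull can be driven from a Hive session. A sketch based on Hudi's documented hoodie.&lt;table&gt;.consume.* session properties (the table name and timestamps mirror the docker demo values from the report; exact behavior depends on the Hudi version):

```
-- Hive session settings for incremental pull on the stock_ticks_mor_rt table
set hoodie.stock_ticks_mor_rt.consume.mode=INCREMENTAL;
set hoodie.stock_ticks_mor_rt.consume.start.timestamp=20200302210010;
set hoodie.stock_ticks_mor_rt.consume.max.commits=3;
select distinct `_hoodie_commit_time` from stock_ticks_mor_rt;
```

The bug tracked here is precisely that Spark SQL over Hive does not honor these settings the way Hive QL does.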
[jira] [Updated] (HUDI-691) hoodie.*.consume.* should be whitelisted in hive-site.xml
[ https://issues.apache.org/jira/browse/HUDI-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-691: - Labels: user-support-issues (was: ) > hoodie.*.consume.* should be whitelisted in hive-site.xml > --- > > Key: HUDI-691 > URL: https://issues.apache.org/jira/browse/HUDI-691 > Project: Apache Hudi > Issue Type: Task > Components: Docs, newbie >Reporter: Bhavani Sudha >Assignee: GarudaGuo >Priority: Minor > Labels: user-support-issues > Fix For: 0.8.0 > > > More details in this GH issue - > https://github.com/apache/incubator-hudi/issues/910 -- This message was sent by Atlassian Jira (v8.3.4#803005)
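The documentation change this issue asks for is typically a whitelist entry in hive-site.xml so that Hive sessions are allowed to set hoodie.*.consume.* properties at all. A sketch, assuming Hive's SQL-standard authorization whitelist mechanism is in effect (the regex form is my assumption, not quoted from the issue):

```
<!-- hive-site.xml: allow Hudi consume properties to be set per session -->
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>hoodie\..*\.consume\..*</value>
</property>
```

Without such an entry, `set hoodie.<table>.consume.mode=...` fails with a "Cannot modify ... at runtime" style error on locked-down Hive deployments.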
[jira] [Resolved] (HUDI-653) Add JMX Report Config to Doc
[ https://issues.apache.org/jira/browse/HUDI-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan resolved HUDI-653. -- Fix Version/s: 0.7.0 Resolution: Fixed > Add JMX Report Config to Doc > > > Key: HUDI-653 > URL: https://issues.apache.org/jira/browse/HUDI-653 > Project: Apache Hudi > Issue Type: Improvement > Components: Docs >Reporter: Forward Xu >Assignee: Forward Xu >Priority: Major > Labels: pull-request-available > Fix For: 0.7.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Add Jmx Report Config to Doc -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-653) Add JMX Report Config to Doc
[ https://issues.apache.org/jira/browse/HUDI-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-653: - Status: In Progress (was: Open) > Add JMX Report Config to Doc > > > Key: HUDI-653 > URL: https://issues.apache.org/jira/browse/HUDI-653 > Project: Apache Hudi > Issue Type: Improvement > Components: Docs >Reporter: Forward Xu >Assignee: Forward Xu >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Add Jmx Report Config to Doc -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-653) Add JMX Report Config to Doc
[ https://issues.apache.org/jira/browse/HUDI-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-653: - Status: Open (was: New) > Add JMX Report Config to Doc > > > Key: HUDI-653 > URL: https://issues.apache.org/jira/browse/HUDI-653 > Project: Apache Hudi > Issue Type: Improvement > Components: Docs >Reporter: Forward Xu >Assignee: Forward Xu >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Add Jmx Report Config to Doc -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-718) java.lang.ClassCastException during upsert
[ https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272380#comment-17272380 ] sivabalan narayanan commented on HUDI-718: -- [~afilipchik]: do you still face this issue? > java.lang.ClassCastException during upsert > -- > > Key: HUDI-718 > URL: https://issues.apache.org/jira/browse/HUDI-718 > Project: Apache Hudi > Issue Type: Bug > Components: DeltaStreamer >Reporter: Alexander Filipchik >Assignee: lamber-ken >Priority: Major > Labels: user-support-issues > Fix For: 0.8.0 > > Attachments: image-2020-03-21-16-49-28-905.png > > > Dataset was created using hudi 0.5 and now trying to migrate it to the latest > master. The table is written using SqlTransformer. Exception: > > Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge > old record into new file for key bla.bla from old file > gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet > to new file > gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423) > at > org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37) > at > org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 
3 more > Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be > cast to org.apache.avro.generic.GenericFixed > at > org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336) > at > org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275) > at > org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191) > at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) > at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299) > at > org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103) > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242) > ... 8 more -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-718) java.lang.ClassCastException during upsert
[ https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-718: - Labels: user-support-issues (was: bug-bash-0.6.0) > java.lang.ClassCastException during upsert > -- > > Key: HUDI-718 > URL: https://issues.apache.org/jira/browse/HUDI-718 > Project: Apache Hudi > Issue Type: Bug > Components: DeltaStreamer >Reporter: Alexander Filipchik >Assignee: lamber-ken >Priority: Major > Labels: user-support-issues > Fix For: 0.8.0 > > Attachments: image-2020-03-21-16-49-28-905.png > > > Dataset was created using hudi 0.5 and now trying to migrate it to the latest > master. The table is written using SqlTransformer. Exception: > > Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge > old record into new file for key bla.bla from old file > gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet > to new file > gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423) > at > org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37) > at > org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 
3 more > Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be > cast to org.apache.avro.generic.GenericFixed > at > org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336) > at > org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275) > at > org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191) > at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) > at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299) > at > org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103) > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242) > ... 8 more -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-735) Improve DeltaStreamer error message on case mismatch of command-line arguments.
[ https://issues.apache.org/jira/browse/HUDI-735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-735: - Labels: user-support-issues (was: ) > Improve DeltaStreamer error message on case mismatch of command-line > arguments. > > > Key: HUDI-735 > URL: https://issues.apache.org/jira/browse/HUDI-735 > Project: Apache Hudi > Issue Type: Improvement > Components: Code Cleanup, DeltaStreamer, Usability >Reporter: Vinoth Chandar >Assignee: Nicholas Jiang >Priority: Major > Labels: user-support-issues > > Team, > When following the blog "Change Capture Using AWS Database Migration > Service and Hudi" with my own data set, the initial load works perfectly. > When issuing the command with the DMS CDC files on S3, I get the following > error: > {code} > 20/03/24 17:56:28 ERROR HoodieDeltaStreamer: Got error running delta sync > once. Shutting down > org.apache.hudi.exception.HoodieException: Please provide a valid schema > provider class! at > org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226) > {code} > I tried using the --schemaprovider-class > org.apache.hudi.utilities.schema.FilebasedSchemaProvider.Source and provided > the schema. The error does not occur but there are no writes to Hudi. > I am not performing any transformations (other than the DMS transform) and > using the default record key strategy. > If the team has any pointers, please let me know. > Thank you! > --- > Thank you Vinoth. I was able to find the issue. All my column names were in > upper case. I switched column names and table names to lower case and > it works perfectly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
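For context on the --schemaprovider-class attempt above: a FilebasedSchemaProvider setup usually also needs the source/target schema file properties. A sketch using Hudi's standard property names for this provider; the bucket and file paths are placeholders, not values from the report:

```
# properties file passed to HoodieDeltaStreamer via --props (paths are placeholders)
hoodie.deltastreamer.schemaprovider.source.schema.file=s3://my-bucket/schemas/source.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=s3://my-bucket/schemas/target.avsc
```

The class is then selected with `--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider`; the "Please provide a valid schema provider class!" error above fires when no provider (or an unloadable one) is configured.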
[jira] [Updated] (HUDI-767) Support transformation when export to Hudi
[ https://issues.apache.org/jira/browse/HUDI-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-767: - Labels: user-support-issues (was: ) > Support transformation when export to Hudi > -- > > Key: HUDI-767 > URL: https://issues.apache.org/jira/browse/HUDI-767 > Project: Apache Hudi > Issue Type: Improvement > Components: Utilities >Reporter: Raymond Xu >Priority: Major > Labels: user-support-issues > Fix For: 0.8.0 > > > Main logic described in > https://github.com/apache/incubator-hudi/issues/1480#issuecomment-608529410 > In HoodieSnapshotExporter, we could extend the feature to include > transformation when --output-format hudi, using a custom Transformer -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-776) Document community support triage process
[ https://issues.apache.org/jira/browse/HUDI-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-776: - Labels: user-support-issues (was: ) > Document community support triage process > -- > > Key: HUDI-776 > URL: https://issues.apache.org/jira/browse/HUDI-776 > Project: Apache Hudi > Issue Type: Task > Components: Docs, Release Administrative >Reporter: Vinoth Chandar >Priority: Major > Labels: user-support-issues > > Per thread > https://lists.apache.org/thread.html/r0de5b576ea3db07e663d76d72196404b65f1624c298a6b335229c05d%40%3Cdev.hudi.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-788) Hudi writer and Hudi Reader split
[ https://issues.apache.org/jira/browse/HUDI-788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272377#comment-17272377 ] sivabalan narayanan commented on HUDI-788: -- [~hainanzhongjian]: do you mind fixing the translation? > Hudi writer and Hudi Reader split > - > > Key: HUDI-788 > URL: https://issues.apache.org/jira/browse/HUDI-788 > Project: Apache Hudi > Issue Type: Improvement > Components: Hive Integration >Reporter: wangmeng >Priority: Minor > > For example: > * Many companies build their own clusters on CDH, but the Hive version bundled with CDH is still hive-exec-1.1.0, and upgrading Hive to 2.* in a production environment is troublesome, so production environments at many companies generally run hive-exec-1.1.0. > * Hudi 0.5.* only supports Hive 2.3 and above. As a result, users hit problems when setting hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat. > * The idea: can the logic that writes Hudi data to DFS and integrates with Hive be split from the logic that reads Hudi data through Hive? That way Hudi 0.5.* can still integrate with Hive 2.3.*, while reading through Hive only requires adding the hoodie-hadoop-mr-bundle matching the Hive version, whether hive-exec-1.1.0 or hive-exec-2.3.*. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-801) Add a way to postprocess schema after it is loaded from the schema provider
[ https://issues.apache.org/jira/browse/HUDI-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan resolved HUDI-801. -- Fix Version/s: (was: 0.8.0) 0.7.0 Resolution: Fixed > Add a way to postprocess schema after it is loaded from the schema provider > --- > > Key: HUDI-801 > URL: https://issues.apache.org/jira/browse/HUDI-801 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Alexander Filipchik >Assignee: Alexander Filipchik >Priority: Major > Labels: pull-request-available > Fix For: 0.7.0 > > > Sometimes it is needed to postprocess schemas after they are fetched from the > external sources. Some examples of postprocessing: > * make sure all the defaults are set correctly, and update the schema if not. > * insert marker columns into records with no fields (not writable as Parquet) > * ... > Would be great to have a way to plug in custom post processors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-801) Add a way to postprocess schema after it is loaded from the schema provider
[ https://issues.apache.org/jira/browse/HUDI-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-801: Assignee: Alexander Filipchik > Add a way to postprocess schema after it is loaded from the schema provider > --- > > Key: HUDI-801 > URL: https://issues.apache.org/jira/browse/HUDI-801 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Alexander Filipchik >Assignee: Alexander Filipchik >Priority: Major > Labels: pull-request-available > Fix For: 0.8.0 > > > Sometimes it is needed to postprocess schemas after they are fetched from the > external sources. Some examples of postprocessing: > * make sure all the defaults are set correctly, and update the schema if not. > * insert marker columns into records with no fields (not writable as Parquet) > * ... > Would be great to have a way to plug in custom post processors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-801) Add a way to postprocess schema after it is loaded from the schema provider
[ https://issues.apache.org/jira/browse/HUDI-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-801: - Status: In Progress (was: Open) > Add a way to postprocess schema after it is loaded from the schema provider > --- > > Key: HUDI-801 > URL: https://issues.apache.org/jira/browse/HUDI-801 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Alexander Filipchik >Priority: Major > Labels: pull-request-available > Fix For: 0.8.0 > > > Sometimes it is needed to postprocess schemas after they are fetched from the > external sources. Some examples of postprocessing: > * make sure all the defaults are set correctly, and update the schema if not. > * insert marker columns into records with no fields (not writable as Parquet) > * ... > Would be great to have a way to plug in custom post processors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-801) Add a way to postprocess schema after it is loaded from the schema provider
[ https://issues.apache.org/jira/browse/HUDI-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-801: - Status: Open (was: New) > Add a way to postprocess schema after it is loaded from the schema provider > --- > > Key: HUDI-801 > URL: https://issues.apache.org/jira/browse/HUDI-801 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Alexander Filipchik >Priority: Major > Labels: pull-request-available > Fix For: 0.8.0 > > > Sometimes it is needed to postprocess schemas after they are fetched from the > external sources. Some examples of postprocessing: > * make sure all the defaults are set correctly, and update the schema if not. > * insert marker columns into records with no fields (not writable as Parquet) > * ... > Would be great to have a way to plug in custom post processors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-824) Register hudi-spark package with spark packages repo for easier usage of Hudi
[ https://issues.apache.org/jira/browse/HUDI-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-824: - Labels: user-support-issues (was: ) > Register hudi-spark package with spark packages repo for easier usage of Hudi > - > > Key: HUDI-824 > URL: https://issues.apache.org/jira/browse/HUDI-824 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Reporter: Nishith Agarwal >Assignee: Vinoth Govindarajan >Priority: Minor > Labels: user-support-issues > > At the moment, to be able to use Hudi with spark, users have to do the > following: > > {{spark-2.4.4-bin-hadoop2.7/bin/spark-shell \ > --jars `ls > packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` > \ > --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}} > > Ideally, we want to be able to use Hudi as follows: > > {{spark-2.4.4-bin-hadoop2.7/bin/spark-shell \ --packages > org.apache.hudi:hudi-spark-bundle: \ > --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-829) Efficiently reading hudi tables through spark-shell
[ https://issues.apache.org/jira/browse/HUDI-829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-829: - Labels: user-support-issues (was: ) > Efficiently reading hudi tables through spark-shell > --- > > Key: HUDI-829 > URL: https://issues.apache.org/jira/browse/HUDI-829 > Project: Apache Hudi > Issue Type: Task > Components: Spark Integration >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Labels: user-support-issues > > [~uditme] Created this ticket to track some discussion on read/query path of > spark with Hudi tables. > My understanding is that when you read Hudi tables through spark-shell, some > of your queries are slower due to some sequential activity performed by spark > when interacting with Hudi tables (even with > spark.sql.hive.convertMetastoreParquet which can give you the same data > reading speed and all the vectorization benefits). Is this slowness observed > during spark query planning ? Can you please elaborate on this ? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-825) Write a small blog on how to use hudi-spark with pyspark
[ https://issues.apache.org/jira/browse/HUDI-825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-825: - Labels: user-support-issues (was: ) > Write a small blog on how to use hudi-spark with pyspark > > > Key: HUDI-825 > URL: https://issues.apache.org/jira/browse/HUDI-825 > Project: Apache Hudi > Issue Type: Task > Components: Docs >Reporter: Nishith Agarwal >Assignee: Vinoth Govindarajan >Priority: Major > Labels: user-support-issues > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-837) Fix AvroKafkaSource to use the latest schema for reading
[ https://issues.apache.org/jira/browse/HUDI-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-837: - Labels: pull-request-available user-support-issues (was: bug-bash-0.6.0 pull-request-available) > Fix AvroKafkaSource to use the latest schema for reading > > > Key: HUDI-837 > URL: https://issues.apache.org/jira/browse/HUDI-837 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Pratyaksh Sharma >Assignee: Pratyaksh Sharma >Priority: Major > Labels: pull-request-available, user-support-issues > Fix For: 0.8.0 > > > Currently we specify KafkaAvroDeserializer as the value for > value.deserializer in AvroKafkaSource. This implies the published record is > read using the same schema with which it was written, even though the schema > evolved in between. As a result, messages in the incoming batch can have > different schemas. This has to be handled at the time of actually writing > records in Parquet. > This Jira aims at providing an option to read all the messages with the same > schema by implementing a new custom deserializer class. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-873) kafka connector support hudi sink
[ https://issues.apache.org/jira/browse/HUDI-873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-873: - Labels: user-support-issues (was: ) > kafka connector support hudi sink > -- > > Key: HUDI-873 > URL: https://issues.apache.org/jira/browse/HUDI-873 > Project: Apache Hudi > Issue Type: Improvement > Components: Utilities >Reporter: liwei >Assignee: liwei >Priority: Major > Labels: user-support-issues > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-851) Add Documentation on partitioning data with examples and details on how to sync to Hive
[ https://issues.apache.org/jira/browse/HUDI-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-851: - Labels: user-support-issues (was: ) > Add Documentation on partitioning data with examples and details on how to > sync to Hive > --- > > Key: HUDI-851 > URL: https://issues.apache.org/jira/browse/HUDI-851 > Project: Apache Hudi > Issue Type: Improvement > Components: Docs, docs-chinese >Reporter: Bhavani Sudha >Assignee: Bhavani Sudha >Priority: Minor > Labels: user-support-issues > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-874) Schema evolution does not work with AWS Glue catalog
[ https://issues.apache.org/jira/browse/HUDI-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-874: - Labels: user-support-issues (was: ) > Schema evolution does not work with AWS Glue catalog > > > Key: HUDI-874 > URL: https://issues.apache.org/jira/browse/HUDI-874 > Project: Apache Hudi > Issue Type: Improvement > Components: Hive Integration >Reporter: Udit Mehrotra >Priority: Major > Labels: user-support-issues > > This issue has been discussed here > [https://github.com/apache/incubator-hudi/issues/1581] and at other places as > well. Glue catalog currently does not support *cascade* for *ALTER TABLE* > statements. As a result, features like adding new columns to an existing table > do not work with the Glue catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-874) Schema evolution does not work with AWS Glue catalog
[ https://issues.apache.org/jira/browse/HUDI-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272368#comment-17272368 ] sivabalan narayanan commented on HUDI-874: -- [~uditme]: can you please look into this ticket when you can. > Schema evolution does not work with AWS Glue catalog > > > Key: HUDI-874 > URL: https://issues.apache.org/jira/browse/HUDI-874 > Project: Apache Hudi > Issue Type: Improvement > Components: Hive Integration >Reporter: Udit Mehrotra >Priority: Major > > This issue has been discussed here > [https://github.com/apache/incubator-hudi/issues/1581] and at other places as > well. Glue catalog currently does not support *cascade* for *ALTER TABLE* > statements. As a result, features like adding new columns to an existing table > do not work with the Glue catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-893) Add spark datasource V2 reader support for Hudi tables
[ https://issues.apache.org/jira/browse/HUDI-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-893:
-------------------------------------
    Labels: user-support-issues  (was: )

> Add spark datasource V2 reader support for Hudi tables
>
> Key: HUDI-893
> URL: https://issues.apache.org/jira/browse/HUDI-893
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Spark Integration
> Reporter: Nishith Agarwal
> Assignee: Nan Zhu
> Priority: Major
> Labels: user-support-issues
[jira] [Updated] (HUDI-914) support different target data clusters
[ https://issues.apache.org/jira/browse/HUDI-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-914:
-------------------------------------
    Labels: user-support-issues  (was: )

> support different target data clusters
>
> Key: HUDI-914
> URL: https://issues.apache.org/jira/browse/HUDI-914
> Project: Apache Hudi
> Issue Type: New Feature
> Components: DeltaStreamer
> Reporter: liujinhui
> Assignee: liujinhui
> Priority: Major
> Labels: user-support-issues
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Currently hudi-DeltaStreamer does not support writing to different target clusters. The specific scenario is as follows: generally, Hudi tasks run on an independent cluster, and writing data to a target cluster relies on that cluster's core-site.xml and hdfs-site.xml. Sometimes the cluster running the Hudi task does not have the target cluster's core-site.xml and hdfs-site.xml. Although data can be written by specifying the target cluster's NameNode IP address, this loses HDFS high availability. So I plan to take the contents of the target cluster's core-site.xml and hdfs-site.xml files as configuration items and configure them in Hudi's dfs-source.properties or kafka-source.properties file.
> Is there a better way to solve this problem?
[jira] [Updated] (HUDI-1007) When earliestOffsets is greater than checkpoint, Hudi will not be able to successfully consume data
[ https://issues.apache.org/jira/browse/HUDI-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-1007:
--------------------------------------
    Labels: user-support-issues  (was: )

> When earliestOffsets is greater than checkpoint, Hudi will not be able to successfully consume data
>
> Key: HUDI-1007
> URL: https://issues.apache.org/jira/browse/HUDI-1007
> Project: Apache Hudi
> Issue Type: Bug
> Components: DeltaStreamer
> Reporter: liujinhui
> Assignee: liujinhui
> Priority: Major
> Labels: user-support-issues
> Fix For: 0.8.0
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Using DeltaStreamer to consume Kafka: when earliestOffsets is greater than the checkpoint, Hudi will not be able to successfully consume data.
> org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen#checkupValidOffsets:
> {code:java}
> boolean checkpointOffsetReseter = checkpointOffsets.entrySet().stream()
>     .anyMatch(offset -> offset.getValue() < earliestOffsets.get(offset.getKey()));
> return checkpointOffsetReseter ? earliestOffsets : checkpointOffsets;
> {code}
> Kafka data is continuously generated, which means that some data will continue to expire. When earliestOffsets is greater than the checkpoint, earliestOffsets will be taken. But at that moment, some data has already expired, so in the end consumption fails, and the process becomes an endless cycle. I can understand that this design may be meant to avoid data loss, but it leads to this situation. I want to fix this problem and would like to hear your opinion.
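The check quoted in the ticket can be sketched as a self-contained snippet. This is a simplification for illustration, not the actual Hudi class: partition keys are plain Strings here rather than Kafka TopicPartitions, and the method is standalone. It shows the behavior the reporter describes: once any checkpointed offset falls behind the earliest retained offset, the whole checkpoint is discarded in favor of earliestOffsets.

```java
import java.util.HashMap;
import java.util.Map;

public class OffsetCheckSketch {
    // If any checkpointed offset is older than the earliest offset Kafka
    // still retains, reset the entire checkpoint to earliestOffsets.
    static Map<String, Long> checkupValidOffsets(Map<String, Long> checkpointOffsets,
                                                 Map<String, Long> earliestOffsets) {
        boolean checkpointOffsetReset = checkpointOffsets.entrySet().stream()
            .anyMatch(e -> e.getValue() < earliestOffsets.get(e.getKey()));
        return checkpointOffsetReset ? earliestOffsets : checkpointOffsets;
    }

    public static void main(String[] args) {
        Map<String, Long> checkpoint = new HashMap<>();
        checkpoint.put("topic-0", 100L);
        Map<String, Long> earliest = new HashMap<>();
        earliest.put("topic-0", 150L); // data before offset 150 has expired

        // Checkpoint lags the earliest retained offset, so it is replaced.
        System.out.println(checkupValidOffsets(checkpoint, earliest).get("topic-0"));
    }
}
```

The endless cycle arises because, under continuous retention-driven expiry, the reset target (earliestOffsets) can itself be stale again by the time the next fetch runs.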
[GitHub] [hudi] kimberlyamandalu commented on issue #1737: [SUPPORT]spark streaming create small parquet files
kimberlyamandalu commented on issue #1737: URL: https://github.com/apache/hudi/issues/1737#issuecomment-767786456 @nsivabalan I am also seeing the same behavior for my workload. Compaction seems to be occurring for my MOR table as per hoodie.compact.inline.max.delta.commits. However, cleaner does not seem to get triggered even after setting hoodie.cleaner.commits.retained. I don't see any cleans requested in the .hoodie folder. Can we follow up on this? Is anyone still observing the same problem? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] zhedoubushishi opened a new pull request #2493: [WIP] Change another way to convert Path with Scheme
zhedoubushishi opened a new pull request #2493:
URL: https://github.com/apache/hudi/pull/2493

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contributing.html before opening a pull request.*

## What is the purpose of the pull request

*(For example: This pull request adds quick-start document.)*

## Brief change log

*(for example:)*
- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request

*(Please pick either of the following options)*

This pull request is a trivial rework / code cleanup without any test coverage.

*(or)*

This pull request is already covered by existing tests, such as *(please describe tests)*.

(or)

This change added tests and can be verified as follows:

*(example:)*
- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*

## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] codecov-io commented on pull request #2492: [MINOR]Fix NPE when using HoodieFlinkStreamer with multi parallelism
codecov-io commented on pull request #2492:
URL: https://github.com/apache/hudi/pull/2492#issuecomment-767752039

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2492?src=pr=h1) Report
> Merging [#2492](https://codecov.io/gh/apache/hudi/pull/2492?src=pr=desc) (2e615c3) into [master](https://codecov.io/gh/apache/hudi/commit/c4afd179c1983a382b8a5197d800b0f5dba254de?el=desc) (c4afd17) will **increase** coverage by `19.29%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2492/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2492?src=pr=tree)

```diff
@@             Coverage Diff              @@
##           master    #2492       +/-   ##
===========================================
+ Coverage   50.18%   69.48%   +19.29%
+ Complexity   3050      358     -2692
===========================================
  Files         419       53      -366
  Lines       18931     1930    -17001
  Branches     1948      230     -1718
===========================================
- Hits         9500     1341     -8159
+ Misses       8656      456     -8200
+ Partials      775      133      -642
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `69.48% <ø> (+0.05%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2492?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [.../main/java/org/apache/hudi/util/AvroConvertor.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS91dGlsL0F2cm9Db252ZXJ0b3IuamF2YQ==) | | | |
| [...he/hudi/common/table/log/block/HoodieLogBlock.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9ibG9jay9Ib29kaWVMb2dCbG9jay5qYXZh) | | | |
| [...main/java/org/apache/hudi/HoodieFlinkStreamer.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9Ib29kaWVGbGlua1N0cmVhbWVyLmphdmE=) | | | |
| [...in/java/org/apache/hudi/common/model/BaseFile.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0Jhc2VGaWxlLmphdmE=) | | | |
| [...e/hudi/common/engine/HoodieLocalEngineContext.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2VuZ2luZS9Ib29kaWVMb2NhbEVuZ2luZUNvbnRleHQuamF2YQ==) | | | |
| [.../org/apache/hudi/exception/HoodieKeyException.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUtleUV4Y2VwdGlvbi5qYXZh) | | | |
| [...di/common/table/timeline/HoodieActiveTimeline.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL0hvb2RpZUFjdGl2ZVRpbWVsaW5lLmphdmE=) | | | |
| [...ache/hudi/hadoop/utils/HoodieInputFormatUtils.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3V0aWxzL0hvb2RpZUlucHV0Rm9ybWF0VXRpbHMuamF2YQ==) | | | |
| [...i/hive/SlashEncodedDayPartitionValueExtractor.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvU2xhc2hFbmNvZGVkRGF5UGFydGl0aW9uVmFsdWVFeHRyYWN0b3IuamF2YQ==) | | | |
| [.../apache/hudi/timeline/service/TimelineService.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvVGltZWxpbmVTZXJ2aWNlLmphdmE=) | | | |
| ... and [356 more](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree-more) | |
[GitHub] [hudi] vinothchandar commented on issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Querys
vinothchandar commented on issue #2013:
URL: https://github.com/apache/hudi/issues/2013#issuecomment-767720824

I think AWS has to support/recompile against their Spark version. cc @umehrot2. For now, can you test using Apache Spark?
[GitHub] [hudi] fripple commented on issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Querys
fripple commented on issue #2013:
URL: https://github.com/apache/hudi/issues/2013#issuecomment-767706110

Yes, I'm using spark as provided by AWS. Is there any way to make this work or am I out of luck until AWS EMR supports hudi 0.7?
[GitHub] [hudi] vinothchandar commented on issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Querys
vinothchandar commented on issue #2013:
URL: https://github.com/apache/hudi/issues/2013#issuecomment-767703987

This error seems to be due to using the aws spark distro? This change would work with any table written using previous versions.
[GitHub] [hudi] vinothchandar merged pull request #2491: [MINOR] Update doap with 0.7.0 release
vinothchandar merged pull request #2491:
URL: https://github.com/apache/hudi/pull/2491
[hudi] branch master updated: [MINOR] Update doap with 0.7.0 release (#2491)
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new c8ee40f  [MINOR] Update doap with 0.7.0 release (#2491)

c8ee40f is described below

commit c8ee40f8ae34607072a27d4e7ccb21fc4df13ca1
Author:     vinoth chandar
AuthorDate: Tue Jan 26 09:28:22 2021 -0800

    [MINOR] Update doap with 0.7.0 release (#2491)

---
 doap_HUDI.rdf | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/doap_HUDI.rdf b/doap_HUDI.rdf
index 77db135..06d5128 100644
--- a/doap_HUDI.rdf
+++ b/doap_HUDI.rdf
@@ -61,6 +61,11 @@
         2020-08-22
         0.6.0
+
+        Apache Hudi 0.7.0
+        2021-01-25
+        0.7.0
+
[jira] [Commented] (HUDI-1288) DeltaSync:writeToSink fails with Unknown datum type org.apache.avro.JsonProperties$Null
[ https://issues.apache.org/jira/browse/HUDI-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272267#comment-17272267 ]

Vinoth Chandar commented on HUDI-1288:
--------------------------------------
https://cwiki.apache.org/confluence/display/HUDI/Release+Management talks about this in more detail. We are not planning on doing backports; rather, we want to make rolling forward to a newer release much easier/smoother.

> DeltaSync:writeToSink fails with Unknown datum type org.apache.avro.JsonProperties$Null
>
> Key: HUDI-1288
> URL: https://issues.apache.org/jira/browse/HUDI-1288
> Project: Apache Hudi
> Issue Type: Bug
> Components: DeltaStreamer
> Reporter: Michal Swiatowy
> Priority: Major
> Labels: user-support-issues
>
> After updating to Hudi version 0.5.3 (prev. 0.5.2-incubating) I ran into the following error message on write to HDFS:
> {code:java}
> 2020-09-18 12:54:38,651 [Driver] INFO HoodieTableMetaClient:initTableAndGetMetaClient:379 - Finished initializing Table of type MERGE_ON_READ from /master_data/6FQS/hudi_test/S_INCOMINGMESSAGEDETAIL_CDC
> 2020-09-18 12:54:38,663 [Driver] INFO DeltaSync:setupWriteClient:470 - Setting up Hoodie Write Client
> 2020-09-18 12:54:38,695 [Driver] INFO DeltaSync:registerAvroSchemas:522 - Registering Schema
[jira] [Resolved] (HUDI-1029) Use FastDateFormat for parsing and formating in TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan resolved HUDI-1029.
---------------------------------------
    Assignee: Pratyaksh Sharma
    Resolution: Invalid

> Use FastDateFormat for parsing and formating in TimestampBasedKeyGenerator
>
> Key: HUDI-1029
> URL: https://issues.apache.org/jira/browse/HUDI-1029
> Project: Apache Hudi
> Issue Type: Improvement
> Components: DeltaStreamer
> Reporter: steven zhang
> Assignee: Pratyaksh Sharma
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.5.2
>
> 1. In the TimestampBasedKeyGenerator#getKey method, generating a HoodieKey creates a new SimpleDateFormat object; the date format object could be reused as a class variable.
> 2. SimpleDateFormat is not thread safe, so there is always a potential thread-safety problem.
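The pattern the ticket asks for can be sketched with a shared, thread-safe formatter. The JIRA proposes commons-lang3's FastDateFormat; to stay dependency-free, this sketch uses the JDK's DateTimeFormatter, which is likewise immutable and safe to share, unlike a mutable SimpleDateFormat. The class and method names here are illustrative, not Hudi's actual ones.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class KeyGenSketch {
    // Created once and reused across calls and threads, instead of
    // allocating a new (non-thread-safe) SimpleDateFormat per getKey call.
    private static final DateTimeFormatter PARTITION_FMT =
        DateTimeFormatter.ofPattern("yyyy/MM/dd");

    static String partitionPath(LocalDateTime ts) {
        return PARTITION_FMT.format(ts);
    }

    public static void main(String[] args) {
        // Safe even if many writer threads call this concurrently.
        System.out.println(partitionPath(LocalDateTime.of(2021, 1, 26, 9, 28)));
    }
}
```

With SimpleDateFormat, the choices are a new instance per call (allocation overhead) or a shared instance (race conditions); an immutable formatter avoids both.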
[jira] [Updated] (HUDI-1024) Document S3 related guide and tips
[ https://issues.apache.org/jira/browse/HUDI-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-1024:
--------------------------------------
    Labels: documentation user-support-issues  (was: documentation)

> Document S3 related guide and tips
>
> Key: HUDI-1024
> URL: https://issues.apache.org/jira/browse/HUDI-1024
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Docs
> Reporter: Raymond Xu
> Priority: Minor
> Labels: documentation, user-support-issues
> Fix For: 0.8.0
>
> Create a section in the docs website for Hudi on S3.
[jira] [Updated] (HUDI-1022) Document examples for Spark structured streaming writing into Hudi
[ https://issues.apache.org/jira/browse/HUDI-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-1022:
--------------------------------------
    Labels: user-support-issues  (was: )

> Document examples for Spark structured streaming writing into Hudi
>
> Key: HUDI-1022
> URL: https://issues.apache.org/jira/browse/HUDI-1022
> Project: Apache Hudi
> Issue Type: Task
> Components: Usability
> Reporter: Bhavani Sudha
> Assignee: Felix Kizhakkel Jose
> Priority: Minor
> Labels: user-support-issues
[jira] [Updated] (HUDI-1020) Making timeline server as an external long running service and extending it to be able to plugin business metadata
[ https://issues.apache.org/jira/browse/HUDI-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-1020:
--------------------------------------
    Labels: user-support-issues  (was: )

> Making timeline server as an external long running service and extending it to be able to plugin business metadata
>
> Key: HUDI-1020
> URL: https://issues.apache.org/jira/browse/HUDI-1020
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Usability
> Reporter: Bhavani Sudha
> Priority: Major
> Labels: user-support-issues
>
> Based on the description in the mailing thread - [https://www.mail-archive.com/dev@hudi.apache.org/msg02917.html]
[jira] [Updated] (HUDI-1029) Use FastDateFormat for parsing and formating in TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-1029:
--------------------------------------
    Status: Open  (was: New)

> Use FastDateFormat for parsing and formating in TimestampBasedKeyGenerator
>
> Key: HUDI-1029
> URL: https://issues.apache.org/jira/browse/HUDI-1029
> Project: Apache Hudi
> Issue Type: Improvement
> Components: DeltaStreamer
> Reporter: steven zhang
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.5.2
>
> 1. In the TimestampBasedKeyGenerator#getKey method, generating a HoodieKey creates a new SimpleDateFormat object; the date format object could be reused as a class variable.
> 2. SimpleDateFormat is not thread safe, so there is always a potential thread-safety problem.
[jira] [Updated] (HUDI-1029) Use FastDateFormat for parsing and formating in TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-1029:
--------------------------------------
    Status: In Progress  (was: Open)

> Use FastDateFormat for parsing and formating in TimestampBasedKeyGenerator
>
> Key: HUDI-1029
> URL: https://issues.apache.org/jira/browse/HUDI-1029
> Project: Apache Hudi
> Issue Type: Improvement
> Components: DeltaStreamer
> Reporter: steven zhang
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.5.2
>
> 1. In the TimestampBasedKeyGenerator#getKey method, generating a HoodieKey creates a new SimpleDateFormat object; the date format object could be reused as a class variable.
> 2. SimpleDateFormat is not thread safe, so there is always a potential thread-safety problem.
[jira] [Commented] (HUDI-1292) [Umbrella] RFC-15 : File Listing and Query Planning Optimizations
[ https://issues.apache.org/jira/browse/HUDI-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272266#comment-17272266 ]

Vinoth Chandar commented on HUDI-1292:
--------------------------------------
[~shivnarayan] why is this marked as a user-support-issue?

> [Umbrella] RFC-15 : File Listing and Query Planning Optimizations
>
> Key: HUDI-1292
> URL: https://issues.apache.org/jira/browse/HUDI-1292
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Spark Integration, Writer Core
> Reporter: Vinoth Chandar
> Assignee: Prashant Wason
> Priority: Major
> Labels: pull-request-available, user-support-issues
> Fix For: 0.8.0
>
> This is the umbrella ticket that tracks the overall implementation of RFC-15
[jira] [Updated] (HUDI-1058) Make delete marker configurable
[ https://issues.apache.org/jira/browse/HUDI-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-1058:
--------------------------------------
    Labels: pull-request-available user-support-issues  (was: pull-request-available)

> Make delete marker configurable
>
> Key: HUDI-1058
> URL: https://issues.apache.org/jira/browse/HUDI-1058
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Usability
> Reporter: Raymond Xu
> Assignee: shenh062326
> Priority: Major
> Labels: pull-request-available, user-support-issues
>
> Users can specify any boolean field as the delete marker; `_hoodie_is_deleted` remains the default.
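The requested behavior can be sketched as follows. This is a hypothetical illustration, not Hudi's implementation: records are plain Maps rather than Avro records, and the class, method, and field names (other than the default `_hoodie_is_deleted`) are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class DeleteMarkerSketch {
    // Current hard-coded default; the ticket makes the field name a config.
    static final String DEFAULT_DELETE_FIELD = "_hoodie_is_deleted";

    // A record is treated as a delete when the configured boolean field
    // is present and true.
    static boolean isDeleteRecord(Map<String, Object> record, String deleteField) {
        Object flag = record.get(deleteField);
        return flag instanceof Boolean && (Boolean) flag;
    }

    public static void main(String[] args) {
        Map<String, Object> rec = new HashMap<>();
        rec.put("is_tombstone", true); // user-chosen marker field (hypothetical)

        System.out.println(isDeleteRecord(rec, "is_tombstone"));        // delete
        System.out.println(isDeleteRecord(rec, DEFAULT_DELETE_FIELD));  // not a delete
    }
}
```

The point of the change is that the field name becomes a lookup against user configuration, falling back to `_hoodie_is_deleted` when none is set.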
[jira] [Commented] (HUDI-1036) HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit
[ https://issues.apache.org/jira/browse/HUDI-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272264#comment-17272264 ]

sivabalan narayanan commented on HUDI-1036:
-------------------------------------------
[~nishith29]: any follow up on this?

> HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit
>
> Key: HUDI-1036
> URL: https://issues.apache.org/jira/browse/HUDI-1036
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration
> Reporter: Bhavani Sudha
> Assignee: Nishith Agarwal
> Priority: Major
> Labels: user-support-issues
> Fix For: 0.8.0
>
> Opening this Jira based on the GitHub issue reported here - [https://github.com/apache/hudi/issues/1735]. When hive.input.format = org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, it is not able to create a HoodieRealtimeFileSplit for querying the _rt table. Please see the GitHub issue for more details.
[jira] [Updated] (HUDI-1036) HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit
[ https://issues.apache.org/jira/browse/HUDI-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-1036:
--------------------------------------
    Labels: user-support-issues  (was: )

> HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit
>
> Key: HUDI-1036
> URL: https://issues.apache.org/jira/browse/HUDI-1036
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration
> Reporter: Bhavani Sudha
> Assignee: Nishith Agarwal
> Priority: Major
> Labels: user-support-issues
> Fix For: 0.8.0
>
> Opening this Jira based on the GitHub issue reported here - [https://github.com/apache/hudi/issues/1735]. When hive.input.format = org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, it is not able to create a HoodieRealtimeFileSplit for querying the _rt table. Please see the GitHub issue for more details.