[GitHub] [hudi] bvaradar commented on issue #2346: [SUPPORT]The rt view query returns a wrong result with predicate push down.

2021-01-26 Thread GitBox


bvaradar commented on issue #2346:
URL: https://github.com/apache/hudi/issues/2346#issuecomment-768102437


   Closing this GitHub issue as we have a JIRA to track it.







[GitHub] [hudi] bvaradar closed issue #2346: [SUPPORT]The rt view query returns a wrong result with predicate push down.

2021-01-26 Thread GitBox


bvaradar closed issue #2346:
URL: https://github.com/apache/hudi/issues/2346


   







[GitHub] [hudi] zhedoubushishi commented on pull request #2485: [HUDI-1109] Support Spark Structured Streaming read from Hudi table

2021-01-26 Thread GitBox


zhedoubushishi commented on pull request #2485:
URL: https://github.com/apache/hudi/pull/2485#issuecomment-768083458


   Can you check if this change is compatible with Spark 3.0.0?
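   For readers following the thread: a minimal sketch of what a structured-streaming read of a Hudi table could look like from the Java Spark API. The `hudi` format name and the local paths are assumptions for illustration; the exact options introduced by this PR are not shown.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;

import java.util.concurrent.TimeoutException;

public class HudiStreamingReadSketch {
  public static void main(String[] args) throws StreamingQueryException, TimeoutException {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-streaming-read-sketch")
        .master("local[2]")
        .getOrCreate();

    // "/tmp/hudi/trips" is a placeholder base path for an existing Hudi table.
    Dataset<Row> source = spark.readStream()
        .format("hudi")
        .load("/tmp/hudi/trips");

    // Echo newly ingested commits to the console as they arrive.
    StreamingQuery query = source.writeStream()
        .format("console")
        .outputMode("append")
        .option("checkpointLocation", "/tmp/hudi/trips-checkpoint")
        .start();
    query.awaitTermination();
  }
}
```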







[GitHub] [hudi] codecov-io commented on pull request #2495: [HUDI-1553] Configuration and metrics for the TimelineService.

2021-01-26 Thread GitBox


codecov-io commented on pull request #2495:
URL: https://github.com/apache/hudi/pull/2495#issuecomment-768052489


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2495?src=pr=h1) Report
   > Merging 
[#2495](https://codecov.io/gh/apache/hudi/pull/2495?src=pr=desc) (3397ed3) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/c8ee40f8ae34607072a27d4e7ccb21fc4df13ca1?el=desc)
 (c8ee40f) will **increase** coverage by `11.10%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2495/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2495?src=pr=tree)
   
   ```diff
   @@              Coverage Diff               @@
   ##             master     #2495       +/-   ##
   ==============================================
   + Coverage     50.18%    61.29%    +11.10%
   + Complexity     3051       318      -2733
   ==============================================
     Files           419        53       -366
     Lines         18931      1930     -17001
     Branches       1948       230      -1718
   ==============================================
   - Hits           9501      1183      -8318
   + Misses         8656       623      -8033
   + Partials        774       124       -650
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `61.29% <ø> (-8.19%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2495?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...ies/exception/HoodieSnapshotExporterException.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2V4Y2VwdGlvbi9Ib29kaWVTbmFwc2hvdEV4cG9ydGVyRXhjZXB0aW9uLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==)
 | `5.17% <0.00%> (-83.63%)` | `0.00% <0.00%> (-28.00%)` | |
   | 
[...hudi/utilities/schema/JdbcbasedSchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9KZGJjYmFzZWRTY2hlbWFQcm92aWRlci5qYXZh)
 | `0.00% <0.00%> (-72.23%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...he/hudi/utilities/transform/AWSDmsTransformer.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3RyYW5zZm9ybS9BV1NEbXNUcmFuc2Zvcm1lci5qYXZh)
 | `0.00% <0.00%> (-66.67%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=)
 | `40.46% <0.00%> (-23.70%)` | `27.00% <0.00%> (-6.00%)` | |
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `70.50% <0.00%> (-0.36%)` | `50.00% <0.00%> (-1.00%)` | |
   | 
[.../hadoop/utils/HoodieRealtimeRecordReaderUtils.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3V0aWxzL0hvb2RpZVJlYWx0aW1lUmVjb3JkUmVhZGVyVXRpbHMuamF2YQ==)
 | | | |
   | 
[.../hudi/hadoop/realtime/HoodieRealtimeFileSplit.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL0hvb2RpZVJlYWx0aW1lRmlsZVNwbGl0LmphdmE=)
 | | | |
   | 
[...rg/apache/hudi/common/fs/inline/InLineFSUtils.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9JbkxpbmVGU1V0aWxzLmphdmE=)
 | | | |
   | 
[...e/hudi/common/util/collection/ImmutableTriple.java](https://codecov.io/gh/apache/hudi/pull/2495/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvY29sbGVjdGlvbi9JbW11dGFibGVUcmlwbGUuamF2YQ==)
 | | | |
   | ... and [360 

[GitHub] [hudi] danny0405 commented on pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-26 Thread GitBox


danny0405 commented on pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#issuecomment-768022341


   > @danny0405 sorry for the delay on the review, I was super busy this week. The 
bloom index was merged to master; can we add the bloom index option to this PR 
as well?
   
   I'm not planning to use the BloomFilter index in the new pipeline; instead, 
there is a BloomFilter-backed state index in a following PR, which is 
more suitable for streaming writes.
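   To make the "state index" idea concrete, here is a minimal, hedged sketch of a keyed-state lookup in Flink: the record key is the stream key and the state remembers which file group last held the record. The types, field names, and bucket assignment below are illustrative only; the actual PR's index (including how BloomFilters seed the state) is not reproduced here.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical record type: a Hudi record key plus the file group it is routed to.
class KeyedRecord {
  String recordKey;
  String fileId;
}

public class StateBackedIndexSketch
    extends KeyedProcessFunction<String, KeyedRecord, KeyedRecord> {

  // Keyed state: record key -> file group that currently holds the record.
  private transient ValueState<String> location;

  @Override
  public void open(Configuration parameters) {
    location = getRuntimeContext().getState(
        new ValueStateDescriptor<>("record-location", String.class));
  }

  @Override
  public void processElement(KeyedRecord record, Context ctx, Collector<KeyedRecord> out)
      throws Exception {
    String existing = location.value();
    if (existing != null) {
      // Update: route to the file group already known for this key.
      record.fileId = existing;
    } else {
      // Insert: pick a file group (toy assignment) and remember it in state.
      record.fileId = "fg-" + (record.recordKey.hashCode() & 7);
      location.update(record.fileId);
    }
    out.collect(record);
  }
}
```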







[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-26 Thread GitBox


danny0405 commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r565021738



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/operator/StreamWriteOperatorCoordinator.java
##
@@ -0,0 +1,413 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.operator;
+
+import org.apache.hudi.client.FlinkTaskContextSupplier;
+import org.apache.hudi.client.HoodieFlinkWriteClient;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.common.HoodieFlinkEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.operator.event.BatchWriteSuccessEvent;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.flink.annotation.VisibleForTesting;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.core.memory.DataInputViewStreamWrapper;
+import org.apache.flink.core.memory.DataOutputViewStreamWrapper;
+import org.apache.flink.runtime.jobgraph.OperatorID;
+import org.apache.flink.runtime.operators.coordination.OperatorCoordinator;
+import org.apache.flink.runtime.operators.coordination.OperatorEvent;
+import org.apache.flink.util.Preconditions;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.jetbrains.annotations.Nullable;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.CompletionException;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+/**
+ * {@link OperatorCoordinator} for {@link StreamWriteFunction}.
+ *
+ * This coordinator starts a new instant when a new checkpoint starts. It 
commits the instant when all the
+ * operator tasks write the buffer successfully for a round of checkpoint.
+ *
+ * If there is no data for a round of checkpointing, it rolls back the 
metadata.
+ *
+ * @see StreamWriteFunction for the work flow and semantics
+ */
+public class StreamWriteOperatorCoordinator
+    implements OperatorCoordinator {
+  private static final Logger LOG = 
LoggerFactory.getLogger(StreamWriteOperatorCoordinator.class);
+
+  /**
+   * Config options.
+   */
+  private final Configuration conf;
+
+  /**
+   * Write client.
+   */
+  private transient HoodieFlinkWriteClient writeClient;
+
+  private long inFlightCheckpoint = -1;
+
+  /**
+   * Current REQUESTED instant, for validation.
+   */
+  private String inFlightInstant = "";
+
+  /**
+   * Event buffer for one round of checkpointing. When all the elements are 
non-null and have the same
+   * write instant, then the instant succeed and we can commit it.
+   */
+  private transient BatchWriteSuccessEvent[] eventBuffer;
+
+  /**
+   * Task number of the operator.
+   */
+  private final int parallelism;
+
+  /**
+   * Constructs a StreamingSinkOperatorCoordinator.
+   *
+   * @param confThe config options
+   * @param parallelism The operator task number
+   */
+  public StreamWriteOperatorCoordinator(
+      Configuration conf,
+      int parallelism) {
+    this.conf = conf;
+    this.parallelism = parallelism;
+  }
+
+  @Override
+  public void start() throws Exception {
+    // initialize event buffer
+    reset();
+    // writeClient
+    initWriteClient();
+    // init table, create it if not exists.
+    initTable();
+  }
+
+  @Override
+  public void close() {
+    if (writeClient != null) {
+      writeClient.close();
+    }
+    this.eventBuffer = null;
+  }
+
+  @Override
+  public void checkpointCoordinator(long checkpointId, CompletableFuture<byte[]> result) {
+    try {
+      final String errMsg = "A new checkpoint starts while the last checkpoint buffer"
+
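As a reading aid for the javadoc quoted above, here is a hedged sketch of the commit condition it describes: the coordinator commits the in-flight instant once every operator subtask has reported a success event for that instant. The stub type and its `instantTime` field are stand-ins, not necessarily the names used in the PR.

```java
import java.util.Arrays;

// Minimal stand-in for the event type in the PR; only the field this check needs.
class BatchWriteSuccessEventStub {
  final String instantTime;

  BatchWriteSuccessEventStub(String instantTime) {
    this.instantTime = instantTime;
  }
}

class CommitConditionSketch {
  /**
   * The coordinator can commit once every subtask slot in the event buffer
   * holds an event for the current in-flight instant.
   */
  static boolean readyToCommit(BatchWriteSuccessEventStub[] eventBuffer, String inFlightInstant) {
    return Arrays.stream(eventBuffer)
        .allMatch(e -> e != null && e.instantTime.equals(inFlightInstant));
  }
}
```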

[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-26 Thread GitBox


danny0405 commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r565020444



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/operator/StreamWriteOperatorCoordinator.java
##
@@ -0,0 +1,413 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.operator;
+
+import org.apache.hudi.client.FlinkTaskContextSupplier;
+import org.apache.hudi.client.HoodieFlinkWriteClient;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.common.HoodieFlinkEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.operator.event.BatchWriteSuccessEvent;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.flink.annotation.VisibleForTesting;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.core.memory.DataInputViewStreamWrapper;
+import org.apache.flink.core.memory.DataOutputViewStreamWrapper;
+import org.apache.flink.runtime.jobgraph.OperatorID;
+import org.apache.flink.runtime.operators.coordination.OperatorCoordinator;
+import org.apache.flink.runtime.operators.coordination.OperatorEvent;
+import org.apache.flink.util.Preconditions;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.jetbrains.annotations.Nullable;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.CompletionException;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+/**
+ * {@link OperatorCoordinator} for {@link StreamWriteFunction}.
+ *
+ * This coordinator starts a new instant when a new checkpoint starts. It 
commits the instant when all the
+ * operator tasks write the buffer successfully for a round of checkpoint.
+ *
+ * If there is no data for a round of checkpointing, it rolls back the 
metadata.
+ *
+ * @see StreamWriteFunction for the work flow and semantics
+ */
+public class StreamWriteOperatorCoordinator
+    implements OperatorCoordinator {
+  private static final Logger LOG = 
LoggerFactory.getLogger(StreamWriteOperatorCoordinator.class);
+
+  /**
+   * Config options.
+   */
+  private final Configuration conf;
+
+  /**
+   * Write client.
+   */
+  private transient HoodieFlinkWriteClient writeClient;
+
+  private long inFlightCheckpoint = -1;
+
+  /**
+   * Current REQUESTED instant, for validation.
+   */
+  private String inFlightInstant = "";
+
+  /**
+   * Event buffer for one round of checkpointing. When all the elements are 
non-null and have the same
+   * write instant, then the instant succeed and we can commit it.
+   */
+  private transient BatchWriteSuccessEvent[] eventBuffer;
+
+  /**
+   * Task number of the operator.
+   */
+  private final int parallelism;
+
+  /**
+   * Constructs a StreamingSinkOperatorCoordinator.
+   *
+   * @param confThe config options
+   * @param parallelism The operator task number
+   */
+  public StreamWriteOperatorCoordinator(
+      Configuration conf,
+      int parallelism) {
+    this.conf = conf;
+    this.parallelism = parallelism;
+  }
+
+  @Override
+  public void start() throws Exception {
+    // initialize event buffer
+    reset();
+    // writeClient
+    initWriteClient();
+    // init table, create it if not exists.
+    initTable();
+  }
+
+  @Override
+  public void close() {
+    if (writeClient != null) {
+      writeClient.close();
+    }
+    this.eventBuffer = null;
+  }
+
+  @Override
+  public void checkpointCoordinator(long checkpointId, CompletableFuture<byte[]> result) {
+    try {
+      final String errMsg = "A new checkpoint starts while the last checkpoint buffer"
+

[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-26 Thread GitBox


danny0405 commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r565019061



##
File path: hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java
##
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.operator;
+
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.streamer.FlinkStreamerConfig;
+import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.keygen.SimpleAvroKeyGenerator;
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.flink.configuration.ConfigOption;
+import org.apache.flink.configuration.ConfigOptions;
+import org.apache.flink.configuration.Configuration;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Hoodie Flink config options.
+ *
+ * It has the options for Hoodie table read and write. It also defines some 
utilities.
+ */
+public class FlinkOptions {
+  private FlinkOptions() {
+  }
+
+  // ------------------------------------------------------------------------
+  //  Base Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> PATH = ConfigOptions
+      .key("path")
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Base path for the target hoodie table."
+          + "\nThe path would be created if it does not exist,\n"
+          + "otherwise a Hoodie table expects to be initialized successfully");
+
+  // ------------------------------------------------------------------------
+  //  Read Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> READ_SCHEMA_FILE_PATH = ConfigOptions
+      .key("read.schema.file.path")
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Avro schema file path, the parsed schema is used for deserializing");
+
+  // ------------------------------------------------------------------------
+  //  Write Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> TABLE_NAME = ConfigOptions
+      .key(HoodieWriteConfig.TABLE_NAME)
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Table name to register to Hive metastore");
+
+  public static final ConfigOption<String> TABLE_TYPE = ConfigOptions
+      .key("write.table.type")
+      .stringType()
+      .defaultValue("COPY_ON_WRITE")
+      .withDescription("Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READ");
+
+  public static final ConfigOption<String> OPERATION = ConfigOptions
+      .key("write.operation")
+      .stringType()
+      .defaultValue("upsert")
+      .withDescription("The write operation, that this write should do");
+
+  public static final ConfigOption<String> PRECOMBINE_FIELD = ConfigOptions
+      .key("write.precombine.field")
+      .stringType()
+      .defaultValue("ts")
+      .withDescription("Field used in preCombining before actual write. When two records have the same\n"
+          + "key value, we will pick the one with the largest value for the precombine field,\n"
+          + "determined by Object.compareTo(..)");
+
+  public static final ConfigOption<String> PAYLOAD_CLASS = ConfigOptions
+      .key("write.payload.class")
+      .stringType()
+      .defaultValue(OverwriteWithLatestAvroPayload.class.getName())
+      .withDescription("Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting.\n"
+          + "This will render any value set for the option in-effective");
+
+  /**
+   * Flag to indicate whether to drop duplicates upon insert.
+   * By default insert will accept duplicates, to gain extra performance.
+   */
+  public static final ConfigOption<Boolean> INSERT_DROP_DUPS = ConfigOptions
+      .key("write.insert.drop.duplicates")
+      .booleanType()
+      .defaultValue(false)
+      .withDescription("Flag to indicate whether to drop duplicates upon insert.\n"
+          + "By default insert will accept duplicates, to gain extra performance");
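   The review text for this hunk did not survive the archive truncation. As a reading aid, here is a short hedged sketch of how the options above could be supplied programmatically, assuming the keys stay as shown in the diff; a Flink SQL `WITH` clause would be the more typical entry point, and the values below are illustrative only.

```java
import org.apache.flink.configuration.Configuration;

public class FlinkOptionsUsageSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Keys are taken from the diff above; the values are placeholders.
    conf.setString("path", "hdfs://nameservice/hudi/trips");
    conf.setString("write.table.type", "MERGE_ON_READ");
    conf.setString("write.operation", "upsert");
    conf.setString("write.precombine.field", "ts");
    conf.setBoolean("write.insert.drop.duplicates", false);
    // The write function/coordinator in this PR would then be constructed with `conf`.
    System.out.println(conf);
  }
}
```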

[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-26 Thread GitBox


danny0405 commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r565017180



##
File path: hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java
##
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.operator;
+
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.streamer.FlinkStreamerConfig;
+import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.keygen.SimpleAvroKeyGenerator;
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.flink.configuration.ConfigOption;
+import org.apache.flink.configuration.ConfigOptions;
+import org.apache.flink.configuration.Configuration;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Hoodie Flink config options.
+ *
+ * It has the options for Hoodie table read and write. It also defines some 
utilities.
+ */
+public class FlinkOptions {
+  private FlinkOptions() {
+  }
+
+  // ------------------------------------------------------------------------
+  //  Base Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> PATH = ConfigOptions
+      .key("path")
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Base path for the target hoodie table."
+          + "\nThe path would be created if it does not exist,\n"
+          + "otherwise a Hoodie table expects to be initialized successfully");
+
+  // ------------------------------------------------------------------------
+  //  Read Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> READ_SCHEMA_FILE_PATH = ConfigOptions
+      .key("read.schema.file.path")
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Avro schema file path, the parsed schema is used for deserializing");
+
+  // ------------------------------------------------------------------------
+  //  Write Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> TABLE_NAME = ConfigOptions
+      .key(HoodieWriteConfig.TABLE_NAME)
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Table name to register to Hive metastore");
+
+  public static final ConfigOption<String> TABLE_TYPE = ConfigOptions
+      .key("write.table.type")
+      .stringType()
+      .defaultValue("COPY_ON_WRITE")
+      .withDescription("Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READ");
+
+  public static final ConfigOption<String> OPERATION = ConfigOptions
+      .key("write.operation")
+      .stringType()
+      .defaultValue("upsert")

Review comment:
   No, see `WriteOperationType#fromValue`.
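   For readers following along: `WriteOperationType#fromValue` in hudi-common maps the option string to the enum and throws for unknown values, so the `write.operation` string is validated there rather than in `FlinkOptions`. A small sketch of that validation path (the "bulk_upsert" string in the comment is just an example of an invalid value):

```java
import org.apache.hudi.common.model.WriteOperationType;

public class OperationValidationSketch {
  public static void main(String[] args) {
    // "upsert" matches the default of the write.operation option above.
    WriteOperationType op = WriteOperationType.fromValue("upsert");
    System.out.println(op);
    // An unsupported string such as "bulk_upsert" would make fromValue throw,
    // which is the validation the review reply points to.
  }
}
```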









[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-26 Thread GitBox


danny0405 commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r565017180



##
File path: hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java
##
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.operator;
+
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.streamer.FlinkStreamerConfig;
+import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.keygen.SimpleAvroKeyGenerator;
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.flink.configuration.ConfigOption;
+import org.apache.flink.configuration.ConfigOptions;
+import org.apache.flink.configuration.Configuration;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Hoodie Flink config options.
+ *
+ * It has the options for Hoodie table read and write. It also defines some 
utilities.
+ */
+public class FlinkOptions {
+  private FlinkOptions() {
+  }
+
+  // ------------------------------------------------------------------------
+  //  Base Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> PATH = ConfigOptions
+      .key("path")
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Base path for the target hoodie table."
+          + "\nThe path would be created if it does not exist,\n"
+          + "otherwise a Hoodie table expects to be initialized successfully");
+
+  // ------------------------------------------------------------------------
+  //  Read Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> READ_SCHEMA_FILE_PATH = ConfigOptions
+      .key("read.schema.file.path")
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Avro schema file path, the parsed schema is used for deserializing");
+
+  // ------------------------------------------------------------------------
+  //  Write Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> TABLE_NAME = ConfigOptions
+      .key(HoodieWriteConfig.TABLE_NAME)
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Table name to register to Hive metastore");
+
+  public static final ConfigOption<String> TABLE_TYPE = ConfigOptions
+      .key("write.table.type")
+      .stringType()
+      .defaultValue("COPY_ON_WRITE")
+      .withDescription("Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READ");
+
+  public static final ConfigOption<String> OPERATION = ConfigOptions
+      .key("write.operation")
+      .stringType()
+      .defaultValue("upsert")

Review comment:
   No, see `WriteOperationType#fromValue`.









[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-26 Thread GitBox


danny0405 commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r565017180



##
File path: hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java
##
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.operator;
+
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.streamer.FlinkStreamerConfig;
+import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.keygen.SimpleAvroKeyGenerator;
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.flink.configuration.ConfigOption;
+import org.apache.flink.configuration.ConfigOptions;
+import org.apache.flink.configuration.Configuration;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Hoodie Flink config options.
+ *
+ * It has the options for Hoodie table read and write. It also defines some 
utilities.
+ */
+public class FlinkOptions {
+  private FlinkOptions() {
+  }
+
+  // 
+  //  Base Options
+  // 
+  public static final ConfigOption PATH = ConfigOptions
+  .key("path")
+  .stringType()
+  .noDefaultValue()
+  .withDescription("Base path for the target hoodie table."
+  + "\nThe path would be created if it does not exist,\n"
+  + "otherwise a Hoodie table expects to be initialized successfully");
+
+  // 
+  //  Read Options
+  // 
+  public static final ConfigOption READ_SCHEMA_FILE_PATH = 
ConfigOptions
+  .key("read.schema.file.path")
+  .stringType()
+  .noDefaultValue()
+  .withDescription("Avro schema file path, the parsed schema is used for 
deserializing");
+
+  // 
+  //  Write Options
+  // 
+  public static final ConfigOption TABLE_NAME = ConfigOptions
+  .key(HoodieWriteConfig.TABLE_NAME)
+  .stringType()
+  .noDefaultValue()
+  .withDescription("Table name to register to Hive metastore");
+
+  public static final ConfigOption TABLE_TYPE = ConfigOptions
+  .key("write.table.type")
+  .stringType()
+  .defaultValue("COPY_ON_WRITE")
+  .withDescription("Type of table to write. COPY_ON_WRITE (or) 
MERGE_ON_READ");
+
+  public static final ConfigOption OPERATION = ConfigOptions
+  .key("write.operation")
+  .stringType()
+  .defaultValue("upsert")

Review comment:
   not now ~









[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-26 Thread GitBox


danny0405 commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r565016901



##
File path: hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java
##
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.operator;
+
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.streamer.FlinkStreamerConfig;
+import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.keygen.SimpleAvroKeyGenerator;
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.flink.configuration.ConfigOption;
+import org.apache.flink.configuration.ConfigOptions;
+import org.apache.flink.configuration.Configuration;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Hoodie Flink config options.
+ *
+ * It has the options for Hoodie table read and write. It also defines some 
utilities.
+ */
+public class FlinkOptions {
+  private FlinkOptions() {
+  }
+
+  // ------------------------------------------------------------------------
+  //  Base Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> PATH = ConfigOptions
+      .key("path")
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Base path for the target hoodie table."
+          + "\nThe path would be created if it does not exist,\n"
+          + "otherwise a Hoodie table expects to be initialized successfully");
+
+  // ------------------------------------------------------------------------
+  //  Read Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> READ_SCHEMA_FILE_PATH = ConfigOptions
+      .key("read.schema.file.path")
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Avro schema file path, the parsed schema is used for deserializing");
+
+  // ------------------------------------------------------------------------
+  //  Write Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> TABLE_NAME = ConfigOptions
+      .key(HoodieWriteConfig.TABLE_NAME)
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Table name to register to Hive metastore");
+
+  public static final ConfigOption<String> TABLE_TYPE = ConfigOptions
+      .key("write.table.type")
+      .stringType()
+      .defaultValue("COPY_ON_WRITE")
+      .withDescription("Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READ");

Review comment:
   Replace with `,` instead.









[GitHub] [hudi] garyli1019 commented on pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-26 Thread GitBox


garyli1019 commented on pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#issuecomment-767994932


   @danny0405 sorry for the delay on the review, I was super busy this week. The 
bloom index was merged to master; can we add the bloom index option to this PR 
as well?







[GitHub] [hudi] prashantwason commented on pull request #2496: [HUDI-1554] Introduced buffering for streams in HUDI.

2021-01-26 Thread GitBox


prashantwason commented on pull request #2496:
URL: https://github.com/apache/hudi/pull/2496#issuecomment-767988430


   @n3nash  Please review as this may provide benefits for HDFS workloads. 







[jira] [Updated] (HUDI-1554) Introduce buffering for streams in HUDI

2021-01-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1554:
-
Labels: pull-request-available  (was: )

> Introduce buffering for streams in HUDI
> ---
>
> Key: HUDI-1554
> URL: https://issues.apache.org/jira/browse/HUDI-1554
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
>
> Input and Output streams created in HUDI through calls to 
> HoodieWrapperFileSystem do not include any buffering unless the underlying 
> file system implements buffering.
> DistributedFileSystem (over HDFS) does not implement any buffering. This 
> leads to a very large number of small-sized IO calls being sent to HDFS 
> while performing HUDI IO operations like reading parquet, writing parquet, 
> reading/writing log files, reading/writing instants, etc. 
> This patch introduces buffering at the HoodieWrapperFileSystem level so that 
> all types of reads and writes benefit from buffering.
>  
> In my tests at scale on HDFS, writing 1 million records into a parquet 
> file (read from an existing parquet file in the same dataset), I observed the 
> following benefits:
>  # About 40% reduction in total time to run the test
>  # Total write calls to HDFS reduced from 19.1M -> 328
>  # Total read calls reduced from 229M -> 515K
>  





[GitHub] [hudi] prashantwason opened a new pull request #2496: [HUDI-1554] Introduced buffering for streams in HUDI.

2021-01-26 Thread GitBox


prashantwason opened a new pull request #2496:
URL: https://github.com/apache/hudi/pull/2496


   
   ## What is the purpose of the pull request
   
   Input and Output streams created in HUDI through calls to 
HoodieWrapperFileSystem do not include any buffering unless the underlying file 
system implements buffering.
   
   This patch introduces buffering at the HoodieWrapperFileSystem level so that 
all types of reads and writes benefit from buffering.
   
   ## Brief change log
   
   HoodieWrapperFileSystem changed to introduce BufferedStreams.
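   As a minimal sketch of the kind of wrapping this describes, the snippet below simply layers `java.io` buffering over the raw streams; the actual patch wires this into HoodieWrapperFileSystem's create/open paths, and its method names and buffer-size configuration may differ.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class BufferedStreamSketch {
  // Hypothetical buffer size; the patch may expose its own config knob for this.
  private static final int BUFFER_SIZE = 64 * 1024;

  // Wrap a raw output stream so many small writes coalesce into few large ones.
  static OutputStream buffered(OutputStream raw) {
    return new BufferedOutputStream(raw, BUFFER_SIZE);
  }

  // Same idea on the read side: small reads are served from a local buffer.
  static InputStream buffered(InputStream raw) {
    return new BufferedInputStream(raw, BUFFER_SIZE);
  }
}
```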
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests.
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[jira] [Created] (HUDI-1554) Introduce buffering for streams in HUDI

2021-01-26 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-1554:


 Summary: Introduce buffering for streams in HUDI
 Key: HUDI-1554
 URL: https://issues.apache.org/jira/browse/HUDI-1554
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Prashant Wason
Assignee: Prashant Wason


Input and Output streams created in HUDI through calls to 
HoodieWrapperFileSystem do not include any buffering unless the underlying file 
system implements buffering.

DistributedFileSystem (over HDFS) does not implement any buffering. This leads 
to a very large number of small-sized IO calls being sent to HDFS while 
performing HUDI IO operations like reading parquet, writing parquet, 
reading/writing log files, reading/writing instants, etc. 

This patch introduces buffering at the HoodieWrapperFileSystem level so that 
all types of reads and writes benefit from buffering.

 

In my tests at scale on HDFS, writing 1 million records into a parquet file 
(read from an existing parquet file in the same dataset), I observed the 
following benefits:
 # About 40% reduction in total time to run the test
 # Total write calls to HDFS reduced from 19.1M -> 328
 # Total read calls reduced from 229M -> 515K

 





[GitHub] [hudi] vinothchandar commented on pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.

2021-01-26 Thread GitBox


vinothchandar commented on pull request #2494:
URL: https://github.com/apache/hudi/pull/2494#issuecomment-767962764


   >The size of the base file was 3MB so this means that the in-memory HFile 
block caching was also working.
   
   Trying to understand this part. Was the workload trying to fetch all the 
keys out of the HFile, or just one?







[GitHub] [hudi] codecov-io commented on pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.

2021-01-26 Thread GitBox


codecov-io commented on pull request #2494:
URL: https://github.com/apache/hudi/pull/2494#issuecomment-767956391


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2494?src=pr=h1) Report
   > Merging 
[#2494](https://codecov.io/gh/apache/hudi/pull/2494?src=pr=desc) (19894f6) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/c8ee40f8ae34607072a27d4e7ccb21fc4df13ca1?el=desc)
 (c8ee40f) will **decrease** coverage by `40.49%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2494/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2494?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2493       +/-   ##
   =============================================
   - Coverage     50.18%    9.68%    -40.50%
   + Complexity     3051       48      -3003
   =============================================
     Files           419       53       -366
     Lines         18931     1930     -17001
     Branches       1948      230      -1718
   =============================================
   - Hits           9501      187      -9314
   + Misses         8656     1730      -6926
   + Partials        774       13       -761
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <ø> (-59.80%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2494?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | `0.00% <0.00%> 

[jira] [Updated] (HUDI-1553) Add configs for TimelineServer to configure Jetty

2021-01-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1553:
-
Labels: pull-request-available  (was: )

> Add configs for TimelineServer to configure Jetty
> -
>
> Key: HUDI-1553
> URL: https://issues.apache.org/jira/browse/HUDI-1553
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
>
> TimelineServer uses Javalin which is based on Jetty.
> By default Jetty:
>  * Has 200 threads
>  * Compresses output by gzip
>  * Handles each request sequentially
>  
> On a large-scale HUDI dataset (2000 partitions), when TimelineServer is 
> enabled, the operations slow down due to the following reasons:
>  # The driver process usually has only a few cores. 200 Jetty threads lead to huge 
> contention when 100s of executors connect to the server in parallel.
>  # To handle a large number of requests in parallel, it's better to handle each 
> HTTP request in an asynchronous manner using Futures, which are supported by 
> Javalin.
>  # The compute overhead of gzipping may not be necessary when the executors 
> and driver are in the same rack or within the same datacenter.





[GitHub] [hudi] prashantwason opened a new pull request #2495: [HUDI-1553] Configuration and metrics for the TimelineService.

2021-01-26 Thread GitBox


prashantwason opened a new pull request #2495:
URL: https://github.com/apache/hudi/pull/2495


   ## What is the purpose of the pull request
   
   TimelineServer uses Javalin which is based on Jetty.
   
   By default Jetty:
   
   - Has 200 threads
   - Compresses output by gzip
   - Handles each request sequentially

   
   On a large-scale HUDI dataset (2000 partitions), when TimelineServer is 
enabled, the operations slow down due to the following reasons:
   
- The driver process usually has only a few cores. 200 Jetty threads lead to huge 
contention when 100s of executors connect to the server in parallel.
- To handle a large number of requests in parallel, it's better to handle each 
HTTP request in an asynchronous manner using Futures, which are supported by 
Javalin.
- The compute overhead of gzipping may not be necessary when the executors 
and driver are in the same rack or within the same datacenter.
   
   ## Brief change log
   
   Added settings to control the number of threads created, whether to gzip 
output and to use asynchronous processing of requests. 
   
   With all the settings enabled, a driver process with 8 cores is able to 
handle 1024 executors in parallel on a table with 2000 partitions (CLEAN 
operation which lists all partitions). The time per API request was also 
reduced from 800 msec to 60 msec.
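   To illustrate the knobs being discussed (thread count, gzip, async handling), here is a small hedged sketch using plain embedded Jetty. The actual TimelineService configures these through Javalin, so the real configuration surface added by this PR will look different; the pool sizes below are arbitrary examples.

```java
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.gzip.GzipHandler;
import org.eclipse.jetty.util.thread.QueuedThreadPool;

public class JettyTuningSketch {
  public static void main(String[] args) throws Exception {
    // A small bounded pool instead of Jetty's default of 200 threads.
    QueuedThreadPool pool = new QueuedThreadPool(16, 4);
    Server server = new Server(pool);

    // Gzip becomes opt-in: wrap the application handler only when compression is wanted.
    boolean enableGzip = false;
    if (enableGzip) {
      GzipHandler gzip = new GzipHandler();
      gzip.setHandler(server.getHandler()); // wrap whatever handler the app installed
      server.setHandler(gzip);
    }

    server.start();
    server.join();
  }
}
```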
   
   
   ## Verify this pull request
   
   
   This pull request is already covered by existing tests, such as 
TimelineServer tests and integration tests.
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[jira] [Updated] (HUDI-1553) Add configs for TimelineServer to configure Jetty

2021-01-26 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1553:
-
Status: In Progress  (was: Open)

> Add configs for TimelineServer to configure Jetty
> -
>
> Key: HUDI-1553
> URL: https://issues.apache.org/jira/browse/HUDI-1553
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>
> TimelineServer uses Javalin which is based on Jetty.
> By default Jetty:
>  * Has 200 threads
>  * Compresses output by gzip
>  * Handles each request sequentially
>  
> On a large-scale HUDI dataset (2000 partitions), when TimelineServer is 
> enabled, the operations slow down due to the following reasons:
>  # The driver process usually has only a few cores. 200 Jetty threads lead to huge 
> contention when 100s of executors connect to the server in parallel.
>  # To handle a large number of requests in parallel, it's better to handle each 
> HTTP request in an asynchronous manner using Futures, which are supported by 
> Javalin.
>  # The compute overhead of gzipping may not be necessary when the executors 
> and driver are in the same rack or within the same datacenter.





[jira] [Created] (HUDI-1553) Add configs for TimelineServer to configure Jetty

2021-01-26 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-1553:


 Summary: Add configs for TimelineServer to configure Jetty
 Key: HUDI-1553
 URL: https://issues.apache.org/jira/browse/HUDI-1553
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Prashant Wason
Assignee: Prashant Wason


TimelineServer uses Javalin which is based on Jetty.

By default Jetty:
 * Has 200 threads
 * Compresses output by gzip
 * Handles each request sequentially

 

On a large-scale HUDI dataset (2000 partitions), when TimelineServer is 
enabled, the operations slow down due to the following reasons:
 # The driver process usually has only a few cores. 200 Jetty threads lead to huge 
contention when 100s of executors connect to the server in parallel.
 # To handle a large number of requests in parallel, it's better to handle each 
HTTP request in an asynchronous manner using Futures, which are supported by 
Javalin.
 # The compute overhead of gzipping may not be necessary when the executors and 
driver are in the same rack or within the same datacenter.





[GitHub] [hudi] codecov-io commented on pull request #2493: [WIP] Change another way to convert Path with Scheme

2021-01-26 Thread GitBox


codecov-io commented on pull request #2493:
URL: https://github.com/apache/hudi/pull/2493#issuecomment-767881117


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2493?src=pr=h1) Report
   > Merging 
[#2493](https://codecov.io/gh/apache/hudi/pull/2493?src=pr=desc) (ede32d2) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/c8ee40f8ae34607072a27d4e7ccb21fc4df13ca1?el=desc)
 (c8ee40f) will **decrease** coverage by `40.49%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2493/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2493?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2493       +/-   ##
   =============================================
   - Coverage     50.18%    9.68%    -40.50%
   + Complexity     3051       48      -3003
   =============================================
     Files           419       53       -366
     Lines         18931     1930     -17001
     Branches       1948      230      -1718
   =============================================
   - Hits           9501      187      -9314
   + Misses         8656     1730      -6926
   + Partials        774       13       -761
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <ø> (-59.80%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2493?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | `0.00% <0.00%> 

[jira] [Updated] (HUDI-1552) Improve performance of key lookups from base file (HFile) in Metadata table

2021-01-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1552:
-
Labels: pull-request-available  (was: )

> Improve performance of key lookups from base file (HFile) in Metadata table
> ---
>
> Key: HUDI-1552
> URL: https://issues.apache.org/jira/browse/HUDI-1552
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] prashantwason opened a new pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.

2021-01-26 Thread GitBox


prashantwason opened a new pull request #2494:
URL: https://github.com/apache/hudi/pull/2494


   
   ## What is the purpose of the pull request
   
   Improves the performance of key lookups from Metadata Table. 
   
   In my scale testing with 150 partitions and 100K+ files on HDFS, the time to 
read a key was reduced (100ms avg -> 10ms) and the total data read from the 
HFile was reduced (85MB -> 3MB). The size of the base file was 3MB, so this 
means that the in-memory HFile block caching was also working. 
   
   ## Brief change log
   
   1. Cache the KeyScanner across lookups so that the HFile index does not have 
to be read for each lookup (a rough sketch of this pattern is shown after this list).
   2. Enable block caching in KeyScanner.
   3. Move the lock to a limited scope of the code to reduce lock contention.
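   
   Purely as a hedged illustration of the caching and limited-scope locking pattern described in the change log above, here is a self-contained sketch. The names (`CachedHFileKeyReader`, `SeekableScanner`) are stand-ins and not the actual Hudi or HBase classes touched by this PR.
   
   ```java
   import java.util.Optional;
   import java.util.function.Supplier;
   
   public class CachedHFileKeyReader {
   
     /** Minimal stand-in for an HFile scanner that can seek to a key. */
     interface SeekableScanner {
       Optional<byte[]> seekTo(String key);
     }
   
     private final Supplier<SeekableScanner> scannerFactory;
     private volatile SeekableScanner cachedScanner; // reused across lookups (change #1)
   
     public CachedHFileKeyReader(Supplier<SeekableScanner> scannerFactory) {
       this.scannerFactory = scannerFactory;
     }
   
     public Optional<byte[]> lookup(String key) {
       SeekableScanner scanner = getOrCreateScanner();
       // Lock scope is only the seek itself (change #3), not the index/scanner
       // setup that previously happened on every lookup.
       synchronized (scanner) {
         return scanner.seekTo(key);
       }
     }
   
     private SeekableScanner getOrCreateScanner() {
       SeekableScanner scanner = cachedScanner;
       if (scanner == null) {
         synchronized (this) {
           if (cachedScanner == null) {
             // The factory is assumed to open the HFile with block caching enabled (change #2).
             cachedScanner = scannerFactory.get();
           }
           scanner = cachedScanner;
         }
       }
       return scanner;
     }
   }
   ```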
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests:
   
   mvn test -pl  hudi-client/hudi-spark-client -Dtest=TestHoodieBackedMetadata
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1552) Improve performance of key lookups from base file (HFile) in Metadata table

2021-01-26 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-1552:


 Summary: Improve performance of key lookups from base file (HFile) 
in Metadata table
 Key: HUDI-1552
 URL: https://issues.apache.org/jira/browse/HUDI-1552
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Prashant Wason
Assignee: Prashant Wason






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] kimberlyamandalu commented on issue #1977: Error running hudi on aws glue

2021-01-26 Thread GitBox


kimberlyamandalu commented on issue #1977:
URL: https://github.com/apache/hudi/issues/1977#issuecomment-767812424


   Quick share: 
https://aws.amazon.com/blogs/big-data/writing-to-apache-hudi-tables-using-aws-glue-connector/



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zhedoubushishi commented on pull request #2485: [HUDI-1109] Support Spark Structured Streaming read from Hudi table

2021-01-26 Thread GitBox


zhedoubushishi commented on pull request #2485:
URL: https://github.com/apache/hudi/pull/2485#issuecomment-767810591


   > @pengzhiwei2018 thanks for your contribution. Left some comments but I am 
not quite familiar with Structured streaming. @zhedoubushishi mind taking a 
pass as well?
   
   Sure will take a look.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-825) Write a small blog on how to use hudi-spark with pyspark

2021-01-26 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-825:
-
Status: Open  (was: New)

> Write a small blog on how to use hudi-spark with pyspark
> 
>
> Key: HUDI-825
> URL: https://issues.apache.org/jira/browse/HUDI-825
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs
>Reporter: Nishith Agarwal
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: user-support-issues
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-259) Hadoop 3 support for Hudi writing

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-259:
-
Labels: user-support-issues  (was: bug-bash-0.6.0)

> Hadoop 3 support for Hudi writing
> -
>
> Key: HUDI-259
> URL: https://issues.apache.org/jira/browse/HUDI-259
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Vinoth Chandar
>Assignee: Wenning Ding
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.8.0
>
>
> Sample issues
>  
> [https://github.com/apache/incubator-hudi/issues/735]
> [https://github.com/apache/incubator-hudi/issues/877#issuecomment-528433568] 
> [https://github.com/apache/incubator-hudi/issues/898]
>  
> https://github.com/apache/hudi/issues/1776 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-281) HiveSync failure through Spark when useJdbc is set to false

2021-01-26 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272398#comment-17272398
 ] 

sivabalan narayanan commented on HUDI-281:
--

[~uditme]: is this still a valid issue? 

> HiveSync failure through Spark when useJdbc is set to false
> ---
>
> Key: HUDI-281
> URL: https://issues.apache.org/jira/browse/HUDI-281
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration, Spark Integration, Usability
>Reporter: Udit Mehrotra
>Priority: Major
>  Labels: bug-bash-0.6.0
>
> Table creation with Hive sync through Spark fails when I set *useJdbc* to 
> *false*. Currently I had to modify the code to set *useJdbc* to *false*, as 
> there is no *DataSourceOption* through which I can specify this field when 
> running Hudi code.
> Here is the failure:
> {noformat}
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.start(Lorg/apache/hudi/org/apache/hadoop_hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/session/SessionState;
>   at 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:527)
>   at 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:517)
>   at 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:507)
>   at 
> org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:272)
>   at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:132)
>   at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
>   at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
>   at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229){noformat}
> I was expecting this to fail through Spark, because *hive-exec* is not shaded 
> inside *hudi-spark-bundle*, while *HiveConf* is shaded and relocated. This 
> *SessionState* is coming from the spark-hive jar and obviously it does not 
> accept the relocated *HiveConf* (an illustrative relocation snippet is shown 
> after this description).
> We in *EMR* are running into the same problem when trying to integrate with Glue 
> Catalog. For this we have to create the Hive metastore client through 
> *Hive.get(conf).getMsc()* instead of how it is being done now, so that 
> alternate implementations of the metastore can get created. However, because 
> hive-exec is not shaded but HiveConf is relocated, we run into the same issues 
> there.
> It would not be recommended to shade *hive-exec* either, because it is itself 
> an Uber jar that shades a lot of things, and all of them would end up in the 
> *hudi-spark-bundle* jar. We would not want to go down that route. 
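
For readers unfamiliar with the shading terminology above, a hedged sketch of the kind of maven-shade-plugin relocation being referred to is shown below. The pattern is only inferred from the relocated package name in the stack trace (org.apache.hudi.org.apache.hadoop_hive.conf.HiveConf); the exact rules in the hudi-spark-bundle pom may differ.

{code:xml}
<!-- Illustrative relocation sketch, not copied from the actual bundle pom. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>org.apache.hadoop.hive.conf.</pattern>
        <shadedPattern>org.apache.hudi.org.apache.hadoop_hive.conf.</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
{code}

Because spark-hive's own SessionState.start still expects the original org.apache.hadoop.hive.conf.HiveConf, passing it the relocated class is what produces the NoSuchMethodError above.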

[jira] [Updated] (HUDI-281) HiveSync failure through Spark when useJdbc is set to false

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-281:
-
Labels: user-support-issues  (was: bug-bash-0.6.0)

> HiveSync failure through Spark when useJdbc is set to false
> ---
>
> Key: HUDI-281
> URL: https://issues.apache.org/jira/browse/HUDI-281
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration, Spark Integration, Usability
>Reporter: Udit Mehrotra
>Priority: Major
>  Labels: user-support-issues
>
> Table creation with Hive sync through Spark fails when I set *useJdbc* to 
> *false*. Currently I had to modify the code to set *useJdbc* to *false*, as 
> there is no *DataSourceOption* through which I can specify this field when 
> running Hudi code.
> Here is the failure:
> {noformat}
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.start(Lorg/apache/hudi/org/apache/hadoop_hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/session/SessionState;
>   at 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:527)
>   at 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:517)
>   at 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:507)
>   at 
> org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:272)
>   at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:132)
>   at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
>   at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
>   at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229){noformat}
> I was expecting this to fail through Spark, because *hive-exec* is not shaded 
> inside *hudi-spark-bundle*, while *HiveConf* is shaded and relocated. This 
> *SessionState* is coming from the spark-hive jar and obviously it does not 
> accept the relocated *HiveConf*.
> We in *EMR* are running into the same problem when trying to integrate with Glue 
> Catalog. For this we have to create the Hive metastore client through 
> *Hive.get(conf).getMsc()* instead of how it is being done now, so that 
> alternate implementations of the metastore can get created. However, because 
> hive-exec is not shaded but HiveConf is relocated, we run into the same issues 
> there.
> It would not be recommended to shade *hive-exec* either, because it is itself 
> an Uber jar that shades a lot of things, and all of them would end up in the 
> *hudi-spark-bundle* jar. We would not want to go down that route. That is why, 
> we 

[jira] [Updated] (HUDI-280) Integrate Hudi to bigtop

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-280:
-
Labels: user-support-issues  (was: )

> Integrate Hudi to bigtop
> 
>
> Key: HUDI-280
> URL: https://issues.apache.org/jira/browse/HUDI-280
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Vinoth Chandar
>Assignee: leesf
>Priority: Major
>  Labels: user-support-issues
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-824) Register hudi-spark package with spark packages repo for easier usage of Hudi

2021-01-26 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272397#comment-17272397
 ] 

Vinoth Govindarajan commented on HUDI-824:
--

[~nagarwal] - All the Apache projects are available directly to use with the 
`--packages` option; I tried it with pyspark and it worked:


{code:java}
spark-shell \
  --packages 
org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1
 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
{code}
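 
Since the comment mentions trying this with pyspark, the equivalent pyspark launch is shown below as a hedged sketch; it simply assumes the same bundle coordinates and serializer setting as the spark-shell command above:

{code}
pyspark \
  --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
{code}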
 

The same instructions have been updated in the following doc:
[https://hudi.apache.org/docs/quick-start-guide.html]

 

No further action needed; let me know if it's okay to close this issue.

 

> Register hudi-spark package with spark packages repo for easier usage of Hudi
> -
>
> Key: HUDI-824
> URL: https://issues.apache.org/jira/browse/HUDI-824
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Nishith Agarwal
>Assignee: Vinoth Govindarajan
>Priority: Minor
>  Labels: user-support-issues
>
> At the moment, to be able to use Hudi with spark, users have to do the 
> following : 
>  
> {{spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>   --jars `ls 
> packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` 
> \
>   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}}
> {{Ideally, we want to be able to use Hudi as follows:}}
>  
> {{spark-2.4.4-bin-hadoop2.7/bin/spark-shell \ --packages 
> org.apache.hudi:hudi-spark-bundle: \
>   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-282) Update documentation to reflect additional option of HiveSync via metastore

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan resolved HUDI-282.
--
Fix Version/s: 0.5.2
   Resolution: Fixed

> Update documentation to reflect additional option of HiveSync via metastore
> ---
>
> Key: HUDI-282
> URL: https://issues.apache.org/jira/browse/HUDI-282
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Hive Integration
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-295) Do one-time cleanup of Hudi git history

2021-01-26 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272396#comment-17272396
 ] 

sivabalan narayanan commented on HUDI-295:
--

[~vinoth]: is this still valid ?

> Do one-time cleanup of Hudi git history
> ---
>
> Key: HUDI-295
> URL: https://issues.apache.org/jira/browse/HUDI-295
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs
>Reporter: Vinoth Chandar
>Priority: Major
>
> https://lists.apache.org/thread.html/dc6eb516e248088dac1a2b5c9690383dfe2eb3912f76bbe9dd763c2b@



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-303) Avro schema case sensitivity testing

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-303:
-
Labels: user-support-issues  (was: bug-bash-0.6.0)

> Avro schema case sensitivity testing
> 
>
> Key: HUDI-303
> URL: https://issues.apache.org/jira/browse/HUDI-303
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Spark Integration
>Reporter: Udit Mehrotra
>Assignee: liwei
>Priority: Minor
>  Labels: user-support-issues
>
> As a fallout of [PR 956|https://github.com/apache/incubator-hudi/pull/956] we 
> would like to understand how Avro behaves with case sensitive column names.
> Couple of action items:
>  * Test with different field names just differing in case.
>  * *AbstractRealtimeRecordReader* is one of the classes where we are 
> converting Avro Schema field names to lower case, to be able to verify them 
> against column names from Hive. We can consider removing the *lowercase* 
> conversion there if we verify it does not break anything.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-306) Get to Hudi to support AWS Glue Catalog and other Hive Metastore implementations

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan resolved HUDI-306.
--
Fix Version/s: 0.5.2
   Resolution: Fixed

> Get to Hudi to support AWS Glue Catalog and other Hive Metastore 
> implementations
> 
>
> Key: HUDI-306
> URL: https://issues.apache.org/jira/browse/HUDI-306
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hudi currently does not work with AWS Glue Catalog. The issue/exception it 
> runs into has been reported here as well 
> [issue|https://github.com/apache/incubator-hudi/issues/954] .
> As mentioned in the issue, the reason for this is:
>  * Currently Hudi is interacting with Hive in two different ways:
>  ** Creation of table statement is submitted directly to Hive via JDBC 
> [https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L472]
>  . Thus, Hive will internally create the right metastore client (i.e. Glue if 
> {{*hive.metastore.client.factory.class*}} is set to 
> {{*com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory*}}
>  in hive-site)
>  ** Whereas partition listing among other things are being done by directly 
> calling hive metastore APIs using hive metastore client: 
> [https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L240]
>  * Now in Hudi code, the standard metastore client implementation (not the 
> Glue metastore client) is instantiated directly: 
> [https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L109]
>  .
>  * Ideally this instantiation of metastore client should be left to Hive 
> through 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L5045]
>  for it to consider other implementations of metastore client that might be 
> configured through {{*hive.metastore.client.factory.class*}} .
> That is why the table gets created in the Glue metastore, but while reading 
> or scanning partitions Hudi talks to the local Hive metastore, where it does 
> not find the table. (An illustrative hive-site.xml setting for the Glue 
> factory is sketched below.)
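
As a hedged illustration of the setting referenced above: the property name and factory class come straight from the description, while the rest of hive-site.xml is site-specific.

{code:xml}
<!-- Illustrative hive-site.xml fragment; everything beyond this property is site-specific. -->
<property>
  <name>hive.metastore.client.factory.class</name>
  <value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
</property>
{code}

With this in place the JDBC path picks up the Glue client, but the direct metastore-client calls described above still bypass it, which is exactly the mismatch this issue describes.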



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-306) Get to Hudi to support AWS Glue Catalog and other Hive Metastore implementations

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reopened HUDI-306:
--

> Get to Hudi to support AWS Glue Catalog and other Hive Metastore 
> implementations
> 
>
> Key: HUDI-306
> URL: https://issues.apache.org/jira/browse/HUDI-306
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hudi currently does not work with AWS Glue Catalog. The issue/exception it 
> runs into has been reported here as well 
> [issue|https://github.com/apache/incubator-hudi/issues/954] .
> As mentioned in the issue, the reason for this is:
>  * Currently Hudi is interacting with Hive in two different ways:
>  ** Creation of table statement is submitted directly to Hive via JDBC 
> [https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L472]
>  . Thus, Hive will internally create the right metastore client (i.e. Glue if 
> {{*hive.metastore.client.factory.class*}} is set to 
> {{*com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory*}}
>  in hive-site)
>  ** Whereas partition listing among other things are being done by directly 
> calling hive metastore APIs using hive metastore client: 
> [https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L240]
>  * Now in Hudi code, standard specific implementation of the metastore client 
> (not glue metastore client) is being instantiated: 
> [https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L109]
>  .
>  * Ideally this instantiation of metastore client should be left to Hive 
> through 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L5045]
>  for it to consider other implementations of metastore client that might be 
> configured through {{*hive.metastore.client.factory.class*}} .
> That is why the table gets created in the Glue metastore, but while reading 
> or scanning partitions Hudi talks to the local Hive metastore, where it does 
> not find the table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-306) Get to Hudi to support AWS Glue Catalog and other Hive Metastore implementations

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-306:
-
Status: Closed  (was: Patch Available)

> Get to Hudi to support AWS Glue Catalog and other Hive Metastore 
> implementations
> 
>
> Key: HUDI-306
> URL: https://issues.apache.org/jira/browse/HUDI-306
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hudi currently does not work with AWS Glue Catalog. The issue/exception it 
> runs into has been reported here as well 
> [issue|https://github.com/apache/incubator-hudi/issues/954] .
> As mentioned in the issue, the reason for this is:
>  * Currently Hudi is interacting with Hive in two different ways:
>  ** Creation of table statement is submitted directly to Hive via JDBC 
> [https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L472]
>  . Thus, Hive will internally create the right metastore client (i.e. Glue if 
> {{*hive.metastore.client.factory.class*}} is set to 
> {{*com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory*}}
>  in hive-site)
>  ** Whereas partition listing among other things are being done by directly 
> calling hive metastore APIs using hive metastore client: 
> [https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L240]
>  * Now in Hudi code, standard specific implementation of the metastore client 
> (not glue metastore client) is being instantiated: 
> [https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L109]
>  .
>  * Ideally this instantiation of metastore client should be left to Hive 
> through 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L5045]
>  for it to consider other implementations of metastore client that might be 
> configured through {{*hive.metastore.client.factory.class*}} .
> That is why the table gets created in the Glue metastore, but while reading 
> or scanning partitions Hudi talks to the local Hive metastore, where it does 
> not find the table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-307) Dataframe written with Date,Timestamp, Decimal is read with same types

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan resolved HUDI-307.
--
Fix Version/s: (was: 0.8.0)
   0.7.0
   Resolution: Fixed

> Dataframe written with Date,Timestamp, Decimal is read with same types
> --
>
> Key: HUDI-307
> URL: https://issues.apache.org/jira/browse/HUDI-307
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Spark Integration
>Reporter: Cosmin Iordache
>Assignee: Udit Mehrotra
>Priority: Minor
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.7.0
>
>
> Small test for COW table to check the persistence of Date, Timestamp ,Decimal 
> types



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-307) Dataframe written with Date,Timestamp, Decimal is read with same types

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-307:
-
Status: In Progress  (was: Open)

> Dataframe written with Date,Timestamp, Decimal is read with same types
> --
>
> Key: HUDI-307
> URL: https://issues.apache.org/jira/browse/HUDI-307
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Spark Integration
>Reporter: Cosmin Iordache
>Assignee: Udit Mehrotra
>Priority: Minor
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.8.0
>
>
> Small test for COW table to check the persistence of Date, Timestamp ,Decimal 
> types



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer

2021-01-26 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272392#comment-17272392
 ] 

sivabalan narayanan commented on HUDI-310:
--

[~vinoth]: Is this still relevant? Do we keep it open?

> DynamoDB/Kinesis Change Capture using Delta Streamer
> 
>
> Key: HUDI-310
> URL: https://issues.apache.org/jira/browse/HUDI-310
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
>
> The goal here is to do CDC from DynamoDB and then have it be ingested into S3 
> as a Hudi dataset 
> Few resources: 
>  # DynamoDB Streams 
> [https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html]
>   provides change capture logs in Kinesis. 
>  # Walkthrough 
> [https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html]
>  Code [https://github.com/awslabs/dynamodb-streams-kinesis-adapter] 
>  # Spark Streaming has support for reading Kinesis streams 
> [https://spark.apache.org/docs/2.4.4/streaming-kinesis-integration.html] one 
> of the many resources showing how to change the Spark Kinesis example code to 
> consume dynamodb stream   
> [https://medium.com/@ravi72munde/using-spark-streaming-with-dynamodb-d325b9a73c79]
>  # In DeltaStreamer, we need to add some form of KinesisSource that returns an 
> RDD with new data every time `fetchNewData` is called 
> [https://github.com/apache/incubator-hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/Source.java]
>   . DeltaStreamer itself does not use Spark Streaming APIs (a rough sketch of 
> this contract appears after this description).
>  # Internally, we have Avro, Json, Row sources that extract data in these 
> formats. 
> Open questions : 
>  # Should this just be a KinesisSource inside Hudi, that needs to be 
> configured differently or do we need two sources: DynamoDBKinesisSource (that 
> does some DynamoDB Stream specific setup/assumptions) and a plain 
> KinesisSource. What's more valuable to do , if we have to pick one. 
>  # For Kafka integration, we just reused the KafkaRDD in Spark Streaming 
> easily and avoided writing a lot of code by hand. Could we pull the same 
> thing off for Kinesis? (probably needs digging through Spark code) 
>  # What's the format of the data for DynamoDB streams? 
>  
>  
> We should probably flesh these out before going ahead with implementation? 
>  
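
Purely as a discussion aid for the `fetchNewData` contract mentioned in point 4 of the description, here is a hedged, self-contained sketch. The type names (IllustrativeSource, Batch, KinesisSourceSketch) are stand-ins and deliberately do not mirror the real hudi-utilities Source API.

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Minimal stand-in for the source contract the issue describes.
abstract class IllustrativeSource<T> {

  /** Records pulled since the last checkpoint, plus the checkpoint to persist next. */
  static final class Batch<T> {
    final List<T> records;
    final String checkpoint;
    Batch(List<T> records, String checkpoint) {
      this.records = records;
      this.checkpoint = checkpoint;
    }
  }

  /** Called by a (hypothetical) DeltaStreamer loop on every sync round. */
  abstract Batch<T> fetchNewData(Optional<String> lastCheckpoint, long sourceLimit);
}

/** Sketch of a Kinesis-backed source: poll shards after the last recorded sequence number. */
class KinesisSourceSketch extends IllustrativeSource<String> {
  @Override
  Batch<String> fetchNewData(Optional<String> lastCheckpoint, long sourceLimit) {
    // A real implementation would call the Kinesis (or DynamoDB Streams adapter)
    // client here and stop once sourceLimit records have been read.
    List<String> newRecords = Collections.emptyList();
    String newCheckpoint = lastCheckpoint.orElse("TRIM_HORIZON");
    return new Batch<>(newRecords, newCheckpoint);
  }
}
{code}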



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-310:
-
Labels: user-support-issues  (was: )

> DynamoDB/Kinesis Change Capture using Delta Streamer
> 
>
> Key: HUDI-310
> URL: https://issues.apache.org/jira/browse/HUDI-310
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: user-support-issues
>
> The goal here is to do CDC from DynamoDB and then have it be ingested into S3 
> as a Hudi dataset 
> Few resources: 
>  # DynamoDB Streams 
> [https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html]
>   provides change capture logs in Kinesis. 
>  # Walkthrough 
> [https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html]
>  Code [https://github.com/awslabs/dynamodb-streams-kinesis-adapter] 
>  # Spark Streaming has support for reading Kinesis streams 
> [https://spark.apache.org/docs/2.4.4/streaming-kinesis-integration.html] one 
> of the many resources showing how to change the Spark Kinesis example code to 
> consume dynamodb stream   
> [https://medium.com/@ravi72munde/using-spark-streaming-with-dynamodb-d325b9a73c79]
>  # In DeltaStreamer, we need to add some form of KinesisSource that returns an 
> RDD with new data every time `fetchNewData` is called 
> [https://github.com/apache/incubator-hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/Source.java]
>   . DeltaStreamer itself does not use Spark Streaming APIs
>  # Internally, we have Avro, Json, Row sources that extract data in these 
> formats. 
> Open questions : 
>  # Should this just be a KinesisSource inside Hudi, that needs to be 
> configured differently or do we need two sources: DynamoDBKinesisSource (that 
> does some DynamoDB Stream specific setup/assumptions) and a plain 
> KinesisSource. What's more valuable to do , if we have to pick one. 
>  # For Kafka integration, we just reused the KafkaRDD in Spark Streaming 
> easily and avoided writing a lot of code by hand. Could we pull the same 
> thing off for Kinesis? (probably needs digging through Spark code) 
>  # What's the format of the data for DynamoDB streams? 
>  
>  
> We should probably flesh these out before going ahead with implementation? 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-318) Update Migration Guide to Include Delta Streamer

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-318:
-
Labels: doc user-support-issues  (was: doc)

> Update Migration Guide to Include Delta Streamer
> 
>
> Key: HUDI-318
> URL: https://issues.apache.org/jira/browse/HUDI-318
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Yanjia Gary Li
>Priority: Minor
>  Labels: doc, user-support-issues
>
> [http://hudi.apache.org/migration_guide.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-824) Register hudi-spark package with spark packages repo for easier usage of Hudi

2021-01-26 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-824:
-
Status: Open  (was: New)

> Register hudi-spark package with spark packages repo for easier usage of Hudi
> -
>
> Key: HUDI-824
> URL: https://issues.apache.org/jira/browse/HUDI-824
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Nishith Agarwal
>Assignee: Vinoth Govindarajan
>Priority: Minor
>  Labels: user-support-issues
>
> At the moment, to be able to use Hudi with spark, users have to do the 
> following : 
>  
> {{spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>   --jars `ls 
> packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` 
> \
>   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}}
> {{Ideally, we want to be able to use Hudi as follows:}}
>  
> {{spark-2.4.4-bin-hadoop2.7/bin/spark-shell \ --packages 
> org.apache.hudi:hudi-spark-bundle: \
>   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-360) Add github stale action workflow for issue management

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-360:
-
Labels: user-support-issues  (was: )

> Add github stale action workflow for issue management
> -
>
> Key: HUDI-360
> URL: https://issues.apache.org/jira/browse/HUDI-360
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Gurudatt Kulkarni
>Assignee: Gurudatt Kulkarni
>Priority: Major
>  Labels: user-support-issues
>
> Add a GitHub action for closing stale (90 days) issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-352) The official documentation about project structure missed hudi-timeline-service module

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-352:
-
Labels: starter user-support-issues  (was: starter)

> The official documentation about project structure missed 
> hudi-timeline-service module
> --
>
> Key: HUDI-352
> URL: https://issues.apache.org/jira/browse/HUDI-352
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: vinoyang
>Priority: Major
>  Labels: starter, user-support-issues
>
> The official documentation about the project structure[1] is missing the 
> hudi-timeline-service module; we should add it.
> [1]: http://hudi.apache.org/contributing.html#code--project-structure



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-395) hudi does not support scheme s3n when writing to S3

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-395:
-
Labels: user-support-issues  (was: bug-bash-0.6.0)

> hudi does not support scheme s3n when writing to S3
> ---
>
> Key: HUDI-395
> URL: https://issues.apache.org/jira/browse/HUDI-395
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: newbie, Spark Integration, Usability
> Environment: spark-2.4.4-bin-hadoop2.7
>Reporter: rui feng
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: user-support-issues
>
> When I use Hudi to create a hudi table and then write to S3, I used the Maven 
> snippet below, which is recommended by [https://hudi.apache.org/s3_hoodie.html]
> <dependency>
>   <groupId>org.apache.hudi</groupId>
>   <artifactId>hudi-spark-bundle</artifactId>
>   <version>0.5.0-incubating</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-aws</artifactId>
>   <version>2.7.3</version>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>aws-java-sdk</artifactId>
>   <version>1.10.34</version>
> </dependency>
> and add the below configuration:
> sc.hadoopConfiguration.set("fs.defaultFS", "s3://niketest1")
>  sc.hadoopConfiguration.set("fs.s3.impl", 
> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>  sc.hadoopConfiguration.set("fs.s3n.impl", 
> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>  sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "xx")
>  sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "x")
>  sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xx")
>  sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "x")
>  
> my spark version is spark-2.4.4-bin-hadoop2.7 and when I run below
> {color:#FF}df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Overwrite).save(hudiTablePath).{color}
> val hudiOptions = Map[String,String](
>  HoodieWriteConfig.TABLE_NAME -> "hudi12",
>  DataSourceWriteOptions.OPERATION_OPT_KEY -> 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL,
>  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "rider",
>  DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
> val hudiTablePath = "s3://niketest1/hudi_test/hudi12"
> the exception occurs:
> {color:#FF}java.lang.IllegalArgumentException: 
> BlockAlignedAvroParquetWriter does not support scheme s3n{color}
>  at 
> org.apache.hudi.common.io.storage.HoodieWrapperFileSystem.getHoodieScheme(HoodieWrapperFileSystem.java:109)
>  at 
> org.apache.hudi.common.io.storage.HoodieWrapperFileSystem.convertToHoodiePath(HoodieWrapperFileSystem.java:85)
>  at 
> org.apache.hudi.io.storage.HoodieParquetWriter.(HoodieParquetWriter.java:57)
>  at 
> org.apache.hudi.io.storage.HoodieStorageWriterFactory.newParquetStorageWriter(HoodieStorageWriterFactory.java:60)
>  at 
> org.apache.hudi.io.storage.HoodieStorageWriterFactory.getStorageWriter(HoodieStorageWriterFactory.java:44)
>  at org.apache.hudi.io.HoodieCreateHandle.(HoodieCreateHandle.java:70)
>  at 
> org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:137)
>  at 
> org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:125)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:38)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:120)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
>  
>  
> Can anyone tell me what is causing this exception? I tried to use 
> org.apache.hadoop.fs.s3.S3FileSystem to replace 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem for the conf "fs.s3.impl", 
> but another exception occurred, and it seems org.apache.hadoop.fs.s3.S3FileSystem 
> fits hadoop 2.6.
>  
> Thanks in advance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-505) Add unified javadoc to the Hudi Website

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-505:
-
Labels: user-support-issues  (was: )

> Add unified javadoc to the Hudi Website
> ---
>
> Key: HUDI-505
> URL: https://issues.apache.org/jira/browse/HUDI-505
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: user-support-issues
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-465) Make Hive Sync via Spark painless

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-465:
-
Labels: help-wanted starter user-support-issues  (was: help-wanted starter)

> Make Hive Sync via Spark painless
> -
>
> Key: HUDI-465
> URL: https://issues.apache.org/jira/browse/HUDI-465
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Hive Integration, Spark Integration, Usability
>Reporter: Vinoth Chandar
>Assignee: liwei
>Priority: Major
>  Labels: help-wanted, starter, user-support-issues
>
> Currently, we require many configs to be passed in for the Hive sync. This 
> has to be simplified, and the experience should be close to how a regular 
> spark.write.parquet registers into Hive. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-396) Provide an documentation to describe how to use test suite

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-396:
-
Labels: user-support-issues  (was: )

> Provide an documentation to describe how to use test suite
> --
>
> Key: HUDI-396
> URL: https://issues.apache.org/jira/browse/HUDI-396
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: vinoyang
>Assignee: wangxianghu
>Priority: Major
>  Labels: user-support-issues
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-632) Update documentation (docker_demo) to mention both commit and deltacommit files

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan resolved HUDI-632.
--
Fix Version/s: 0.7.0
   Resolution: Fixed

> Update documentation (docker_demo) to mention both commit and deltacommit 
> files 
> 
>
> Key: HUDI-632
> URL: https://issues.apache.org/jira/browse/HUDI-632
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Vikrant Goel
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In the demo, we could have commit or deltacommit files created depending on 
> the type of table. Updating it will help avoid potential confusion.
> [https://hudi.incubator.apache.org/docs/docker_demo.html#step-2-incrementally-ingest-data-from-kafka-topic]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-648) Implement error log/table for Datasource/DeltaStreamer/WriteClient/Compaction writes

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-648:
-
Labels: user-support-issues  (was: )

> Implement error log/table for Datasource/DeltaStreamer/WriteClient/Compaction 
> writes
> 
>
> Key: HUDI-648
> URL: https://issues.apache.org/jira/browse/HUDI-648
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer, Spark Integration, Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: user-support-issues
>
> We would like a way to hand the erroring records from writing or compaction 
> back to the users, in a separate table or log. This needs to work generically 
> across all the different writer paths.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-651) Incremental Query on Hive via Spark SQL does not return expected results

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-651:
-
Labels: pull-request-available user-support-issues  (was: 
pull-request-available)

> Incremental Query on Hive via Spark SQL does not return expected results
> 
>
> Key: HUDI-651
> URL: https://issues.apache.org/jira/browse/HUDI-651
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available, user-support-issues
> Fix For: 0.8.0
>
>
> Using the docker demo, I added two delta commits to a MOR table and was 
> hoping to incrementally consume them as in Hive QL. Something is amiss.
> {code}
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.start.timestamp","20200302210147")
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.mode","INCREMENTAL")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> +---+
> |_hoodie_commit_time|
> +---+
> |20200302210010 |
> |20200302210147 |
> +---+
> scala> sc.setLogLevel("INFO")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44 stored as 
> values in memory (estimated size 292.3 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44_piece0 stored 
> as bytes in memory (estimated size 25.4 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO storage.BlockManagerInfo: Added broadcast_44_piece0 in 
> memory on adhoc-1:45623 (size: 25.4 KB, free: 366.2 MB)
> 20/03/02 21:15:37 INFO spark.SparkContext: Created broadcast 44 from 
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
> file:/etc/hadoop/hive-site.xml], FileSystem: 
> [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root 
> (auth:SIMPLE)]]]
> 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of 
> type MERGE_ON_READ(version=1) from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO mapred.FileInputFormat: Total input paths to process : 
> 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 
> groups
> 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants 
> [[20200302210010__clean__COMPLETED], 
> [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], 
> [20200302210147__deltacommit__COMPLETED]]
> 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for 
> partition :2018/08/31, #FileGroups=1
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: 
> NumFiles=1, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Total paths to 
> process after hoodie filter 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> 

[jira] [Updated] (HUDI-691) hoodie.*.consume.* should be set whitelist in hive-site.xml

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-691:
-
Labels: user-support-issues  (was: )

> hoodie.*.consume.* should be set whitelist in hive-site.xml
> ---
>
> Key: HUDI-691
> URL: https://issues.apache.org/jira/browse/HUDI-691
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs, newbie
>Reporter: Bhavani Sudha
>Assignee: GarudaGuo
>Priority: Minor
>  Labels: user-support-issues
> Fix For: 0.8.0
>
>
> More details in this GH issue - 
> https://github.com/apache/incubator-hudi/issues/910
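
As a hedged illustration only (the exact property should be confirmed against the Hive version in use and the GH issue above), whitelisting the hoodie consume settings in hive-site.xml typically looks something like the following:

{code:xml}
<!-- Illustrative sketch: allow hoodie.* session-level settings such as hoodie.<table>.consume.mode. -->
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>hoodie.*</value>
</property>
{code}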



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-653) Add JMX Report Config to Doc

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan resolved HUDI-653.
--
Fix Version/s: 0.7.0
   Resolution: Fixed

> Add JMX Report Config to Doc
> 
>
> Key: HUDI-653
> URL: https://issues.apache.org/jira/browse/HUDI-653
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Add Jmx Report Config to Doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-653) Add JMX Report Config to Doc

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-653:
-
Status: In Progress  (was: Open)

> Add JMX Report Config to Doc
> 
>
> Key: HUDI-653
> URL: https://issues.apache.org/jira/browse/HUDI-653
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Add Jmx Report Config to Doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-653) Add JMX Report Config to Doc

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-653:
-
Status: Open  (was: New)

> Add JMX Report Config to Doc
> 
>
> Key: HUDI-653
> URL: https://issues.apache.org/jira/browse/HUDI-653
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Add Jmx Report Config to Doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-718) java.lang.ClassCastException during upsert

2021-01-26 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272380#comment-17272380
 ] 

sivabalan narayanan commented on HUDI-718:
--

[~afilipchik]: do you still face this issue? 

> java.lang.ClassCastException during upsert
> --
>
> Key: HUDI-718
> URL: https://issues.apache.org/jira/browse/HUDI-718
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.8.0
>
> Attachments: image-2020-03-21-16-49-28-905.png
>
>
> Dataset was created using hudi 0.5 and now trying to migrate it to the latest 
> master. The table is written using SqlTransformer. Exception:
>  
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge 
> old record into new file for key bla.bla from old file 
> gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet
>  to new file 
> gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be 
> cast to org.apache.avro.generic.GenericFixed
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
>  at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
>  at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
>  at 
> org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242)
>  ... 8 more
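A cast like Utf8 -> GenericFixed usually means the Avro schema declares a field as a fixed type (for example a fixed-backed decimal) while the value actually placed in the record is a string. A minimal sketch of that mismatch, with a made-up schema and field name rather than anything from the reported table:

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.util.Utf8;

public class FixedVsUtf8Example {
  public static void main(String[] args) {
    // "amount" is declared as an Avro fixed type, as a fixed-backed decimal would be.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
            + "{\"name\":\"amount\",\"type\":{\"type\":\"fixed\",\"name\":\"Amount\",\"size\":8}}]}");

    GenericRecord row = new GenericData.Record(schema);
    // Storing a Utf8 string where a GenericFixed is expected is not rejected here,
    // but a downstream writer such as parquet-avro's AvroWriteSupport will try to
    // cast the value to GenericFixed and fail with the ClassCastException above.
    row.put("amount", new Utf8("12.34"));

    // prints false: the datum does not conform to the declared fixed type
    System.out.println(GenericData.get().validate(schema, row));
  }
}
{code}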



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-718) java.lang.ClassCastException during upsert

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-718:
-
Labels: user-support-issues  (was: bug-bash-0.6.0)

> java.lang.ClassCastException during upsert
> --
>
> Key: HUDI-718
> URL: https://issues.apache.org/jira/browse/HUDI-718
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.8.0
>
> Attachments: image-2020-03-21-16-49-28-905.png
>
>
> Dataset was created using hudi 0.5 and now trying to migrate it to the latest 
> master. The table is written using SqlTransformer. Exception:
>  
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge 
> old record into new file for key bla.bla from old file 
> gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet
>  to new file 
> gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be 
> cast to org.apache.avro.generic.GenericFixed
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
>  at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
>  at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
>  at 
> org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242)
>  ... 8 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-735) Improve deltastreamer error message when case mismatch of commandline arguments.

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-735:
-
Labels: user-support-issues  (was: )

> Improve deltastreamer error message when case mismatch of commandline 
> arguments.
> 
>
> Key: HUDI-735
> URL: https://issues.apache.org/jira/browse/HUDI-735
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup, DeltaStreamer, Usability
>Reporter: Vinoth Chandar
>Assignee: Nicholas Jiang
>Priority: Major
>  Labels: user-support-issues
>
> Team,
> When following the blog "Change Capture Using AWS Database Migration
> Service and Hudi" with my own data set, the initial load works perfectly.
> When issuing the command with the DMS CDC files on S3, I get the following
> error:
> {code}
> 20/03/24 17:56:28 ERROR HoodieDeltaStreamer: Got error running delta sync
> once. Shutting down
> org.apache.hudi.exception.HoodieException: Please provide a valid schema
> provider class! at
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
>  at
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
> at
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
> {code}
> I tried using the --schemaprovider-class
> org.apache.hudi.utilities.schema.FilebasedSchemaProvider.Source and provided
> the schema. The error does not occur, but there are no writes to Hudi.
> I am not performing any transformations (other than the DMS transform) and
> using default record key strategy.
> If the team has any pointers, please let me know.
> Thank you!
> ---
> Thank you Vinoth. I was able to find the issue. All my column names were in
> high caps case. I switched column names and table names to lower case and
> it works perfectly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-767) Support transformation when export to Hudi

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-767:
-
Labels: user-support-issues  (was: )

> Support transformation when export to Hudi
> --
>
> Key: HUDI-767
> URL: https://issues.apache.org/jira/browse/HUDI-767
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Raymond Xu
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.8.0
>
>
> Main logic described in 
> https://github.com/apache/incubator-hudi/issues/1480#issuecomment-608529410
> In HoodieSnapshotExporter, we could extend the feature to include 
> transformation when --output-format hudi, using a custom Transformer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-776) Document community support triage process

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-776:
-
Labels: user-support-issues  (was: )

> Document community support triage process 
> --
>
> Key: HUDI-776
> URL: https://issues.apache.org/jira/browse/HUDI-776
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs, Release & Administrative
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: user-support-issues
>
> Per thread 
> https://lists.apache.org/thread.html/r0de5b576ea3db07e663d76d72196404b65f1624c298a6b335229c05d%40%3Cdev.hudi.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-788) Hudi writer and Hudi Reader split

2021-01-26 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272377#comment-17272377
 ] 

sivabalan narayanan commented on HUDI-788:
--

[~hainanzhongjian]: do you mind fixing the translation. 

> Hudi writer and Hudi Reader split
> -
>
> Key: HUDI-788
> URL: https://issues.apache.org/jira/browse/HUDI-788
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: wangmeng
>Priority: Minor
>
> For example:
>  * Many companies build their own clusters with CDH, but the Hive version integrated in CDH 
> has stayed at hive-exec-1.1.0, and upgrading Hive to 2.* in a production environment is 
> troublesome. So in most companies' production environments the Hive version is typically 
> hive-exec-1.1.0.
>  * Hudi 0.5.* only supports Hive 2.3 and above. As a result, users run into problems when they set 
> hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat while using Hudi.
>  * So the idea: can the logic for Hudi writing data to DFS and integrating with Hive be separated 
> from the logic for reading Hudi data from Hive? That way Hudi 0.5.* can still integrate with Hive 
> 2.3.*, while the read side only needs to add a hoodie-hadoop-mr-bundle built against the matching 
> Hive version, whether that is hive-exec-1.1.0 or hive-exec-2.3.*.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-801) Add a way to postprocess schema after it is loaded from the schema provider

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan resolved HUDI-801.
--
Fix Version/s: (was: 0.8.0)
   0.7.0
   Resolution: Fixed

> Add a way to postprocess schema after it is loaded from the schema provider
> ---
>
> Key: HUDI-801
> URL: https://issues.apache.org/jira/browse/HUDI-801
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: Alexander Filipchik
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Sometimes it is needed to postprocess schemas after they are fetched from the 
> external sources. Some examples of postprocessing:
>  * make sure all the defaults are set correctly, and update schema if not.
>  * insert marker columns into records with no fields (which are otherwise not writable as Parquet)
>  * ...
> Would be great to have a way to plug in custom post processors.
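A minimal sketch of what such a pluggable hook could look like; the interface and class names below are illustrative assumptions, not the API that ended up in Hudi:

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

// Hypothetical hook: given the schema fetched from the schema provider,
// return an adjusted schema before DeltaStreamer uses it.
interface SchemaPostProcessor {
  Schema processSchema(Schema incoming);
}

// Example processor: add a marker column to otherwise field-less records
// so they remain writable as Parquet.
class AddMarkerColumnPostProcessor implements SchemaPostProcessor {
  @Override
  public Schema processSchema(Schema incoming) {
    if (!incoming.getFields().isEmpty()) {
      return incoming;
    }
    return SchemaBuilder.record(incoming.getName())
        .fields()
        .optionalBoolean("_row_present") // hypothetical marker field name
        .endRecord();
  }
}
{code}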



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-801) Add a way to postprocess schema after it is loaded from the schema provider

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-801:


Assignee: Alexander Filipchik

> Add a way to postprocess schema after it is loaded from the schema provider
> ---
>
> Key: HUDI-801
> URL: https://issues.apache.org/jira/browse/HUDI-801
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: Alexander Filipchik
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> Sometimes it is needed to postprocess schemas after they are fetched from the 
> external sources. Some examples of postprocessing:
>  * make sure all the defaults are set correctly, and update schema if not.
>  * insert marker columns into records with no fields (which are otherwise not writable as Parquet)
>  * ...
> Would be great to have a way to plug in custom post processors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-801) Add a way to postprocess schema after it is loaded from the schema provider

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-801:
-
Status: In Progress  (was: Open)

> Add a way to postprocess schema after it is loaded from the schema provider
> ---
>
> Key: HUDI-801
> URL: https://issues.apache.org/jira/browse/HUDI-801
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> Sometimes it is needed to postprocess schemas after they are fetched from the 
> external sources. Some examples of postprocessing:
>  * make sure all the defaults are set correctly, and update schema if not.
>  * insert marker columns into records with no fields (which are otherwise not writable as Parquet)
>  * ...
> Would be great to have a way to plug in custom post processors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-801) Add a way to postprocess schema after it is loaded from the schema provider

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-801:
-
Status: Open  (was: New)

> Add a way to postprocess schema after it is loaded from the schema provider
> ---
>
> Key: HUDI-801
> URL: https://issues.apache.org/jira/browse/HUDI-801
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> Sometimes it is needed to postprocess schemas after they are fetched from the 
> external sources. Some examples of postprocessing:
>  * make sure all the defaults are set correctly, and update schema if not.
>  * insert marker columns into records with no fields (which are otherwise not writable as Parquet)
>  * ...
> Would be great to have a way to plug in custom post processors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-824) Register hudi-spark package with spark packages repo for easier usage of Hudi

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-824:
-
Labels: user-support-issues  (was: )

> Register hudi-spark package with spark packages repo for easier usage of Hudi
> -
>
> Key: HUDI-824
> URL: https://issues.apache.org/jira/browse/HUDI-824
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Nishith Agarwal
>Assignee: Vinoth Govindarajan
>Priority: Minor
>  Labels: user-support-issues
>
> At the moment, to be able to use Hudi with spark, users have to do the
> following:
>
> spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>   --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
>   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>
> Ideally, we want to be able to use Hudi as follows:
>
> spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>   --packages org.apache.hudi:hudi-spark-bundle: \
>   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-829) Efficiently reading hudi tables through spark-shell

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-829:
-
Labels: user-support-issues  (was: )

> Efficiently reading hudi tables through spark-shell
> ---
>
> Key: HUDI-829
> URL: https://issues.apache.org/jira/browse/HUDI-829
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: user-support-issues
>
> [~uditme] Created this ticket to track some discussion on read/query path of 
> spark with Hudi tables. 
> My understanding is that when you read Hudi tables through spark-shell, some 
> of your queries are slower due to some sequential activity performed by spark 
> when interacting with Hudi tables (even with 
> spark.sql.hive.convertMetastoreParquet which can give you the same data 
> reading speed and all the vectorization benefits). Is this slowness observed 
> during spark query planning ? Can you please elaborate on this ? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-825) Write a small blog on how to use hudi-spark with pyspark

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-825:
-
Labels: user-support-issues  (was: )

> Write a small blog on how to use hudi-spark with pyspark
> 
>
> Key: HUDI-825
> URL: https://issues.apache.org/jira/browse/HUDI-825
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs
>Reporter: Nishith Agarwal
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: user-support-issues
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-837) Fix AvroKafkaSource to use the latest schema for reading

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-837:
-
Labels: pull-request-available user-support-issues  (was: bug-bash-0.6.0 
pull-request-available)

> Fix AvroKafkaSource to use the latest schema for reading
> 
>
> Key: HUDI-837
> URL: https://issues.apache.org/jira/browse/HUDI-837
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available, user-support-issues
> Fix For: 0.8.0
>
>
> Currently we specify KafkaAvroDeserializer as the value for 
> value.deserializer in AvroKafkaSource. This implies each published record is
> read using the same schema with which it was written, even though the schema
> may have evolved in between. As a result, messages in the incoming batch can have
> different schemas, which then has to be handled at the time of actually writing
> records to Parquet.
> This Jira aims at providing an option to read all the messages with the same 
> schema by implementing a new custom deserializer class. 
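The underlying mechanism is standard Avro schema resolution: decode each message with its writer schema but project it onto one fixed reader schema (for example the latest one from the registry), so every record in the batch comes out with the same schema. A hedged sketch of just that resolution step, not of the AvroKafkaSource change itself:

{code:java}
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class SchemaResolutionExample {
  // Decode a payload written with writerSchema, resolving it against readerSchema,
  // so callers always see records in the (latest) reader schema. A real Kafka Avro
  // deserializer would first strip the schema registry's wire-format prefix and
  // look the writer schema up by id.
  static GenericRecord readWithLatestSchema(byte[] payload, Schema writerSchema,
                                            Schema readerSchema) throws IOException {
    GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(writerSchema, readerSchema);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
    return reader.read(null, decoder);
  }
}
{code}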



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-873) kafka connector support hudi sink

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-873:
-
Labels: user-support-issues  (was: )

> kafka  connector support hudi sink
> --
>
> Key: HUDI-873
> URL: https://issues.apache.org/jira/browse/HUDI-873
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: user-support-issues
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-851) Add Documentation on partitioning data with examples and details on how to sync to Hive

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-851:
-
Labels: user-support-issues  (was: )

> Add Documentation on partitioning data with examples and details on how to 
> sync to Hive
> ---
>
> Key: HUDI-851
> URL: https://issues.apache.org/jira/browse/HUDI-851
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs, docs-chinese
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Minor
>  Labels: user-support-issues
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-874) Schema evolution does not work with AWS Glue catalog

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-874:
-
Labels: user-support-issues  (was: )

> Schema evolution does not work with AWS Glue catalog
> 
>
> Key: HUDI-874
> URL: https://issues.apache.org/jira/browse/HUDI-874
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: Udit Mehrotra
>Priority: Major
>  Labels: user-support-issues
>
> This issue has been discussed here 
> [https://github.com/apache/incubator-hudi/issues/1581] and at other places as 
> well. Glue catalog currently does not support *cascade* for *ALTER TABLE* 
> statements. As a result features like adding new columns to an existing table 
> does now work with glue catalog .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-874) Schema evolution does not work with AWS Glue catalog

2021-01-26 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272368#comment-17272368
 ] 

sivabalan narayanan commented on HUDI-874:
--

[~uditme]: can you please look into this ticket when you can. 

> Schema evolution does not work with AWS Glue catalog
> 
>
> Key: HUDI-874
> URL: https://issues.apache.org/jira/browse/HUDI-874
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: Udit Mehrotra
>Priority: Major
>
> This issue has been discussed here 
> [https://github.com/apache/incubator-hudi/issues/1581] and at other places as 
> well. Glue catalog currently does not support *cascade* for *ALTER TABLE* 
> statements. As a result, features like adding new columns to an existing table
> do not work with the Glue catalog.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-893) Add spark datasource V2 reader support for Hudi tables

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-893:
-
Labels: user-support-issues  (was: )

> Add spark datasource V2 reader support for Hudi tables
> --
>
> Key: HUDI-893
> URL: https://issues.apache.org/jira/browse/HUDI-893
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Nishith Agarwal
>Assignee: Nan Zhu
>Priority: Major
>  Labels: user-support-issues
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-914) support different target data clusters

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-914:
-
Labels: user-support-issues  (was: )

> support different target data clusters
> --
>
> Key: HUDI-914
> URL: https://issues.apache.org/jira/browse/HUDI-914
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
>  Labels: user-support-issues
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Currently hudi-DeltaStreamer does not support writing to a different target
> cluster. The scenario is as follows: Hudi jobs generally run on an independent
> cluster, and writing data to a target data cluster normally relies on that
> cluster's core-site.xml and hdfs-site.xml. Sometimes, however, the cluster
> running the Hudi job does not have the core-site.xml and hdfs-site.xml of the
> target cluster. Specifying the target cluster's namenode IP address directly
> does allow writing, but it loses HDFS high availability. So the plan is to take
> the contents of the target cluster's core-site.xml and hdfs-site.xml as
> configuration items and configure them in Hudi's dfs-source.properties or
> kafka-source.properties file.
> Is there a better way to solve this problem?
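One way to picture the proposal is overlaying the target cluster's HDFS HA settings onto the Hadoop Configuration used by the writer instead of shipping the target's XML files. A rough sketch under that assumption; the nameservice, hosts, and the idea of sourcing these values from *-source.properties entries are all made up for illustration and are not an existing Hudi feature:

{code:java}
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class TargetClusterConfExample {
  // Apply target-cluster HDFS HA settings on top of the job configuration so the
  // nameservice resolves without the target's core-site.xml/hdfs-site.xml.
  static FileSystem targetFileSystem(Configuration baseConf) throws IOException {
    Configuration conf = new Configuration(baseConf);
    conf.set("dfs.nameservices", "targetns");                           // assumed values
    conf.set("dfs.ha.namenodes.targetns", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.targetns.nn1", "nn1-host:8020");
    conf.set("dfs.namenode.rpc-address.targetns.nn2", "nn2-host:8020");
    conf.set("dfs.client.failover.proxy.provider.targetns",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
    return FileSystem.get(URI.create("hdfs://targetns"), conf);
  }
}
{code}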



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1007) When earliestOffsets is greater than checkpoint, Hudi will not be able to successfully consume data

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1007:
--
Labels: user-support-issues  (was: )

> When earliestOffsets is greater than checkpoint, Hudi will not be able to 
> successfully consume data
> ---
>
> Key: HUDI-1007
> URL: https://issues.apache.org/jira/browse/HUDI-1007
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.8.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When using DeltaStreamer to consume from Kafka: if earliestOffsets is greater
> than the checkpoint, Hudi will not be able to consume data successfully.
> From org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen#checkupValidOffsets:
> boolean checkpointOffsetReseter = checkpointOffsets.entrySet().stream()
>     .anyMatch(offset -> offset.getValue() < earliestOffsets.get(offset.getKey()));
> return checkpointOffsetReseter ? earliestOffsets : checkpointOffsets;
> Kafka data is continuously generated, which means some of it keeps expiring
> from retention. When earliestOffsets is greater than the checkpoint,
> earliestOffsets is taken; but by that moment some of that data has already
> expired, so consumption fails again, and the process becomes an endless cycle.
> I understand this design may be intended to avoid data loss, but it leads to
> this situation. I want to fix this problem and would like to hear your opinion.
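One possible adjustment, sketched below purely to illustrate the idea and not necessarily the fix that was eventually made: instead of resetting the whole consumption to earliestOffsets when any partition's checkpoint has expired, advance only the expired partitions to their earliest still-available offset.

{code:java}
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.TopicPartition;

public class OffsetMergeExample {
  // For each partition, keep the checkpointed offset unless it has already expired
  // from Kafka retention, in which case fall back to the earliest offset that is
  // still available for that partition.
  static Map<TopicPartition, Long> mergeOffsets(Map<TopicPartition, Long> checkpointOffsets,
                                                Map<TopicPartition, Long> earliestOffsets) {
    Map<TopicPartition, Long> merged = new HashMap<>();
    earliestOffsets.forEach((tp, earliest) -> {
      long checkpointed = checkpointOffsets.getOrDefault(tp, earliest);
      merged.put(tp, Math.max(checkpointed, earliest));
    });
    return merged;
  }
}
{code}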



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] kimberlyamandalu commented on issue #1737: [SUPPORT]spark streaming create small parquet files

2021-01-26 Thread GitBox


kimberlyamandalu commented on issue #1737:
URL: https://github.com/apache/hudi/issues/1737#issuecomment-767786456


   @nsivabalan I am also seeing the same behavior for my workload. Compaction 
seems to be occurring for my MOR table as per 
hoodie.compact.inline.max.delta.commits.
   
   However, cleaner does not seem to get triggered even after setting 
hoodie.cleaner.commits.retained. 
   I don't see any cleans requested in the .hoodie folder.
   
   Can we follow up on this?
   Is anyone still observing the same problem?
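   For reference, a minimal sketch of how these two knobs are usually passed when writing a MOR table from Spark; the table name, path, and values below are placeholders rather than anything from this workload:
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   
   public class HudiWriteOptionsExample {
     static void writeBatch(Dataset<Row> df) {
       df.write().format("hudi")
           .option("hoodie.table.name", "my_table")                    // placeholder
           .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
           .option("hoodie.compact.inline", "true")
           .option("hoodie.compact.inline.max.delta.commits", "4")     // compaction cadence
           .option("hoodie.cleaner.commits.retained", "10")            // cleaning policy
           .mode(SaveMode.Append)
           .save("s3://bucket/path/my_table");                         // placeholder
     }
   }
   ```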



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zhedoubushishi opened a new pull request #2493: [WIP] Change another way to convert Path with Scheme

2021-01-26 Thread GitBox


zhedoubushishi opened a new pull request #2493:
URL: https://github.com/apache/hudi/pull/2493


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io commented on pull request #2492: [MINOR]Fix NPE when using HoodieFlinkStreamer with multi parallelism

2021-01-26 Thread GitBox


codecov-io commented on pull request #2492:
URL: https://github.com/apache/hudi/pull/2492#issuecomment-767752039


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2492?src=pr=h1) Report
   > Merging 
[#2492](https://codecov.io/gh/apache/hudi/pull/2492?src=pr=desc) (2e615c3) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/c4afd179c1983a382b8a5197d800b0f5dba254de?el=desc)
 (c4afd17) will **increase** coverage by `19.29%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2492/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2492?src=pr=tree)
   
   ```diff
   @@  Coverage Diff  @@
   ## master#2492   +/-   ##
   =
   + Coverage 50.18%   69.48%   +19.29% 
   + Complexity 3050  358 -2692 
   =
 Files   419   53  -366 
 Lines 18931 1930-17001 
 Branches   1948  230 -1718 
   =
   - Hits   9500 1341 -8159 
   + Misses 8656  456 -8200 
   + Partials775  133  -642 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.48% <ø> (+0.05%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2492?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[.../main/java/org/apache/hudi/util/AvroConvertor.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS91dGlsL0F2cm9Db252ZXJ0b3IuamF2YQ==)
 | | | |
   | 
[...he/hudi/common/table/log/block/HoodieLogBlock.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9ibG9jay9Ib29kaWVMb2dCbG9jay5qYXZh)
 | | | |
   | 
[...main/java/org/apache/hudi/HoodieFlinkStreamer.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9Ib29kaWVGbGlua1N0cmVhbWVyLmphdmE=)
 | | | |
   | 
[...in/java/org/apache/hudi/common/model/BaseFile.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0Jhc2VGaWxlLmphdmE=)
 | | | |
   | 
[...e/hudi/common/engine/HoodieLocalEngineContext.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2VuZ2luZS9Ib29kaWVMb2NhbEVuZ2luZUNvbnRleHQuamF2YQ==)
 | | | |
   | 
[.../org/apache/hudi/exception/HoodieKeyException.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUtleUV4Y2VwdGlvbi5qYXZh)
 | | | |
   | 
[...di/common/table/timeline/HoodieActiveTimeline.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL0hvb2RpZUFjdGl2ZVRpbWVsaW5lLmphdmE=)
 | | | |
   | 
[...ache/hudi/hadoop/utils/HoodieInputFormatUtils.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3V0aWxzL0hvb2RpZUlucHV0Rm9ybWF0VXRpbHMuamF2YQ==)
 | | | |
   | 
[...i/hive/SlashEncodedDayPartitionValueExtractor.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvU2xhc2hFbmNvZGVkRGF5UGFydGl0aW9uVmFsdWVFeHRyYWN0b3IuamF2YQ==)
 | | | |
   | 
[.../apache/hudi/timeline/service/TimelineService.java](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvVGltZWxpbmVTZXJ2aWNlLmphdmE=)
 | | | |
   | ... and [356 
more](https://codecov.io/gh/apache/hudi/pull/2492/diff?src=pr=tree-more) | |
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Querys

2021-01-26 Thread GitBox


vinothchandar commented on issue #2013:
URL: https://github.com/apache/hudi/issues/2013#issuecomment-767720824


   I think aws has to support/recompile against their spark version. cc 
@umehrot2 
   
   for now, you can test using apache spark ?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] fripple commented on issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Querys

2021-01-26 Thread GitBox


fripple commented on issue #2013:
URL: https://github.com/apache/hudi/issues/2013#issuecomment-767706110


   Yes, I'm using spark as provided by AWS. Is there any way to make this work 
or am I out of luck until AWS EMR supports hudi 0.7?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Querys

2021-01-26 Thread GitBox


vinothchandar commented on issue #2013:
URL: https://github.com/apache/hudi/issues/2013#issuecomment-767703987


   This error seems to be due to using the aws spark distro? This change would 
work with any table written using previous versions. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar merged pull request #2491: [MINOR] Update doap with 0.7.0 release

2021-01-26 Thread GitBox


vinothchandar merged pull request #2491:
URL: https://github.com/apache/hudi/pull/2491


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [MINOR] Update doap with 0.7.0 release (#2491)

2021-01-26 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new c8ee40f  [MINOR] Update doap with 0.7.0 release (#2491)
c8ee40f is described below

commit c8ee40f8ae34607072a27d4e7ccb21fc4df13ca1
Author: vinoth chandar 
AuthorDate: Tue Jan 26 09:28:22 2021 -0800

[MINOR] Update doap with 0.7.0 release (#2491)
---
 doap_HUDI.rdf | 5 +
 1 file changed, 5 insertions(+)

diff --git a/doap_HUDI.rdf b/doap_HUDI.rdf
index 77db135..06d5128 100644
--- a/doap_HUDI.rdf
+++ b/doap_HUDI.rdf
@@ -61,6 +61,11 @@
 2020-08-22
 0.6.0
   
+  
+Apache Hudi 0.7.0
+2021-01-25
+0.7.0
+  
 
 
   



[jira] [Commented] (HUDI-1288) DeltaSync:writeToSink fails with Unknown datum type org.apache.avro.JsonProperties$Null

2021-01-26 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272267#comment-17272267
 ] 

Vinoth Chandar commented on HUDI-1288:
--

 https://cwiki.apache.org/confluence/display/HUDI/Release+Management talks 
about this in more detail. We are not planning on doing backports, rather we 
want to make rolling forward to a newer release much easier/smoother. 


> DeltaSync:writeToSink fails with Unknown datum type 
> org.apache.avro.JsonProperties$Null
> ---
>
> Key: HUDI-1288
> URL: https://issues.apache.org/jira/browse/HUDI-1288
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Michal Swiatowy
>Priority: Major
>  Labels: user-support-issues
>
> After updating to Hudi version 0.5.3 (prev. 0.5.2-incubating) I run into 
> following error message on write to HDFS:
> {code:java}
> 2020-09-18 12:54:38,651 [Driver] INFO  
> HoodieTableMetaClient:initTableAndGetMetaClient:379 - Finished initializing 
> Table of type MERGE_ON_READ from 
> /master_data/6FQS/hudi_test/S_INCOMINGMESSAGEDETAIL_CDC
> 2020-09-18 12:54:38,663 [Driver] INFO  DeltaSync:setupWriteClient:470 - 
> Setting up Hoodie Write Client
> 2020-09-18 12:54:38,695 [Driver] INFO  DeltaSync:registerAvroSchemas:522 - 
> Registering Schema 
> 

[jira] [Resolved] (HUDI-1029) Use FastDateFormat for parsing and formating in TimestampBasedKeyGenerator

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan resolved HUDI-1029.
---
  Assignee: Pratyaksh Sharma
Resolution: Invalid

> Use FastDateFormat  for parsing  and formating in TimestampBasedKeyGenerator
> 
>
> Key: HUDI-1029
> URL: https://issues.apache.org/jira/browse/HUDI-1029
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: steven zhang
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>
> 1. In the TimestampBasedKeyGenerator#getKey method, generating a HoodieKey
> creates a new SimpleDateFormat object; the date format object could be reused
> as a class variable.
> 2. SimpleDateFormat is not thread safe, so there is always a potential
> thread-safety problem.
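A minimal sketch of the suggested change using commons-lang3's thread-safe FastDateFormat held as a class-level constant; the pattern string is only an example:

{code:java}
import java.text.ParseException;
import java.util.Date;
import org.apache.commons.lang3.time.FastDateFormat;

public class TimestampFormatExample {
  // FastDateFormat is immutable and thread-safe, so one shared instance can serve
  // every getKey() call instead of allocating a new SimpleDateFormat per record.
  private static final FastDateFormat OUTPUT_FORMAT =
      FastDateFormat.getInstance("yyyyMMddHHmmss"); // example pattern

  static String formatPartition(long epochMillis) {
    return OUTPUT_FORMAT.format(new Date(epochMillis));
  }

  static Date parsePartition(String value) throws ParseException {
    return OUTPUT_FORMAT.parse(value);
  }
}
{code}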



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1024) Document S3 related guide and tips

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1024:
--
Labels: documentation user-support-issues  (was: documentation)

> Document S3 related guide and tips
> --
>
> Key: HUDI-1024
> URL: https://issues.apache.org/jira/browse/HUDI-1024
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Raymond Xu
>Priority: Minor
>  Labels: documentation, user-support-issues
> Fix For: 0.8.0
>
>
> Create a section in docs website for Hudi on S3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1022) Document examples for Spark structured streaming writing into Hudi

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1022:
--
Labels: user-support-issues  (was: )

> Document examples for Spark structured streaming writing into Hudi
> --
>
> Key: HUDI-1022
> URL: https://issues.apache.org/jira/browse/HUDI-1022
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Usability
>Reporter: Bhavani Sudha
>Assignee: Felix Kizhakkel Jose
>Priority: Minor
>  Labels: user-support-issues
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1020) Making timeline server as an external long running service and extending it to be able to plugin business metadata

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1020:
--
Labels: user-support-issues  (was: )

> Making timeline server as an external long running service and extending it 
> to be able to plugin business metadata 
> ---
>
> Key: HUDI-1020
> URL: https://issues.apache.org/jira/browse/HUDI-1020
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Bhavani Sudha
>Priority: Major
>  Labels: user-support-issues
>
> Based on the description in the mailing thread - 
> [https://www.mail-archive.com/dev@hudi.apache.org/msg02917.html] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1029) Use FastDateFormat for parsing and formating in TimestampBasedKeyGenerator

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1029:
--
Status: Open  (was: New)

> Use FastDateFormat  for parsing  and formating in TimestampBasedKeyGenerator
> 
>
> Key: HUDI-1029
> URL: https://issues.apache.org/jira/browse/HUDI-1029
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: steven zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>
> 1. In the TimestampBasedKeyGenerator#getKey method, generating a HoodieKey
> creates a new SimpleDateFormat object; the date format object could be reused
> as a class variable.
> 2. SimpleDateFormat is not thread safe, so there is always a potential
> thread-safety problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1029) Use FastDateFormat for parsing and formating in TimestampBasedKeyGenerator

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1029:
--
Status: In Progress  (was: Open)

> Use FastDateFormat  for parsing  and formating in TimestampBasedKeyGenerator
> 
>
> Key: HUDI-1029
> URL: https://issues.apache.org/jira/browse/HUDI-1029
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: steven zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>
> 1. In the TimestampBasedKeyGenerator#getKey method, generating a HoodieKey
> creates a new SimpleDateFormat object; the date format object could be reused
> as a class variable.
> 2. SimpleDateFormat is not thread safe, so there is always a potential
> thread-safety problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1292) [Umbrella] RFC-15 : File Listing and Query Planning Optimizations

2021-01-26 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272266#comment-17272266
 ] 

Vinoth Chandar commented on HUDI-1292:
--

[~shivnarayan] why is this marked as a user-support-issue? 

> [Umbrella] RFC-15 : File Listing and Query Planning Optimizations 
> --
>
> Key: HUDI-1292
> URL: https://issues.apache.org/jira/browse/HUDI-1292
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available, user-support-issues
> Fix For: 0.8.0
>
>
> This is the umbrella ticket that tracks the overall implementation of RFC-15



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1058) Make delete marker configurable

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1058:
--
Labels: pull-request-available user-support-issues  (was: 
pull-request-available)

> Make delete marker configurable
> ---
>
> Key: HUDI-1058
> URL: https://issues.apache.org/jira/browse/HUDI-1058
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Raymond Xu
>Assignee: shenh062326
>Priority: Major
>  Labels: pull-request-available, user-support-issues
>
> users can specify any boolean field for delete marker and 
> `_hoodie_is_deleted` remains as default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1036) HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit

2021-01-26 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272264#comment-17272264
 ] 

sivabalan narayanan commented on HUDI-1036:
---

[~nishith29]: any follow up on this. 

> HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit
> ---
>
> Key: HUDI-1036
> URL: https://issues.apache.org/jira/browse/HUDI-1036
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Bhavani Sudha
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.8.0
>
>
> Opening this Jira based on the GitHub issue reported here - 
> [https://github.com/apache/hudi/issues/1735] when hive.input.format = 
> org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat it is not able to 
> create HoodieRealtimeFileSplit for querying the _rt table. Please see the GitHub 
> issue for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1036) HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit

2021-01-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1036:
--
Labels: user-support-issues  (was: )

> HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit
> ---
>
> Key: HUDI-1036
> URL: https://issues.apache.org/jira/browse/HUDI-1036
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Bhavani Sudha
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.8.0
>
>
> Opening this Jira based on the GitHub issue reported here - 
> [https://github.com/apache/hudi/issues/1735] when hive.input.format = 
> org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat it is not able to 
> create HoodieRealtimeFileSplit for querying the _rt table. Please see the GitHub 
> issue for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

