[jira] [Commented] (HUDI-2256) Remove the while loop from BucketAssigner new bucket id algorithm

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390894#comment-17390894
 ] 

ASF GitHub Bot commented on HUDI-2256:
--

hudi-bot edited a comment on pull request #3374:
URL: https://github.com/apache/hudi/pull/3374#issuecomment-889879689


   
   ## CI report:
   
   * 35cea92e2586fd6df21aa8cdf113337813813a89 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1278)
 
   * b8331b618e320dfc8704babb533d6287ff2480f3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1281)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Remove the while loop from BucketAssigner new bucket id algorithm
> -
>
> Key: HUDI-2256
> URL: https://issues.apache.org/jira/browse/HUDI-2256
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3374: [HUDI-2256] Remove the while loop from BucketAssigner new bucket id a…

2021-07-30 Thread GitBox


hudi-bot edited a comment on pull request #3374:
URL: https://github.com/apache/hudi/pull/3374#issuecomment-889879689


   
   ## CI report:
   
   * 35cea92e2586fd6df21aa8cdf113337813813a89 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1278)
 
   * b8331b618e320dfc8704babb533d6287ff2480f3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1281)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   






[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390892#comment-17390892
 ] 

ASF GitHub Bot commented on HUDI-1138:
--

yihua commented on a change in pull request #3233:
URL: https://github.com/apache/hudi/pull/3233#discussion_r680311690



##
File path: 
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java
##
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.timeline.service.handlers;
+
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.metrics.Registry;
+import org.apache.hudi.common.model.IOType;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.timeline.service.TimelineService;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import io.javalin.Context;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.OutputStreamWriter;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.timeline.service.RequestHandler.jsonifyResult;
+
+/**
+ * REST Handler servicing marker requests.
+ *
+ * The marker creation requests are handled asynchronously, while other types of requests
+ * are handled synchronously.
+ *
+ * Marker creation requests are batch processed periodically by a thread.  Each batch
+ * processing thread adds new markers to a marker file.  Given that marker file operations
+ * can take time, multiple concurrent threads can run at the same time, while they operate
+ * on different marker files storing mutually exclusive marker entries.  At any given
+ * time, a marker file is touched by at most one thread to guarantee consistency.
+ * Below is an example of running batch processing threads.
+ *
+ *   |----------------------------------------| batch interval
+ * Thread 1  |--------------->| writing to MARKERS1
+ * Thread 2      |--------------->| writing to MARKERS2
+ * Thread 3          |--------------->| writing to MARKERS3
+ */
+public class MarkerHandler extends Handler {
+  public static final String MARKERS_FILENAME_PREFIX = "MARKERS";
+  private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+  private static final Logger LOG = LogManager.getLogger(MarkerHandler.class);
+  // Margin time for scheduling the processing of the next batch of marker creation requests
+  private static final long SCHEDULING_MARGIN_TIME_MS = 5L;
+
+  private final Registry metricsRegistry;
+  private final ScheduledExecutorService executorService;
+  // A cached copy of all markers in memory
+  // Mapping: {markerDirPath -> all markers}
+  private final Map<String, Set<String>> allMarkersMap = new HashMap<>();
+  // A cached copy of marker entries in each marker file, stored in StringBuilder for efficient appending
+  // Mapping: {markerDirPath -> {markerFileIndex -> markers}}
+  private final Map<String, Map<Integer, StringBuilder>> fileMarkersMap = new HashMap<>();
+  // 

[jira] [Created] (HUDI-2258) Metadata table for flink

2021-07-30 Thread Danny Chen (Jira)
Danny Chen created HUDI-2258:


 Summary: Metadata table for flink
 Key: HUDI-2258
 URL: https://issues.apache.org/jira/browse/HUDI-2258
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Flink Integration
Reporter: Danny Chen
Assignee: Danny Chen
 Fix For: 0.9.0








[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390893#comment-17390893
 ] 

ASF GitHub Bot commented on HUDI-1138:
--

yihua commented on a change in pull request #3233:
URL: https://github.com/apache/hudi/pull/3233#discussion_r680311786



##
File path: 
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java
##
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.timeline.service.handlers;
+
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.metrics.Registry;
+import org.apache.hudi.common.model.IOType;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.timeline.service.TimelineService;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import io.javalin.Context;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.OutputStreamWriter;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.timeline.service.RequestHandler.jsonifyResult;
+
+/**
+ * REST Handler servicing marker requests.
+ *
+ * The marker creation requests are handled asynchronously, while other types of requests
+ * are handled synchronously.
+ *
+ * Marker creation requests are batch processed periodically by a thread.  Each batch
+ * processing thread adds new markers to a marker file.  Given that marker file operations
+ * can take time, multiple concurrent threads can run at the same time, while they operate
+ * on different marker files storing mutually exclusive marker entries.  At any given
+ * time, a marker file is touched by at most one thread to guarantee consistency.
+ * Below is an example of running batch processing threads.
+ *
+ *   |----------------------------------------| batch interval
+ * Thread 1  |--------------->| writing to MARKERS1
+ * Thread 2      |--------------->| writing to MARKERS2
+ * Thread 3          |--------------->| writing to MARKERS3
+ */
+public class MarkerHandler extends Handler {
+  public static final String MARKERS_FILENAME_PREFIX = "MARKERS";
+  private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+  private static final Logger LOG = LogManager.getLogger(MarkerHandler.class);
+  // Margin time for scheduling the processing of the next batch of marker creation requests
+  private static final long SCHEDULING_MARGIN_TIME_MS = 5L;
+
+  private final Registry metricsRegistry;
+  private final ScheduledExecutorService executorService;
+  // A cached copy of all markers in memory
+  // Mapping: {markerDirPath -> all markers}
+  private final Map<String, Set<String>> allMarkersMap = new HashMap<>();
+  // A cached copy of marker entries in each marker file, stored in StringBuilder for efficient appending
+  // Mapping: {markerDirPath -> {markerFileIndex -> markers}}
+  private final Map<String, Map<Integer, StringBuilder>> fileMarkersMap = new HashMap<>();
+  // 

[GitHub] [hudi] yihua commented on a change in pull request #3233: [HUDI-1138] Add timeline-server-based marker file strategy for improving marker-related latency

2021-07-30 Thread GitBox


yihua commented on a change in pull request #3233:
URL: https://github.com/apache/hudi/pull/3233#discussion_r680311786



##
File path: 
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java
##
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.timeline.service.handlers;
+
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.metrics.Registry;
+import org.apache.hudi.common.model.IOType;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.timeline.service.TimelineService;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import io.javalin.Context;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.OutputStreamWriter;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.timeline.service.RequestHandler.jsonifyResult;
+
+/**
+ * REST Handler servicing marker requests.
+ *
+ * The marker creation requests are handled asynchronously, while other types of requests
+ * are handled synchronously.
+ *
+ * Marker creation requests are batch processed periodically by a thread.  Each batch
+ * processing thread adds new markers to a marker file.  Given that marker file operations
+ * can take time, multiple concurrent threads can run at the same time, while they operate
+ * on different marker files storing mutually exclusive marker entries.  At any given
+ * time, a marker file is touched by at most one thread to guarantee consistency.
+ * Below is an example of running batch processing threads.
+ *
+ *   |----------------------------------------| batch interval
+ * Thread 1  |--------------->| writing to MARKERS1
+ * Thread 2      |--------------->| writing to MARKERS2
+ * Thread 3          |--------------->| writing to MARKERS3
+ */
+public class MarkerHandler extends Handler {
+  public static final String MARKERS_FILENAME_PREFIX = "MARKERS";
+  private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+  private static final Logger LOG = LogManager.getLogger(MarkerHandler.class);
+  // Margin time for scheduling the processing of the next batch of marker creation requests
+  private static final long SCHEDULING_MARGIN_TIME_MS = 5L;
+
+  private final Registry metricsRegistry;
+  private final ScheduledExecutorService executorService;
+  // A cached copy of all markers in memory
+  // Mapping: {markerDirPath -> all markers}
+  private final Map<String, Set<String>> allMarkersMap = new HashMap<>();
+  // A cached copy of marker entries in each marker file, stored in StringBuilder for efficient appending
+  // Mapping: {markerDirPath -> {markerFileIndex -> markers}}
+  private final Map<String, Map<Integer, StringBuilder>> fileMarkersMap = new HashMap<>();
+  // A list of pending futures from async marker creation requests
+  private final List createMarkerFutures = new ArrayList<>();
+  // A list of use status of marker files. {@code true} means the file is in use by a {@code 

[GitHub] [hudi] yihua commented on a change in pull request #3233: [HUDI-1138] Add timeline-server-based marker file strategy for improving marker-related latency

2021-07-30 Thread GitBox


yihua commented on a change in pull request #3233:
URL: https://github.com/apache/hudi/pull/3233#discussion_r680311690



##
File path: 
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java
##
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.timeline.service.handlers;
+
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.metrics.Registry;
+import org.apache.hudi.common.model.IOType;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.timeline.service.TimelineService;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import io.javalin.Context;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.OutputStreamWriter;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.timeline.service.RequestHandler.jsonifyResult;
+
+/**
+ * REST Handler servicing marker requests.
+ *
+ * The marker creation requests are handled asynchronously, while other types of requests
+ * are handled synchronously.
+ *
+ * Marker creation requests are batch processed periodically by a thread.  Each batch
+ * processing thread adds new markers to a marker file.  Given that marker file operations
+ * can take time, multiple concurrent threads can run at the same time, while they operate
+ * on different marker files storing mutually exclusive marker entries.  At any given
+ * time, a marker file is touched by at most one thread to guarantee consistency.
+ * Below is an example of running batch processing threads.
+ *
+ *   |----------------------------------------| batch interval
+ * Thread 1  |--------------->| writing to MARKERS1
+ * Thread 2      |--------------->| writing to MARKERS2
+ * Thread 3          |--------------->| writing to MARKERS3
+ */
+public class MarkerHandler extends Handler {
+  public static final String MARKERS_FILENAME_PREFIX = "MARKERS";
+  private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+  private static final Logger LOG = LogManager.getLogger(MarkerHandler.class);
+  // Margin time for scheduling the processing of the next batch of marker creation requests
+  private static final long SCHEDULING_MARGIN_TIME_MS = 5L;
+
+  private final Registry metricsRegistry;
+  private final ScheduledExecutorService executorService;
+  // A cached copy of all markers in memory
+  // Mapping: {markerDirPath -> all markers}
+  private final Map<String, Set<String>> allMarkersMap = new HashMap<>();
+  // A cached copy of marker entries in each marker file, stored in StringBuilder for efficient appending
+  // Mapping: {markerDirPath -> {markerFileIndex -> markers}}
+  private final Map<String, Map<Integer, StringBuilder>> fileMarkersMap = new HashMap<>();
+  // A list of pending futures from async marker creation requests
+  private final List createMarkerFutures = new ArrayList<>();
+  // A list of use status of marker files. {@code true} means the file is in use by a {@code 
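The truncated field comment above describes per-file use flags: {@code true} means a batch-processing thread is currently writing to that marker file, which is how each MARKERS&lt;n&gt; file ends up touched by at most one thread at a time. Below is a minimal, self-contained sketch of that idea; the class and method names are assumptions for illustration, not the code from the PR.

```java
// Hypothetical sketch only: names are assumptions, not the PR's actual implementation.
import java.util.ArrayList;
import java.util.List;

public class MarkerFileUseStatusSketch {
  // markerFilesUseStatus.get(i) == true means MARKERS<i> is currently being written to
  private final List<Boolean> markerFilesUseStatus;

  public MarkerFileUseStatusSketch(int numMarkerFiles) {
    this.markerFilesUseStatus = new ArrayList<>(numMarkerFiles);
    for (int i = 0; i < numMarkerFiles; i++) {
      markerFilesUseStatus.add(false);
    }
  }

  // Claims a marker file that is not in use; returns its index, or -1 if all are busy
  public synchronized int acquireFreeMarkerFileIndex() {
    for (int i = 0; i < markerFilesUseStatus.size(); i++) {
      if (!markerFilesUseStatus.get(i)) {
        markerFilesUseStatus.set(i, true);
        return i;
      }
    }
    return -1;
  }

  // Releases the marker file after the batch-processing thread finishes writing to it
  public synchronized void releaseMarkerFileIndex(int index) {
    markerFilesUseStatus.set(index, false);
  }
}
```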

[jira] [Commented] (HUDI-2256) Remove the while loop from BucketAssigner new bucket id algorithm

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390891#comment-17390891
 ] 

ASF GitHub Bot commented on HUDI-2256:
--

hudi-bot edited a comment on pull request #3374:
URL: https://github.com/apache/hudi/pull/3374#issuecomment-889879689


   
   ## CI report:
   
   * 35cea92e2586fd6df21aa8cdf113337813813a89 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1278)
 
   * b8331b618e320dfc8704babb533d6287ff2480f3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   




> Remove the while loop from BucketAssigner new bucket id algorithm
> -
>
> Key: HUDI-2256
> URL: https://issues.apache.org/jira/browse/HUDI-2256
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390890#comment-17390890
 ] 

ASF GitHub Bot commented on HUDI-1138:
--

yihua commented on a change in pull request #3233:
URL: https://github.com/apache/hudi/pull/3233#discussion_r680311587



##
File path: 
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java
##
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.timeline.service.handlers;
+
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.metrics.Registry;
+import org.apache.hudi.common.model.IOType;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.timeline.service.TimelineService;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import io.javalin.Context;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.OutputStreamWriter;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.timeline.service.RequestHandler.jsonifyResult;
+
+/**
+ * REST Handler servicing marker requests.
+ *
+ * The marker creation requests are handled asynchronously, while other types of requests
+ * are handled synchronously.
+ *
+ * Marker creation requests are batch processed periodically by a thread.  Each batch
+ * processing thread adds new markers to a marker file.  Given that marker file operations
+ * can take time, multiple concurrent threads can run at the same time, while they operate
+ * on different marker files storing mutually exclusive marker entries.  At any given
+ * time, a marker file is touched by at most one thread to guarantee consistency.
+ * Below is an example of running batch processing threads.
+ *
+ *   |----------------------------------------| batch interval
+ * Thread 1  |--------------->| writing to MARKERS1
+ * Thread 2      |--------------->| writing to MARKERS2
+ * Thread 3          |--------------->| writing to MARKERS3
+ */
+public class MarkerHandler extends Handler {
+  public static final String MARKERS_FILENAME_PREFIX = "MARKERS";
+  private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+  private static final Logger LOG = LogManager.getLogger(MarkerHandler.class);
+  // Margin time for scheduling the processing of the next batch of marker creation requests
+  private static final long SCHEDULING_MARGIN_TIME_MS = 5L;
+
+  private final Registry metricsRegistry;
+  private final ScheduledExecutorService executorService;
+  // A cached copy of all markers in memory
+  // Mapping: {markerDirPath -> all markers}
+  private final Map<String, Set<String>> allMarkersMap = new HashMap<>();
+  // A cached copy of marker entries in each marker file, stored in StringBuilder for efficient appending
+  // Mapping: {markerDirPath -> {markerFileIndex -> markers}}
+  private final Map<String, Map<Integer, StringBuilder>> fileMarkersMap = new HashMap<>();
+  // 

[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390889#comment-17390889
 ] 

ASF GitHub Bot commented on HUDI-1138:
--

yihua commented on a change in pull request #3233:
URL: https://github.com/apache/hudi/pull/3233#discussion_r680311535



##
File path: 
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java
##
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.timeline.service.handlers;
+
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.metrics.Registry;
+import org.apache.hudi.common.model.IOType;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.timeline.service.TimelineService;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import io.javalin.Context;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.OutputStreamWriter;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.timeline.service.RequestHandler.jsonifyResult;
+
+/**
+ * REST Handler servicing marker requests.
+ *
+ * The marker creation requests are handled asynchronously, while other types of requests
+ * are handled synchronously.
+ *
+ * Marker creation requests are batch processed periodically by a thread.  Each batch
+ * processing thread adds new markers to a marker file.  Given that marker file operations
+ * can take time, multiple concurrent threads can run at the same time, while they operate
+ * on different marker files storing mutually exclusive marker entries.  At any given
+ * time, a marker file is touched by at most one thread to guarantee consistency.
+ * Below is an example of running batch processing threads.
+ *
+ *   |----------------------------------------| batch interval
+ * Thread 1  |--------------->| writing to MARKERS1
+ * Thread 2      |--------------->| writing to MARKERS2
+ * Thread 3          |--------------->| writing to MARKERS3
+ */
+public class MarkerHandler extends Handler {
+  public static final String MARKERS_FILENAME_PREFIX = "MARKERS";
+  private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+  private static final Logger LOG = LogManager.getLogger(MarkerHandler.class);
+  // Margin time for scheduling the processing of the next batch of marker creation requests
+  private static final long SCHEDULING_MARGIN_TIME_MS = 5L;
+
+  private final Registry metricsRegistry;
+  private final ScheduledExecutorService executorService;
+  // A cached copy of all markers in memory
+  // Mapping: {markerDirPath -> all markers}
+  private final Map<String, Set<String>> allMarkersMap = new HashMap<>();
+  // A cached copy of marker entries in each marker file, stored in StringBuilder for efficient appending
+  // Mapping: {markerDirPath -> {markerFileIndex -> markers}}
+  private final Map<String, Map<Integer, StringBuilder>> fileMarkersMap = new HashMap<>();


[GitHub] [hudi] hudi-bot edited a comment on pull request #3374: [HUDI-2256] Remove the while loop from BucketAssigner new bucket id a…

2021-07-30 Thread GitBox


hudi-bot edited a comment on pull request #3374:
URL: https://github.com/apache/hudi/pull/3374#issuecomment-889879689


   
   ## CI report:
   
   * 35cea92e2586fd6df21aa8cdf113337813813a89 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1278)
 
   * b8331b618e320dfc8704babb533d6287ff2480f3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] yihua commented on a change in pull request #3233: [HUDI-1138] Add timeline-server-based marker file strategy for improving marker-related latency

2021-07-30 Thread GitBox


yihua commented on a change in pull request #3233:
URL: https://github.com/apache/hudi/pull/3233#discussion_r680311587



##
File path: 
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java
##
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.timeline.service.handlers;
+
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.metrics.Registry;
+import org.apache.hudi.common.model.IOType;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.timeline.service.TimelineService;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import io.javalin.Context;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.OutputStreamWriter;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.timeline.service.RequestHandler.jsonifyResult;
+
+/**
+ * REST Handler servicing marker requests.
+ *
+ * The marker creation requests are handled asynchronously, while other types of requests
+ * are handled synchronously.
+ *
+ * Marker creation requests are batch processed periodically by a thread.  Each batch
+ * processing thread adds new markers to a marker file.  Given that marker file operations
+ * can take time, multiple concurrent threads can run at the same time, while they operate
+ * on different marker files storing mutually exclusive marker entries.  At any given
+ * time, a marker file is touched by at most one thread to guarantee consistency.
+ * Below is an example of running batch processing threads.
+ *
+ *   |----------------------------------------| batch interval
+ * Thread 1  |--------------->| writing to MARKERS1
+ * Thread 2      |--------------->| writing to MARKERS2
+ * Thread 3          |--------------->| writing to MARKERS3
+ */
+public class MarkerHandler extends Handler {
+  public static final String MARKERS_FILENAME_PREFIX = "MARKERS";
+  private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+  private static final Logger LOG = LogManager.getLogger(MarkerHandler.class);
+  // Margin time for scheduling the processing of the next batch of marker creation requests
+  private static final long SCHEDULING_MARGIN_TIME_MS = 5L;
+
+  private final Registry metricsRegistry;
+  private final ScheduledExecutorService executorService;
+  // A cached copy of all markers in memory
+  // Mapping: {markerDirPath -> all markers}
+  private final Map<String, Set<String>> allMarkersMap = new HashMap<>();
+  // A cached copy of marker entries in each marker file, stored in StringBuilder for efficient appending
+  // Mapping: {markerDirPath -> {markerFileIndex -> markers}}
+  private final Map<String, Map<Integer, StringBuilder>> fileMarkersMap = new HashMap<>();
+  // A list of pending futures from async marker creation requests
+  private final List createMarkerFutures = new ArrayList<>();
+  // A list of use status of marker files. {@code true} means the file is in use by a {@code 

[GitHub] [hudi] yihua commented on a change in pull request #3233: [HUDI-1138] Add timeline-server-based marker file strategy for improving marker-related latency

2021-07-30 Thread GitBox


yihua commented on a change in pull request #3233:
URL: https://github.com/apache/hudi/pull/3233#discussion_r680311535



##
File path: 
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java
##
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.timeline.service.handlers;
+
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.metrics.Registry;
+import org.apache.hudi.common.model.IOType;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.timeline.service.TimelineService;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import io.javalin.Context;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.OutputStreamWriter;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.timeline.service.RequestHandler.jsonifyResult;
+
+/**
+ * REST Handler servicing marker requests.
+ *
+ * The marker creation requests are handled asynchronously, while other types of requests
+ * are handled synchronously.
+ *
+ * Marker creation requests are batch processed periodically by a thread.  Each batch
+ * processing thread adds new markers to a marker file.  Given that marker file operations
+ * can take time, multiple concurrent threads can run at the same time, while they operate
+ * on different marker files storing mutually exclusive marker entries.  At any given
+ * time, a marker file is touched by at most one thread to guarantee consistency.
+ * Below is an example of running batch processing threads.
+ *
+ *   |----------------------------------------| batch interval
+ * Thread 1  |--------------->| writing to MARKERS1
+ * Thread 2      |--------------->| writing to MARKERS2
+ * Thread 3          |--------------->| writing to MARKERS3
+ */
+public class MarkerHandler extends Handler {
+  public static final String MARKERS_FILENAME_PREFIX = "MARKERS";
+  private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+  private static final Logger LOG = LogManager.getLogger(MarkerHandler.class);
+  // Margin time for scheduling the processing of the next batch of marker creation requests
+  private static final long SCHEDULING_MARGIN_TIME_MS = 5L;
+
+  private final Registry metricsRegistry;
+  private final ScheduledExecutorService executorService;
+  // A cached copy of all markers in memory
+  // Mapping: {markerDirPath -> all markers}
+  private final Map<String, Set<String>> allMarkersMap = new HashMap<>();
+  // A cached copy of marker entries in each marker file, stored in StringBuilder for efficient appending
+  // Mapping: {markerDirPath -> {markerFileIndex -> markers}}
+  private final Map<String, Map<Integer, StringBuilder>> fileMarkersMap = new HashMap<>();

Review comment:
   Yes, I added a new class, `MarkerDirState`, to store the states of a 
single marker directory and operate on the markers.
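Below is a minimal sketch of what such a per-directory state holder could look like, based only on the field comments quoted above ({markerDirPath -> all markers} and {markerFileIndex -> markers}); the class and method names are assumptions for illustration and not the actual `MarkerDirState` added in the PR.

```java
// Hypothetical sketch only: names and behavior are inferred from the comments above.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MarkerDirStateSketch {
  // Directory under which the marker files (MARKERS0, MARKERS1, ...) live
  private final String markerDirPath;
  // All markers known for this directory, mirroring "{markerDirPath -> all markers}"
  private final Set<String> allMarkers = new HashSet<>();
  // Pending marker entries per marker file index, mirroring "{markerFileIndex -> markers}"
  private final Map<Integer, StringBuilder> fileMarkers = new HashMap<>();

  public MarkerDirStateSketch(String markerDirPath) {
    this.markerDirPath = markerDirPath;
  }

  // Records a new marker and appends it to the buffer of the chosen marker file;
  // returns false if the marker already exists.
  public synchronized boolean addMarker(String markerName, int markerFileIndex) {
    if (!allMarkers.add(markerName)) {
      return false;
    }
    fileMarkers.computeIfAbsent(markerFileIndex, k -> new StringBuilder())
        .append(markerName).append('\n');
    return true;
  }

  public synchronized Set<String> getAllMarkers() {
    return new HashSet<>(allMarkers);
  }

  public String getMarkerDirPath() {
    return markerDirPath;
  }
}
```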

##
File path: 

[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390888#comment-17390888
 ] 

ASF GitHub Bot commented on HUDI-1138:
--

yihua commented on a change in pull request #3233:
URL: https://github.com/apache/hudi/pull/3233#discussion_r680310395



##
File path: 
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java
##
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.timeline.service.handlers;
+
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.metrics.Registry;
+import org.apache.hudi.common.model.IOType;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.timeline.service.TimelineService;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import io.javalin.Context;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.OutputStreamWriter;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.timeline.service.RequestHandler.jsonifyResult;
+
+/**
+ * REST Handler servicing marker requests.
+ *
+ * The marker creation requests are handled asynchronously, while other types of requests
+ * are handled synchronously.
+ *
+ * Marker creation requests are batch processed periodically by a thread.  Each batch
+ * processing thread adds new markers to a marker file.  Given that marker file operations
+ * can take time, multiple concurrent threads can run at the same time, while they operate
+ * on different marker files storing mutually exclusive marker entries.  At any given
+ * time, a marker file is touched by at most one thread to guarantee consistency.
+ * Below is an example of running batch processing threads.
+ *
+ *   |----------------------------------------| batch interval
+ * Thread 1  |--------------->| writing to MARKERS1
+ * Thread 2      |--------------->| writing to MARKERS2
+ * Thread 3          |--------------->| writing to MARKERS3
+ */
+public class MarkerHandler extends Handler {
+  public static final String MARKERS_FILENAME_PREFIX = "MARKERS";
+  private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+  private static final Logger LOG = LogManager.getLogger(MarkerHandler.class);
+  // Margin time for scheduling the processing of the next batch of marker creation requests
+  private static final long SCHEDULING_MARGIN_TIME_MS = 5L;

Review comment:
   Based on the discussion, we don't use this config anymore.
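For context, the class comment above describes marker creation requests being queued and batch-processed periodically. Below is a minimal sketch of that scheduling pattern with a `ScheduledExecutorService`; the queue type, interval handling, and names are assumptions for illustration, not the PR's actual implementation.

```java
// Hypothetical sketch of periodic batching of pending marker creation requests.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MarkerBatchingSketch {
  private final ScheduledExecutorService executorService = Executors.newSingleThreadScheduledExecutor();
  // Requests queued by handlers; each has a future that is completed once its batch is written
  private final ConcurrentLinkedQueue<CompletableFuture<String>> pendingRequests = new ConcurrentLinkedQueue<>();

  // Schedules the batch-processing task at a fixed interval
  public void start(long batchIntervalMs) {
    executorService.scheduleAtFixedRate(this::processBatch, batchIntervalMs, batchIntervalMs, TimeUnit.MILLISECONDS);
  }

  // Drains the currently pending requests, processes them as one batch, then completes the futures
  private void processBatch() {
    List<CompletableFuture<String>> batch = new ArrayList<>();
    CompletableFuture<String> next;
    while ((next = pendingRequests.poll()) != null) {
      batch.add(next);
    }
    // In the real handler this is where the batch would be appended to a MARKERS<n> file;
    // here we simply acknowledge each request.
    batch.forEach(f -> f.complete("created"));
  }

  public CompletableFuture<String> submit() {
    CompletableFuture<String> future = new CompletableFuture<>();
    pendingRequests.add(future);
    return future;
  }

  public void stop() {
    executorService.shutdown();
  }
}
```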

##
File path: 
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java
##
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this 

[GitHub] [hudi] yihua commented on a change in pull request #3233: [HUDI-1138] Add timeline-server-based marker file strategy for improving marker-related latency

2021-07-30 Thread GitBox


yihua commented on a change in pull request #3233:
URL: https://github.com/apache/hudi/pull/3233#discussion_r680310395



##
File path: 
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java
##
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.timeline.service.handlers;
+
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.metrics.Registry;
+import org.apache.hudi.common.model.IOType;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.timeline.service.TimelineService;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import io.javalin.Context;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.OutputStreamWriter;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.timeline.service.RequestHandler.jsonifyResult;
+
+/**
+ * REST Handler servicing marker requests.
+ *
+ * The marker creation requests are handled asynchronously, while other types of requests
+ * are handled synchronously.
+ *
+ * Marker creation requests are batch processed periodically by a thread.  Each batch
+ * processing thread adds new markers to a marker file.  Given that marker file operations
+ * can take time, multiple concurrent threads can run at the same time, while they operate
+ * on different marker files storing mutually exclusive marker entries.  At any given
+ * time, a marker file is touched by at most one thread to guarantee consistency.
+ * Below is an example of running batch processing threads.
+ *
+ *   |----------------------------------------| batch interval
+ * Thread 1  |--------------->| writing to MARKERS1
+ * Thread 2      |--------------->| writing to MARKERS2
+ * Thread 3          |--------------->| writing to MARKERS3
+ */
+public class MarkerHandler extends Handler {
+  public static final String MARKERS_FILENAME_PREFIX = "MARKERS";
+  private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+  private static final Logger LOG = LogManager.getLogger(MarkerHandler.class);
+  // Margin time for scheduling the processing of the next batch of marker creation requests
+  private static final long SCHEDULING_MARGIN_TIME_MS = 5L;

Review comment:
   Based on the discussion, we don't use this config anymore.

##
File path: 
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java
##
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  

[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390887#comment-17390887
 ] 

ASF GitHub Bot commented on HUDI-1138:
--

hudi-bot edited a comment on pull request #3233:
URL: https://github.com/apache/hudi/pull/3233#issuecomment-875280958


   
   ## CI report:
   
   * 2d22335c215ed620ce20018b1c83be189b7c70c6 UNKNOWN
   * 230205edfab190cfaf687d0323ae8d704f425e1d UNKNOWN
   * e689b18e9261a07f6eeaf109a11237e89a218d5b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1181)
 
   * cc99aa727399086ab215ce85db6f615711a7816f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   




> Re-implement marker files via timeline server
> -
>
> Key: HUDI-1138
> URL: https://issues.apache.org/jira/browse/HUDI-1138
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Even if one can argue that RFC-15/consolidated metadata removes the need for 
> deleting partial files written due to Spark task failures/stage retries, it 
> will still leave extra files inside the table (and users will pay for them 
> every month), so we need the marker mechanism to be able to delete these 
> partial files. 
> Here we explore whether we can improve the current marker file mechanism, which 
> creates one marker file per data file written, by delegating the createMarker() 
> call to the driver/timeline server and having it write the marker metadata into 
> a single file handle that is flushed for durability guarantees.
>  
> P.S.: I was tempted to think the Spark listener mechanism could help us deal with 
> failed tasks, but it has no guarantees; the writer job could die without 
> deleting a partial file. That is, it can improve things, but it cannot provide 
> guarantees. 
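
A minimal local-filesystem sketch of the "single file handle" idea described above (class and method names are invented; the real implementation would sit behind the driver/timeline server and use the Hadoop FileSystem API rather than java.nio):

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/**
 * Sketch: instead of creating one empty marker file per data file, the central
 * component appends one line per marker to a single consolidated file and
 * flushes it for durability.
 */
public class ConsolidatedMarkerWriterSketch {
  private final Path markersFile;

  public ConsolidatedMarkerWriterSketch(String markerDir) throws IOException {
    Path dir = Paths.get(markerDir);
    Files.createDirectories(dir);
    this.markersFile = dir.resolve("MARKERS0");
  }

  /** Append a marker entry; CREATE + APPEND + SYNC gives a durable, single-file write. */
  public synchronized void createMarker(String partitionPath, String dataFileName) throws IOException {
    String entry = partitionPath + "/" + dataFileName + "\n";
    Files.write(markersFile, entry.getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND, StandardOpenOption.SYNC);
  }
}
{code}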



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3233: [HUDI-1138] Add timeline-server-based marker file strategy for improving marker-related latency

2021-07-30 Thread GitBox


hudi-bot edited a comment on pull request #3233:
URL: https://github.com/apache/hudi/pull/3233#issuecomment-875280958


   
   ## CI report:
   
   * 2d22335c215ed620ce20018b1c83be189b7c70c6 UNKNOWN
   * 230205edfab190cfaf687d0323ae8d704f425e1d UNKNOWN
   * e689b18e9261a07f6eeaf109a11237e89a218d5b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1181)
 
   * cc99aa727399086ab215ce85db6f615711a7816f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1771) Propagate CDC format for hoodie

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390883#comment-17390883
 ] 

ASF GitHub Bot commented on HUDI-1771:
--

hudi-bot edited a comment on pull request #3285:
URL: https://github.com/apache/hudi/pull/3285#issuecomment-881141261


   
   ## CI report:
   
   * 4660e96db4081115eaa7877b8584466347f78fea UNKNOWN
   * dba29b278f11ed80375f325d6d40b790a6498266 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1279)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Propagate CDC format for hoodie
> ---
>
> Key: HUDI-1771
> URL: https://issues.apache.org/jira/browse/HUDI-1771
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Zheng yunhong
>Priority: Major
>  Labels: pull-request-available, sev:normal
> Fix For: 0.9.0
>
>
> As discussed on the dev mailing list: 
> https://lists.apache.org/thread.html/r31b2d1404e4e043a5f875b78105ba6f9a801e78f265ad91242ad5eb2%40%3Cdev.hudi.apache.org%3E
> Keeping the change flags makes new use cases possible: using HUDI as the unified 
> storage format for the DWD and DWS layers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3285: [HUDI-1771] Propagate CDC format for hoodie

2021-07-30 Thread GitBox


hudi-bot edited a comment on pull request #3285:
URL: https://github.com/apache/hudi/pull/3285#issuecomment-881141261


   
   ## CI report:
   
   * 4660e96db4081115eaa7877b8584466347f78fea UNKNOWN
   * dba29b278f11ed80375f325d6d40b790a6498266 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1279)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2141) Integration flink metric in flink stream

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390882#comment-17390882
 ] 

ASF GitHub Bot commented on HUDI-2141:
--

hudi-bot edited a comment on pull request #3235:
URL: https://github.com/apache/hudi/pull/3235#issuecomment-875512974


   
   ## CI report:
   
   * 8018dce8bb833468882f26dff112ed6136681bf3 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1280)
 
   * 3250cbc21c3714294e80eb796d79cebdab060c56 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Integration flink metric in flink stream
> 
>
> Key: HUDI-2141
> URL: https://issues.apache.org/jira/browse/HUDI-2141
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: yuzhaojing
>Assignee: yuzhaojing
>Priority: Major
>  Labels: pull-request-available
>
> Hoodie metrics currently cannot work in the Flink streaming path because they were 
> designed for batch processing; integrate the Flink metric system into the Flink stream.
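
For background on what integrating a Flink metric into a streaming operator usually looks like, a generic sketch follows (not Hudi's actual implementation; the operator and metric names are invented):

{code:java}
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

/**
 * The counter is registered against the operator's metric group in open()
 * and updated per record, so it is reported by whatever metrics reporter
 * the Flink job is configured with.
 */
public class RecordCountingMapFunction extends RichMapFunction<String, String> {
  private transient Counter processedRecords;

  @Override
  public void open(Configuration parameters) {
    processedRecords = getRuntimeContext().getMetricGroup().counter("processedRecords");
  }

  @Override
  public String map(String value) {
    processedRecords.inc();
    return value;
  }
}
{code}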



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3235: [HUDI-2141] Integration flink metric in flink stream

2021-07-30 Thread GitBox


hudi-bot edited a comment on pull request #3235:
URL: https://github.com/apache/hudi/pull/3235#issuecomment-875512974


   
   ## CI report:
   
   * 8018dce8bb833468882f26dff112ed6136681bf3 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1280)
 
   * 3250cbc21c3714294e80eb796d79cebdab060c56 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2141) Integration flink metric in flink stream

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390881#comment-17390881
 ] 

ASF GitHub Bot commented on HUDI-2141:
--

hudi-bot edited a comment on pull request #3235:
URL: https://github.com/apache/hudi/pull/3235#issuecomment-875512974


   
   ## CI report:
   
   * 12f9fc5e0391242aaddc4e0c09ada8ee3c745a47 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=819)
 
   * 8018dce8bb833468882f26dff112ed6136681bf3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1280)
 
   * 3250cbc21c3714294e80eb796d79cebdab060c56 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Integration flink metric in flink stream
> 
>
> Key: HUDI-2141
> URL: https://issues.apache.org/jira/browse/HUDI-2141
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: yuzhaojing
>Assignee: yuzhaojing
>Priority: Major
>  Labels: pull-request-available
>
> Hoodie metrics currently cannot work in the Flink streaming path because they were 
> designed for batch processing; integrate the Flink metric system into the Flink stream.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3235: [HUDI-2141] Integration flink metric in flink stream

2021-07-30 Thread GitBox


hudi-bot edited a comment on pull request #3235:
URL: https://github.com/apache/hudi/pull/3235#issuecomment-875512974


   
   ## CI report:
   
   * 12f9fc5e0391242aaddc4e0c09ada8ee3c745a47 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=819)
 
   * 8018dce8bb833468882f26dff112ed6136681bf3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1280)
 
   * 3250cbc21c3714294e80eb796d79cebdab060c56 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2141) Integration flink metric in flink stream

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390878#comment-17390878
 ] 

ASF GitHub Bot commented on HUDI-2141:
--

hudi-bot edited a comment on pull request #3235:
URL: https://github.com/apache/hudi/pull/3235#issuecomment-875512974


   
   ## CI report:
   
   * 12f9fc5e0391242aaddc4e0c09ada8ee3c745a47 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=819)
 
   * 8018dce8bb833468882f26dff112ed6136681bf3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1280)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Integration flink metric in flink stream
> 
>
> Key: HUDI-2141
> URL: https://issues.apache.org/jira/browse/HUDI-2141
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: yuzhaojing
>Assignee: yuzhaojing
>Priority: Major
>  Labels: pull-request-available
>
> Hoodie metrics currently cannot work in the Flink streaming path because they were 
> designed for batch processing; integrate the Flink metric system into the Flink stream.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3235: [HUDI-2141] Integration flink metric in flink stream

2021-07-30 Thread GitBox


hudi-bot edited a comment on pull request #3235:
URL: https://github.com/apache/hudi/pull/3235#issuecomment-875512974


   
   ## CI report:
   
   * 12f9fc5e0391242aaddc4e0c09ada8ee3c745a47 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=819)
 
   * 8018dce8bb833468882f26dff112ed6136681bf3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1280)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2141) Integration flink metric in flink stream

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390877#comment-17390877
 ] 

ASF GitHub Bot commented on HUDI-2141:
--

hudi-bot edited a comment on pull request #3235:
URL: https://github.com/apache/hudi/pull/3235#issuecomment-875512974


   
   ## CI report:
   
   * 12f9fc5e0391242aaddc4e0c09ada8ee3c745a47 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=819)
 
   * 8018dce8bb833468882f26dff112ed6136681bf3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Integration flink metric in flink stream
> 
>
> Key: HUDI-2141
> URL: https://issues.apache.org/jira/browse/HUDI-2141
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: yuzhaojing
>Assignee: yuzhaojing
>Priority: Major
>  Labels: pull-request-available
>
> Hoodie metrics currently cannot work in the Flink streaming path because they were 
> designed for batch processing; integrate the Flink metric system into the Flink stream.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3235: [HUDI-2141] Integration flink metric in flink stream

2021-07-30 Thread GitBox


hudi-bot edited a comment on pull request #3235:
URL: https://github.com/apache/hudi/pull/3235#issuecomment-875512974


   
   ## CI report:
   
   * 12f9fc5e0391242aaddc4e0c09ada8ee3c745a47 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=819)
 
   * 8018dce8bb833468882f26dff112ed6136681bf3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2257) Add a note to set keygenerator class while deleting data

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390869#comment-17390869
 ] 

ASF GitHub Bot commented on HUDI-2257:
--

veenaypatil commented on pull request #3375:
URL: https://github.com/apache/hudi/pull/3375#issuecomment-890289062


   @nsivabalan updated .md files 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add a note to set keygenerator class while deleting data
> 
>
> Key: HUDI-2257
> URL: https://issues.apache.org/jira/browse/HUDI-2257
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Vinay
>Assignee: Vinay
>Priority: Minor
>  Labels: pull-request-available
>
> Copying the examples from this blog 
> [https://hudi.apache.org/blog/delete-support-in-hudi/] does not work as-is 
> for a non-partitioned table; the user has to explicitly set the following option 
> in order for the delete to work:
> {code:java}
> option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.NonpartitionedKeyGenerator")
> {code}
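
In practice the note amounts to something like the following minimal Spark (Java) sketch; the table name, path, and field names are hypothetical, and only the key generator option comes from the issue:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class DeleteFromNonPartitionedTable {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-delete-example")
        .master("local[2]")
        .getOrCreate();

    String basePath = "file:///tmp/hudi_non_partitioned_table"; // hypothetical table location

    // Records to delete: only the record keys (and precombine field) are needed.
    Dataset<Row> toDelete = spark.read().format("hudi").load(basePath)
        .select("uuid", "ts")
        .limit(2);

    toDelete.write().format("hudi")
        .option("hoodie.table.name", "hudi_non_partitioned_table")
        .option("hoodie.datasource.write.operation", "delete")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.precombine.field", "ts")
        // Per the issue above: without this option the delete does not take effect
        // on a non-partitioned table.
        .option("hoodie.datasource.write.keygenerator.class",
            "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
        .mode(SaveMode.Append)
        .save(basePath);

    spark.stop();
  }
}
{code}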



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] veenaypatil commented on pull request #3375: [HUDI-2257] Adding note to set Keygen class while deleting data

2021-07-30 Thread GitBox


veenaypatil commented on pull request #3375:
URL: https://github.com/apache/hudi/pull/3375#issuecomment-890289062


   @nsivabalan updated .md files 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1771) Propagate CDC format for hoodie

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390865#comment-17390865
 ] 

ASF GitHub Bot commented on HUDI-1771:
--

hudi-bot edited a comment on pull request #3285:
URL: https://github.com/apache/hudi/pull/3285#issuecomment-881141261


   
   ## CI report:
   
   * 4660e96db4081115eaa7877b8584466347f78fea UNKNOWN
   * f402b2e5749ed4f6af1bd1ed67ea38318e5261ca Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1274)
 
   * dba29b278f11ed80375f325d6d40b790a6498266 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1279)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Propagate CDC format for hoodie
> ---
>
> Key: HUDI-1771
> URL: https://issues.apache.org/jira/browse/HUDI-1771
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Zheng yunhong
>Priority: Major
>  Labels: pull-request-available, sev:normal
> Fix For: 0.9.0
>
>
> As discussed on the dev mailing list: 
> https://lists.apache.org/thread.html/r31b2d1404e4e043a5f875b78105ba6f9a801e78f265ad91242ad5eb2%40%3Cdev.hudi.apache.org%3E
> Keeping the change flags makes new use cases possible: using HUDI as the unified 
> storage format for the DWD and DWS layers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3285: [HUDI-1771] Propagate CDC format for hoodie

2021-07-30 Thread GitBox


hudi-bot edited a comment on pull request #3285:
URL: https://github.com/apache/hudi/pull/3285#issuecomment-881141261


   
   ## CI report:
   
   * 4660e96db4081115eaa7877b8584466347f78fea UNKNOWN
   * f402b2e5749ed4f6af1bd1ed67ea38318e5261ca Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1274)
 
   * dba29b278f11ed80375f325d6d40b790a6498266 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1279)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1771) Propagate CDC format for hoodie

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390860#comment-17390860
 ] 

ASF GitHub Bot commented on HUDI-1771:
--

hudi-bot edited a comment on pull request #3285:
URL: https://github.com/apache/hudi/pull/3285#issuecomment-881141261


   
   ## CI report:
   
   * 4660e96db4081115eaa7877b8584466347f78fea UNKNOWN
   * f402b2e5749ed4f6af1bd1ed67ea38318e5261ca Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1274)
 
   * dba29b278f11ed80375f325d6d40b790a6498266 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Propagate CDC format for hoodie
> ---
>
> Key: HUDI-1771
> URL: https://issues.apache.org/jira/browse/HUDI-1771
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Zheng yunhong
>Priority: Major
>  Labels: pull-request-available, sev:normal
> Fix For: 0.9.0
>
>
> As discussed on the dev mailing list: 
> https://lists.apache.org/thread.html/r31b2d1404e4e043a5f875b78105ba6f9a801e78f265ad91242ad5eb2%40%3Cdev.hudi.apache.org%3E
> Keeping the change flags makes new use cases possible: using HUDI as the unified 
> storage format for the DWD and DWS layers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3285: [HUDI-1771] Propagate CDC format for hoodie

2021-07-30 Thread GitBox


hudi-bot edited a comment on pull request #3285:
URL: https://github.com/apache/hudi/pull/3285#issuecomment-881141261


   
   ## CI report:
   
   * 4660e96db4081115eaa7877b8584466347f78fea UNKNOWN
   * f402b2e5749ed4f6af1bd1ed67ea38318e5261ca Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1274)
 
   * dba29b278f11ed80375f325d6d40b790a6498266 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2247) Filter file where length less than parquet MAGIC length

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390859#comment-17390859
 ] 

ASF GitHub Bot commented on HUDI-2247:
--

yuzhaojing closed pull request #3363:
URL: https://github.com/apache/hudi/pull/3363


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Filter file where length less than parquet MAGIC length
> ---
>
> Key: HUDI-2247
> URL: https://issues.apache.org/jira/browse/HUDI-2247
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: yuzhaojing
>Assignee: yuzhaojing
>Priority: Major
>  Labels: pull-request-available
>
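
The ticket body is empty, but the title suggests a small guard; below is a sketch of that idea under the assumption of the standard 4-byte Parquet magic "PAR1" (the class and method names are invented and this is not the actual change in the PR):

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.fs.FileStatus;

/**
 * Skip files that cannot possibly be valid Parquet files because they are
 * shorter than the 4-byte "PAR1" magic, e.g. files left behind by a writer
 * that failed before flushing anything.
 */
public class ParquetMagicLengthFilter {
  private static final byte[] PARQUET_MAGIC = "PAR1".getBytes(StandardCharsets.UTF_8);

  public static List<FileStatus> filterReadableFiles(FileStatus[] candidates) {
    return Arrays.stream(candidates)
        .filter(status -> status.getLen() > PARQUET_MAGIC.length)
        .collect(Collectors.toList());
  }
}
{code}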




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-2247) Filter file where length less than parquet MAGIC length

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390858#comment-17390858
 ] 

ASF GitHub Bot commented on HUDI-2247:
--

yuzhaojing opened a new pull request #3363:
URL: https://github.com/apache/hudi/pull/3363


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Filter file where length less than parquet MAGIC length
> ---
>
> Key: HUDI-2247
> URL: https://issues.apache.org/jira/browse/HUDI-2247
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: yuzhaojing
>Assignee: yuzhaojing
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] yuzhaojing closed pull request #3363: [HUDI-2247] Filter file where length less than parquet MAGIC length

2021-07-30 Thread GitBox


yuzhaojing closed pull request #3363:
URL: https://github.com/apache/hudi/pull/3363


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2256) Remove the while loop from BucketAssigner new bucket id algorithm

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390852#comment-17390852
 ] 

ASF GitHub Bot commented on HUDI-2256:
--

hudi-bot edited a comment on pull request #3374:
URL: https://github.com/apache/hudi/pull/3374#issuecomment-889879689


   
   ## CI report:
   
   * 35cea92e2586fd6df21aa8cdf113337813813a89 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1278)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Remove the while loop from BucketAssigner new bucket id algorithm
> -
>
> Key: HUDI-2256
> URL: https://issues.apache.org/jira/browse/HUDI-2256
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3374: [HUDI-2256] Remove the while loop from BucketAssigner new bucket id a…

2021-07-30 Thread GitBox


hudi-bot edited a comment on pull request #3374:
URL: https://github.com/apache/hudi/pull/3374#issuecomment-889879689


   
   ## CI report:
   
   * 35cea92e2586fd6df21aa8cdf113337813813a89 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1278)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (HUDI-2249) [SQL] Changing index type fails

2021-07-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan resolved HUDI-2249.
---
  Assignee: sivabalan narayanan
Resolution: Invalid

> [SQL] Changing index type fails
> ---
>
> Key: HUDI-2249
> URL: https://issues.apache.org/jira/browse/HUDI-2249
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: release-blocker
>
> I tried to set a different index type and it failed. 
>  
> ```
> set hoodie.index.type = SIMPLE
>  
> spark-sql> create table hudi_17Gb_ext1 using hudi location 
> 's3a://siva-test-bucket-june-16/hudi_testing/gh_arch_dump/hudi_5/' options ( 
>          >   type = 'cow', 
>          >   primaryKey = 'randomId', 
>          >   preCombineField = 'date_col' 
>          >  ) 
>          > partitioned by (type) as select * from gh_17Gb_date_col;
> 21/07/29 04:24:23 ERROR SparkSQLDriver: Failed in [create table 
> hudi_17Gb_ext1 using hudi location 
> 's3a://siva-test-bucket-june-16/hudi_testing/gh_arch_dump/hudi_5/' options ( 
>   type = 'cow', 
>   primaryKey = 'randomId', 
>   preCombineField = 'date_col' 
>  ) 
> partitioned by (type) as select * from gh_17Gb_date_col]
> java.lang.IllegalArgumentException: No enum constant 
> org.apache.hudi.index.HoodieIndex.IndexType.SIMPLE
>  
>  
> describe hudi_17Gb_ext
>  at java.lang.Enum.valueOf(Enum.java:238)
>  at org.apache.hudi.index.HoodieIndex$IndexType.valueOf(HoodieIndex.java:106)
>  at 
> org.apache.hudi.config.HoodieIndexConfig$Builder.build(HoodieIndexConfig.java:333)
>  at 
> org.apache.hudi.config.HoodieWriteConfig$Builder.setDefaults(HoodieWriteConfig.java:1608)
>  at 
> org.apache.hudi.config.HoodieWriteConfig$Builder.build(HoodieWriteConfig.java:1650)
>  at 
> org.apache.hudi.DataSourceUtils.createHoodieConfig(DataSourceUtils.java:196)
>  at 
> org.apache.hudi.DataSourceUtils.createHoodieClient(DataSourceUtils.java:201)
>  at 
> org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$write$5(HoodieSparkSqlWriter.scala:183)
>  at scala.Option.getOrElse(Option.scala:189)
>  at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:182)
>  at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:97)
>  at 
> org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:86)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
>  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
>  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
>  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
>  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
> ```
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-2249) [SQL] Changing index type fails

2021-07-30 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390850#comment-17390850
 ] 

sivabalan narayanan commented on HUDI-2249:
---

Actually, this works for me; not sure if I missed something earlier when I ran 
into this. Closing this for now. 

> [SQL] Changing index type fails
> ---
>
> Key: HUDI-2249
> URL: https://issues.apache.org/jira/browse/HUDI-2249
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Priority: Blocker
>  Labels: release-blocker
>
> I tried to set a different index type and it failed. 
>  
> ```
> set hoodie.index.type = SIMPLE
>  
> spark-sql> create table hudi_17Gb_ext1 using hudi location 
> 's3a://siva-test-bucket-june-16/hudi_testing/gh_arch_dump/hudi_5/' options ( 
>          >   type = 'cow', 
>          >   primaryKey = 'randomId', 
>          >   preCombineField = 'date_col' 
>          >  ) 
>          > partitioned by (type) as select * from gh_17Gb_date_col;
> 21/07/29 04:24:23 ERROR SparkSQLDriver: Failed in [create table 
> hudi_17Gb_ext1 using hudi location 
> 's3a://siva-test-bucket-june-16/hudi_testing/gh_arch_dump/hudi_5/' options ( 
>   type = 'cow', 
>   primaryKey = 'randomId', 
>   preCombineField = 'date_col' 
>  ) 
> partitioned by (type) as select * from gh_17Gb_date_col]
> java.lang.IllegalArgumentException: No enum constant 
> org.apache.hudi.index.HoodieIndex.IndexType.SIMPLE
>  
>  
> describe hudi_17Gb_ext
>  at java.lang.Enum.valueOf(Enum.java:238)
>  at org.apache.hudi.index.HoodieIndex$IndexType.valueOf(HoodieIndex.java:106)
>  at 
> org.apache.hudi.config.HoodieIndexConfig$Builder.build(HoodieIndexConfig.java:333)
>  at 
> org.apache.hudi.config.HoodieWriteConfig$Builder.setDefaults(HoodieWriteConfig.java:1608)
>  at 
> org.apache.hudi.config.HoodieWriteConfig$Builder.build(HoodieWriteConfig.java:1650)
>  at 
> org.apache.hudi.DataSourceUtils.createHoodieConfig(DataSourceUtils.java:196)
>  at 
> org.apache.hudi.DataSourceUtils.createHoodieClient(DataSourceUtils.java:201)
>  at 
> org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$write$5(HoodieSparkSqlWriter.scala:183)
>  at scala.Option.getOrElse(Option.scala:189)
>  at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:182)
>  at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:97)
>  at 
> org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:86)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
>  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
>  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
>  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
>  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
> ```
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2249) [SQL] Changing index type fails

2021-07-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-2249:
--
Status: In Progress  (was: Open)

> [SQL] Changing index type fails
> ---
>
> Key: HUDI-2249
> URL: https://issues.apache.org/jira/browse/HUDI-2249
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Priority: Blocker
>  Labels: release-blocker
>
> I tried to set a different index type and it failed. 
>  
> ```
> set hoodie.index.type = SIMPLE
>  
> spark-sql> create table hudi_17Gb_ext1 using hudi location 
> 's3a://siva-test-bucket-june-16/hudi_testing/gh_arch_dump/hudi_5/' options ( 
>          >   type = 'cow', 
>          >   primaryKey = 'randomId', 
>          >   preCombineField = 'date_col' 
>          >  ) 
>          > partitioned by (type) as select * from gh_17Gb_date_col;
> 21/07/29 04:24:23 ERROR SparkSQLDriver: Failed in [create table 
> hudi_17Gb_ext1 using hudi location 
> 's3a://siva-test-bucket-june-16/hudi_testing/gh_arch_dump/hudi_5/' options ( 
>   type = 'cow', 
>   primaryKey = 'randomId', 
>   preCombineField = 'date_col' 
>  ) 
> partitioned by (type) as select * from gh_17Gb_date_col]
> java.lang.IllegalArgumentException: No enum constant 
> org.apache.hudi.index.HoodieIndex.IndexType.SIMPLE
>  
>  
> describe hudi_17Gb_ext
>  at java.lang.Enum.valueOf(Enum.java:238)
>  at org.apache.hudi.index.HoodieIndex$IndexType.valueOf(HoodieIndex.java:106)
>  at 
> org.apache.hudi.config.HoodieIndexConfig$Builder.build(HoodieIndexConfig.java:333)
>  at 
> org.apache.hudi.config.HoodieWriteConfig$Builder.setDefaults(HoodieWriteConfig.java:1608)
>  at 
> org.apache.hudi.config.HoodieWriteConfig$Builder.build(HoodieWriteConfig.java:1650)
>  at 
> org.apache.hudi.DataSourceUtils.createHoodieConfig(DataSourceUtils.java:196)
>  at 
> org.apache.hudi.DataSourceUtils.createHoodieClient(DataSourceUtils.java:201)
>  at 
> org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$write$5(HoodieSparkSqlWriter.scala:183)
>  at scala.Option.getOrElse(Option.scala:189)
>  at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:182)
>  at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:97)
>  at 
> org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:86)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
>  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
>  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
>  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
>  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
> ```
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-2232) [SQL] MERGE INTO fails with table having nested struct and partioned by

2021-07-30 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390849#comment-17390849
 ] 

sivabalan narayanan commented on HUDI-2232:
---

I could also reproduce this

> [SQL] MERGE INTO fails with table having nested struct and partioned by
> ---
>
> Key: HUDI-2232
> URL: https://issues.apache.org/jira/browse/HUDI-2232
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Sagar Sumit
>Assignee: pengzhiwei
>Priority: Blocker
>  Labels: release-blocker
> Fix For: 0.9.0
>
>
> {code:java}
> // TO reproduce
> drop table if exists hudi_gh_ext_fixed;
> create table hudi_gh_ext_fixed (  id int,   name string,   price double,   ts 
> long,   repo struct<id:int,name:string>) using hudi options(primaryKey = 
> 'id', precombineField = 'ts') location 'file:///tmp/hudi-h5-fixed';
> insert into hudi_gh_ext_fixed values(3, 'AMZN', 300, 120, 
> struct(234273476,"onnet/onnet-portal"));
> insert into hudi_gh_ext_fixed values(2, 'UBER', 300, 120, 
> struct(234273476,"onnet/onnet-portal"));
> insert into hudi_gh_ext_fixed values(4, 'GOOG', 300, 120, 
> struct(234273476,"onnet/onnet-portal"));
> update hudi_gh_ext_fixed set price = 150.0 where name = 'UBER';
> drop table if exists hudi_fixed;
> create table hudi_fixed (  id int,   name string,   price double,   ts long,  
>  repo struct<id:int,name:string>) using hudi options(primaryKey = 'id', 
> precombineField = 'ts') partitioned by (ts) location 
> 'file:///tmp/hudi-h5-part-fixed';
> insert into hudi_fixed values(2, 'UBER', 200, 
> struct(234273476,"onnet/onnet-portal"), 130);
> select * from hudi_gh_ext_fixed;
> 20210727145240  20210727145240_0_6442266  id:3
> 77fc2e3e-add9-4f08-a5e1-9671d66add26-0_0-1472-72063_20210727145240.parquet  3 
> AMZN  300.0 120 {"id":234273476,"name":"onnet/onnet-portal"}20210727145301  
> 20210727145301_0_6442269  id:2
> 77fc2e3e-add9-4f08-a5e1-9671d66add26-0_0-1565-77094_20210727145301.parquet  2 
> UBER  150.0 120 {"id":234273476,"name":"onnet/onnet-portal"}20210727145254  
> 20210727145254_0_6442268  id:4
> 77fc2e3e-add9-4f08-a5e1-9671d66add26-0_0-1534-75283_20210727145254.parquet  4 
> GOOG  300.0 120 {"id":234273476,"name":"onnet/onnet-portal"}
> select * from hudi_fixed;
> 20210727145325  20210727145325_0_6442270  id:2  ts=130  
> ba148271-68b4-40aa-816a-158170446e41-0_0-1595-78703_20210727145325.parquet  2 
> UBER  200.0 {"id":234273476,"name":"onnet/onnet-portal"}  130
> MERGE INTO hudi_fixed USING (select id, name, price, repo, ts from 
> hudi_gh_ext_fixed) updatesON hudi_fixed.id = updates.idWHEN MATCHED THEN  
> UPDATE SET *WHEN NOT MATCHED  THEN INSERT *;
> -- java.lang.IllegalArgumentException: UnSupport StructType yet--  at 
> org.apache.spark.sql.hudi.command.payload.SqlTypedRecord.convert(SqlTypedRecord.scala:122)--
>   at 
> org.apache.spark.sql.hudi.command.payload.SqlTypedRecord.get(SqlTypedRecord.scala:56)--
>   at 
> org.apache.hudi.sql.payload.ExpressionPayloadEvaluator_b695b02a_99b5_479e_8299_507da9b206fd.eval(Unknown
>  Source)--  at 
> org.apache.spark.sql.hudi.command.payload.ExpressionPayload$AvroTypeConvertEvaluator.eval(ExpressionPayload.scala:333)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-2256) Remove the while loop from BucketAssigner new bucket id algorithm

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390848#comment-17390848
 ] 

ASF GitHub Bot commented on HUDI-2256:
--

hudi-bot edited a comment on pull request #3374:
URL: https://github.com/apache/hudi/pull/3374#issuecomment-889879689


   
   ## CI report:
   
   * e01334153b6a9186064bfe7753a4d15903bc427f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1275)
 
   * 35cea92e2586fd6df21aa8cdf113337813813a89 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1278)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Remove the while loop from BucketAssigner new bucket id algorithm
> -
>
> Key: HUDI-2256
> URL: https://issues.apache.org/jira/browse/HUDI-2256
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3374: [HUDI-2256] Remove the while loop from BucketAssigner new bucket id a…

2021-07-30 Thread GitBox


hudi-bot edited a comment on pull request #3374:
URL: https://github.com/apache/hudi/pull/3374#issuecomment-889879689


   
   ## CI report:
   
   * e01334153b6a9186064bfe7753a4d15903bc427f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1275)
 
   * 35cea92e2586fd6df21aa8cdf113337813813a89 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1278)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2256) Remove the while loop from BucketAssigner new bucket id algorithm

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390847#comment-17390847
 ] 

ASF GitHub Bot commented on HUDI-2256:
--

hudi-bot edited a comment on pull request #3374:
URL: https://github.com/apache/hudi/pull/3374#issuecomment-889879689


   
   ## CI report:
   
   * e01334153b6a9186064bfe7753a4d15903bc427f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1275)
 
   * 35cea92e2586fd6df21aa8cdf113337813813a89 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Remove the while loop from BucketAssigner new bucket id algorithm
> -
>
> Key: HUDI-2256
> URL: https://issues.apache.org/jira/browse/HUDI-2256
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3374: [HUDI-2256] Remove the while loop from BucketAssigner new bucket id a…

2021-07-30 Thread GitBox


hudi-bot edited a comment on pull request #3374:
URL: https://github.com/apache/hudi/pull/3374#issuecomment-889879689


   
   ## CI report:
   
   * e01334153b6a9186064bfe7753a4d15903bc427f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1275)
 
   * 35cea92e2586fd6df21aa8cdf113337813813a89 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2248) Unable to shutdown local metastore client

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390842#comment-17390842
 ] 

ASF GitHub Bot commented on HUDI-2248:
--

yanghua commented on a change in pull request #3364:
URL: https://github.com/apache/hudi/pull/3364#discussion_r680292542



##
File path: 
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
##
@@ -295,7 +295,7 @@ public void close() {
 try {
   ddlExecutor.close();
   if (client != null) {
-client.close();
+Hive.closeCurrent();
 client = null;

Review comment:
   @jsbali I just quickly searched the whole project, and it seems there are 
still other cases that call the `close` method directly. Can you fix them 
in passing as well?
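
   For readers outside the PR, a minimal sketch of the suggested close() pattern (the class and field names are invented; only the Hive.closeCurrent() call is taken from the diff):

   ```java
   import org.apache.hadoop.hive.metastore.IMetaStoreClient;
   import org.apache.hadoop.hive.ql.metadata.Hive;

   /**
    * Release the thread-local Hive instance (which owns the metastore
    * connection) instead of only closing the client reference we hold.
    */
   public class HiveClientCloser {
     private IMetaStoreClient client;

     public void close() {
       if (client != null) {
         // Per the linked issue, client.close() alone did not fully shut down the
         // local metastore client; Hive.closeCurrent() tears down the thread-local
         // Hive object and its connection.
         Hive.closeCurrent();
         client = null;
       }
     }
   }
   ```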




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Unable to shutdown local metastore client
> -
>
> Key: HUDI-2248
> URL: https://issues.apache.org/jira/browse/HUDI-2248
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Jagmeet Bali
>Priority: Minor
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/3187



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] yanghua commented on a change in pull request #3364: [HUDI-2248] Fixing the closing of hms client

2021-07-30 Thread GitBox


yanghua commented on a change in pull request #3364:
URL: https://github.com/apache/hudi/pull/3364#discussion_r680292542



##
File path: 
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
##
@@ -295,7 +295,7 @@ public void close() {
 try {
   ddlExecutor.close();
   if (client != null) {
-client.close();
+Hive.closeCurrent();
 client = null;

Review comment:
   @jsbali I just quickly searched the whole project, and it seems there are 
still other cases that call the `close` method directly. Can you fix them 
in passing as well?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2218) Fix missing HoodieWriteStat in HoodieCreateHandle

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2218:

Status: In Progress  (was: Open)

> Fix missing HoodieWriteStat in HoodieCreateHandle
> -
>
> Key: HUDI-2218
> URL: https://issues.apache.org/jira/browse/HUDI-2218
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Gary Li
>Assignee: Gary Li
>Priority: Major
>  Labels: pull-request-available
>
> Some HoodieWriteStat fields computed during runtime were lost (e.g. computing the 
> min event time in the payload); we need to initialize the HoodieWriteStat 
> when initializing the handle.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2218) Fix missing HoodieWriteStat in HoodieCreateHandle

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-2218.
-
Fix Version/s: 0.9.0
   Resolution: Fixed

> Fix missing HoodieWriteStat in HoodieCreateHandle
> -
>
> Key: HUDI-2218
> URL: https://issues.apache.org/jira/browse/HUDI-2218
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Gary Li
>Assignee: Gary Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Some HoodieWriteStat fields computed during runtime were lost (e.g. computing the 
> min event time in the payload); we need to initialize the HoodieWriteStat 
> when initializing the handle.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2253) Reduce CI run time for deltastreamer and bulk insert row writer tests

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-2253.
-
Resolution: Fixed

> Reduce CI run time for deltastreamer and bulk insert row writer tests
> -
>
> Key: HUDI-2253
> URL: https://issues.apache.org/jira/browse/HUDI-2253
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Testing
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Reduce CI run time for deltastreamer and bulk insert row writer tests
>  
> org.apache.hudi.utilities.functional.TestHoodieMultiTableDeltaStreamer
> org.apache.hudi.spark3.internal.TestHoodieDataSourceInternalBatchWrite
> org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer
> org.apache.hudi.spark3.internal.TestHoodieBulkInsertDataInternalWriter



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2117) Unpersist the input rdd after the commit is completed

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-2117.
-
Fix Version/s: 0.9.0
   Resolution: Fixed

> Unpersist the input rdd after the commit is completed
> -
>
> Key: HUDI-2117
> URL: https://issues.apache.org/jira/browse/HUDI-2117
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: XiaoyuGeng
>Assignee: XiaoyuGeng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2117) Unpersist the input rdd after the commit is completed

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2117:

Status: In Progress  (was: Open)

> Unpersist the input rdd after the commit is completed
> -
>
> Key: HUDI-2117
> URL: https://issues.apache.org/jira/browse/HUDI-2117
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: XiaoyuGeng
>Assignee: XiaoyuGeng
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2044) Extend support for rockDB and compression for Spillable map to all consumers of ExternalSpillableMap

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-2044.
-
Fix Version/s: 0.9.0
   Resolution: Fixed

> Extend support for rockDB and compression for Spillable map to all consumers 
> of ExternalSpillableMap
> 
>
> Key: HUDI-2044
> URL: https://issues.apache.org/jira/browse/HUDI-2044
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> # HUDI-2028 only implements rockDb support for Spillable map in 
> HoodieMergeHandle since we are blocked on the configuration refactor PR to 
> land
>  # This ticket will track the implementation to extend rocksDB (and 
> compression for bitcask) support for Spillable Map to all consumers of 
> ExternalSpillableMap.java



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2044) Extend support for rockDB and compression for Spillable map to all consumers of ExternalSpillableMap

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2044:

Status: In Progress  (was: Open)

> Extend support for rockDB and compression for Spillable map to all consumers 
> of ExternalSpillableMap
> 
>
> Key: HUDI-2044
> URL: https://issues.apache.org/jira/browse/HUDI-2044
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Major
>  Labels: pull-request-available
>
> # HUDI-2028 only implements rockDb support for Spillable map in 
> HoodieMergeHandle since we are blocked on the configuration refactor PR to 
> land
>  # This ticket will track the implementation to extend rocksDB (and 
> compression for bitcask) support for Spillable Map to all consumers of 
> ExternalSpillableMap.java



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2223) Fix Alter Partitioned Table Failed

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2223:

Fix Version/s: 0.9.0

> Fix Alter Partitioned Table Failed
> --
>
> Key: HUDI-2223
> URL: https://issues.apache.org/jira/browse/HUDI-2223
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Fix crash when adding a column to a partitioned table:
> {code:java}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Partition column 
> name dt conflicts with table columns.
>   at 
> org.apache.hadoop.hive.ql.metadata.Table.validateColumns(Table.java:962)
>   at 
> org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:216)
>   at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:495)
>   at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2217) Fix no value present in incremental query on MOR

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-2217.
-
Resolution: Fixed

> Fix no value present in incremental query on MOR
> 
>
> Key: HUDI-2217
> URL: https://issues.apache.org/jira/browse/HUDI-2217
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Gary Li
>Assignee: Gary Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2214) residual temporary files after clustering are not cleaned up

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2214:

Status: In Progress  (was: Open)

> residual temporary files after clustering are not cleaned up
> 
>
> Key: HUDI-2214
> URL: https://issues.apache.org/jira/browse/HUDI-2214
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Cleaner
>Affects Versions: 0.8.0
> Environment: spark3.1.1
> hadoop3.1.1
>Reporter: tao meng
>Assignee: tao meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> residual temporary files after clustering are not cleaned up
> // test step
> step1: do clustering
> val records1 = recordsToStrings(dataGen.generateInserts("001", 1000)).toList
> val inputDF1: Dataset[Row] = 
> spark.read.json(spark.sparkContext.parallelize(records1, 2))
> inputDF1.write.format("org.apache.hudi")
>  .options(commonOpts)
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY.key(), 
> DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY.key(), 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>  // option for clustering
>  .option("hoodie.parquet.small.file.limit", "0")
>  .option("hoodie.clustering.inline", "true")
>  .option("hoodie.clustering.inline.max.commits", "1")
>  .option("hoodie.clustering.plan.strategy.target.file.max.bytes", 
> "1073741824")
>  .option("hoodie.clustering.plan.strategy.small.file.limit", "629145600")
>  .option("hoodie.clustering.plan.strategy.max.bytes.per.group", 
> Long.MaxValue.toString)
>  .option("hoodie.clustering.plan.strategy.target.file.max.bytes", 
> String.valueOf(12 *1024 * 1024L))
>  .option("hoodie.clustering.plan.strategy.sort.columns", "begin_lat, 
> begin_lon")
>  .mode(SaveMode.Overwrite)
>  .save(basePath)
> step2: check the temp dir, we find 
> /tmp/junit1835474867260509758/dataset/.hoodie/.temp/ is not empty
> /tmp/junit1835474867260509758/dataset/.hoodie/.temp/20210723171208
> is not cleaned up.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2214) residual temporary files after clustering are not cleaned up

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-2214.
-
Resolution: Fixed

> residual temporary files after clustering are not cleaned up
> 
>
> Key: HUDI-2214
> URL: https://issues.apache.org/jira/browse/HUDI-2214
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Cleaner
>Affects Versions: 0.8.0
> Environment: spark3.1.1
> hadoop3.1.1
>Reporter: tao meng
>Assignee: tao meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> residual temporary files after clustering are not cleaned up
> // test step
> step1: do clustering
> val records1 = recordsToStrings(dataGen.generateInserts("001", 1000)).toList
> val inputDF1: Dataset[Row] = 
> spark.read.json(spark.sparkContext.parallelize(records1, 2))
> inputDF1.write.format("org.apache.hudi")
>  .options(commonOpts)
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY.key(), 
> DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY.key(), 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>  // option for clustering
>  .option("hoodie.parquet.small.file.limit", "0")
>  .option("hoodie.clustering.inline", "true")
>  .option("hoodie.clustering.inline.max.commits", "1")
>  .option("hoodie.clustering.plan.strategy.target.file.max.bytes", 
> "1073741824")
>  .option("hoodie.clustering.plan.strategy.small.file.limit", "629145600")
>  .option("hoodie.clustering.plan.strategy.max.bytes.per.group", 
> Long.MaxValue.toString)
>  .option("hoodie.clustering.plan.strategy.target.file.max.bytes", 
> String.valueOf(12 *1024 * 1024L))
>  .option("hoodie.clustering.plan.strategy.sort.columns", "begin_lat, 
> begin_lon")
>  .mode(SaveMode.Overwrite)
>  .save(basePath)
> step2: check the temp dir, we find 
> /tmp/junit1835474867260509758/dataset/.hoodie/.temp/ is not empty
> /tmp/junit1835474867260509758/dataset/.hoodie/.temp/20210723171208
> is not cleaned up.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1848) Add support for HMS in Hive-sync-tool

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-1848.
-
Fix Version/s: 0.9.0
   Resolution: Fixed

> Add support for HMS in Hive-sync-tool
> -
>
> Key: HUDI-1848
> URL: https://issues.apache.org/jira/browse/HUDI-1848
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Jagmeet Bali
>Priority: Minor
>  Labels: pull-request-available, sev:normal
> Fix For: 0.9.0
>
>
> Add support for HMS in Hive-sync-tool
> Currently there are two ways to run DDL queries in hive-sync-tool. 
> This work adds on top of 
> [https://github.com/apache/hudi/pull/2532|https://github.com/apache/hudi/pull/2532/files]
> and adds a pluggable way to run 
> DDL queries using HMS. 
>  
> Different DDL executors can be selected via different syncConfig options:
> useJDBC true -> JDBCExecutor will be used
> useJDBC false -> QlHiveQueryExecutor will be used
> useHMS true -> HMSDDLExecutor will be used.
>  
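
A minimal sketch of the selection logic implied by the options above (the executor names come from the description; the precedence and constructor arguments are assumptions, not the actual Hudi code):

{code:java}
// Illustrative only: pick a DDL executor from the sync config flags described above.
// Assumed precedence: useHMS wins, then useJDBC, otherwise fall back to the QL executor.
DDLExecutor ddlExecutor;
if (syncConfig.useHMS) {
  ddlExecutor = new HMSDDLExecutor(syncConfig);      // runs DDL through the Hive metastore API
} else if (syncConfig.useJDBC) {
  ddlExecutor = new JDBCExecutor(syncConfig);        // runs DDL over a JDBC connection
} else {
  ddlExecutor = new QlHiveQueryExecutor(syncConfig); // runs DDL through the Hive QL driver
}
{code}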



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1848) Add support for HMS in Hive-sync-tool

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1848:

Status: In Progress  (was: Open)

> Add support for HMS in Hive-sync-tool
> -
>
> Key: HUDI-1848
> URL: https://issues.apache.org/jira/browse/HUDI-1848
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Jagmeet Bali
>Priority: Minor
>  Labels: pull-request-available, sev:normal
>
> Add support for HMS in Hive-sync-tool
> Currently there are two ways to run DDL queries in hive-sync-tool. 
> This work adds on top of 
> [https://github.com/apache/hudi/pull/2532|https://github.com/apache/hudi/pull/2532/files]
> and adds a pluggable way to run 
> DDL queries using HMS. 
>  
> Different DDL executors can be selected via different syncConfig options:
> useJDBC true -> JDBCExecutor will be used
> useJDBC false -> QlHiveQueryExecutor will be used
> useHMS true -> HMSDDLExecutor will be used.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2139) MergeInto MOR Table May Result InCorrect Result

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-2139.
-
Resolution: Fixed

> MergeInto MOR Table May Result InCorrect Result
> ---
>
> Key: HUDI-2139
> URL: https://issues.apache.org/jira/browse/HUDI-2139
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently we process all the update-actions and insert-actions in 
> ExpressionPayload#getInsertValue without knowing whether the record is matched or 
> not matched for a MOR table. This may result in an incorrect merge result, e.g.
> {code:java}
> Merge into h0
> using (select 2 as id, 'a1' as name, 10 as price from s) s0
> on h0.id = s0.id
> when matched and s0.id = 1 then update set id = s0.id, name = s0.name, price 
> = 10
> when not matched and s0.id = 2 then insert (id,name,price) values(id,name, 
> 20){code}
> If id = 2 matches the target table h0 but does not match the 
> update condition (s0.id = 1), it should not update the table. However, 
> since we currently cannot know the matched state of the input record, it goes 
> to the not-matched actions and finally updates the price to 20. This is 
> incorrect.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2045) Support Read Hoodie As DataSource Table For Flink And DeltaStreamer

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-2045.
-
Resolution: Fixed

> Support Read Hoodie As DataSource Table For Flink And DeltaStreamer
> ---
>
> Key: HUDI-2045
> URL: https://issues.apache.org/jira/browse/HUDI-2045
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently we only support reading a hoodie table as a datasource table for Spark, 
> since [https://github.com/apache/hudi/pull/2283]
> In order to support this feature for Flink and DeltaStreamer, we need to sync 
> the Spark table properties needed by the datasource table to the metastore in 
> HiveSyncTool.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2045) Support Read Hoodie As DataSource Table For Flink And DeltaStreamer

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2045:

Status: In Progress  (was: Open)

> Support Read Hoodie As DataSource Table For Flink And DeltaStreamer
> ---
>
> Key: HUDI-2045
> URL: https://issues.apache.org/jira/browse/HUDI-2045
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently we only support reading a hoodie table as a datasource table for Spark, 
> since [https://github.com/apache/hudi/pull/2283]
> In order to support this feature for Flink and DeltaStreamer, we need to sync 
> the Spark table properties needed by the datasource table to the metastore in 
> HiveSyncTool.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2139) MergeInto MOR Table May Result InCorrect Result

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2139:

Status: In Progress  (was: Open)

> MergeInto MOR Table May Result InCorrect Result
> ---
>
> Key: HUDI-2139
> URL: https://issues.apache.org/jira/browse/HUDI-2139
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently we process all the update-actions and insert-actions in 
> ExpressionPayload#getInsertValue without knowing whether the record is matched or 
> not matched for a MOR table. This may result in an incorrect merge result, e.g.
> {code:java}
> Merge into h0
> using (select 2 as id, 'a1' as name, 10 as price from s) s0
> on h0.id = s0.id
> when matched and s0.id = 1 then update set id = s0.id, name = s0.name, price 
> = 10
> when not matched and s0.id = 2 then insert (id,name,price) values(id,name, 
> 20){code}
> If id = 2 matches the target table h0 but does not match the 
> update condition (s0.id = 1), it should not update the table. However, 
> since we currently cannot know the matched state of the input record, it goes 
> to the not-matched actions and finally updates the price to 20. This is 
> incorrect.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2205) Rollback inflight compaction for flink writer

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2205:

Status: In Progress  (was: Open)

> Rollback inflight compaction for flink writer
> -
>
> Key: HUDI-2205
> URL: https://issues.apache.org/jira/browse/HUDI-2205
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2205) Rollback inflight compaction for flink writer

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-2205.
-
Resolution: Fixed

> Rollback inflight compaction for flink writer
> -
>
> Key: HUDI-2205
> URL: https://issues.apache.org/jira/browse/HUDI-2205
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2195) Sync Hive Failed When Execute CTAS In Spark2 And Spark3

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-2195.
-
Fix Version/s: 0.9.0
   Resolution: Fixed

> Sync Hive Failed When Execute  CTAS In Spark2 And Spark3
> 
>
> Key: HUDI-2195
> URL: https://issues.apache.org/jira/browse/HUDI-2195
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> When executing CTAS in Spark 2, the following exception is thrown:
> {code:java}
> java.lang.NoClassDefFoundError: org/json/JSONException
>   at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)
>   at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10047)
>   at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)
>   at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)
>   at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
>   at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
> {code}
> While executing CTAS in Spark 3, the following exception is thrown:
> {code:java}
> java.lang.NoClassDefFoundError: 
> org/apache/calcite/rel/type/RelDataTypeSystemjava.lang.NoClassDefFoundError: 
> org/apache/calcite/rel/type/RelDataTypeSystem at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzerFactory.get(SemanticAnalyzerFactory.java:318)
>  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:484) at 
> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317) at 
> org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457) at 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237) at 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227) at 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:458)
>  at 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:448)
>  at 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:426)
>  at 
> org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:322) 
> at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:234) at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:179) at 
> org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:130)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2195) Sync Hive Failed When Execute CTAS In Spark2 And Spark3

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2195:

Status: In Progress  (was: Open)

> Sync Hive Failed When Execute  CTAS In Spark2 And Spark3
> 
>
> Key: HUDI-2195
> URL: https://issues.apache.org/jira/browse/HUDI-2195
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>  Labels: pull-request-available
>
> When executing CTAS in Spark 2, the following exception is thrown:
> {code:java}
> java.lang.NoClassDefFoundError: org/json/JSONException
>   at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)
>   at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10047)
>   at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)
>   at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)
>   at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
>   at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
> {code}
> While executing CTAS in Spark 3, the following exception is thrown:
> {code:java}
> java.lang.NoClassDefFoundError: 
> org/apache/calcite/rel/type/RelDataTypeSystemjava.lang.NoClassDefFoundError: 
> org/apache/calcite/rel/type/RelDataTypeSystem at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzerFactory.get(SemanticAnalyzerFactory.java:318)
>  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:484) at 
> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317) at 
> org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457) at 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237) at 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227) at 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:458)
>  at 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:448)
>  at 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:426)
>  at 
> org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:322) 
> at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:234) at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:179) at 
> org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:130)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1860) Add INSERT_OVERWRITE support to DeltaStreamer

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-1860.
-
Fix Version/s: 0.9.0
   Resolution: Fixed

> Add INSERT_OVERWRITE support to DeltaStreamer
> -
>
> Key: HUDI-1860
> URL: https://issues.apache.org/jira/browse/HUDI-1860
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Sagar Sumit
>Assignee: Samrat Deb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> As discussed in [this 
> RFC|https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller],
>  having full fetch mode use insert_overwrite to write to the sink would be 
> better, as it can handle schema changes. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1860) Add INSERT_OVERWRITE support to DeltaStreamer

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1860:

Status: In Progress  (was: Open)

> Add INSERT_OVERWRITE support to DeltaStreamer
> -
>
> Key: HUDI-1860
> URL: https://issues.apache.org/jira/browse/HUDI-1860
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Sagar Sumit
>Assignee: Samrat Deb
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> As discussed in [this 
> RFC|https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller],
>  having full fetch mode use insert_overwrite to write to the sink would be 
> better, as it can handle schema changes. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1447) DeltaStreamer kafka source supports consuming from specified timestamp

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-1447.
-
Fix Version/s: 0.9.0
   Resolution: Fixed

> DeltaStreamer kafka source supports consuming from specified timestamp
> --
>
> Key: HUDI-1447
> URL: https://issues.apache.org/jira/browse/HUDI-1447
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: wangxianghu#1
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available, sev:high, user-support-issues
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1447) DeltaStreamer kafka source supports consuming from specified timestamp

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1447:

Status: In Progress  (was: Open)

> DeltaStreamer kafka source supports consuming from specified timestamp
> --
>
> Key: HUDI-1447
> URL: https://issues.apache.org/jira/browse/HUDI-1447
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: wangxianghu#1
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available, sev:high, user-support-issues
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1633) Make callback return HoodieWriteStat

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1633:

Fix Version/s: 0.9.0

> Make callback return HoodieWriteStat
> 
>
> Key: HUDI-1633
> URL: https://issues.apache.org/jira/browse/HUDI-1633
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Minor
>  Labels: pull-request-available, sev:normal
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2029) Implement compression for DiskBasedMap in Spillable Map

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-2029.
-
Fix Version/s: 0.9.0
   Resolution: Fixed

> Implement compression for DiskBasedMap in Spillable Map
> ---
>
> Key: HUDI-2029
> URL: https://issues.apache.org/jira/browse/HUDI-2029
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Performance
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Implement compression for DiskBasedMap in Spillable Map 
> Without compression, DiskBasedMap causes more spilling to disk than 
> RocksDB.
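
As a rough illustration of the idea only (not Hudi's actual codec or configuration), compressing the serialized value bytes before they are written to the spill file and decompressing them on read back is enough to shrink the on-disk footprint:

{code:java}
// Illustrative sketch: gzip-compress a serialized record before spilling it to
// disk and decompress it when reading it back (requires Java 9+ for readAllBytes).
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public final class SpillCompressionSketch {
  static byte[] compress(byte[] serialized) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
      gzip.write(serialized);
    }
    return bos.toByteArray();
  }

  static byte[] decompress(byte[] compressed) throws IOException {
    try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      return gzip.readAllBytes();
    }
  }
}
{code}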



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1985) Website re-design implementation

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390834#comment-17390834
 ] 

ASF GitHub Bot commented on HUDI-1985:
--

vingov commented on pull request #3366:
URL: https://github.com/apache/hudi/pull/3366#issuecomment-890272138


   @nsivabalan - Can you please review this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Website re-design implementation
> 
>
> Key: HUDI-1985
> URL: https://issues.apache.org/jira/browse/HUDI-1985
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Raymond Xu
>Assignee: Vinoth Govindarajan
>Priority: Blocker
>  Labels: documentation, pull-request-available
> Fix For: 0.9.0
>
>
> To provide better navigation and organization of the Hudi website's info, we have 
> done a re-design of the web pages.
> Previous discussion
> [https://github.com/apache/hudi/issues/2905]
>  
> See the wireframe and final design in 
> [https://www.figma.com/file/tipod1JZRw7anZRWBI6sZh/Hudi.Apache?node-id=32%3A6]
> (log in to Figma to comment)
> The design is ready for implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] vingov commented on pull request #3366: [HUDI-1985] Migrate the hudi site to docusaurus platform (website complete re-design)

2021-07-30 Thread GitBox


vingov commented on pull request #3366:
URL: https://github.com/apache/hudi/pull/3366#issuecomment-890272138


   @nsivabalan - Can you please review this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (HUDI-1828) Ensure All Tests Pass with ORC format

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-1828.
-
Fix Version/s: 0.9.0
 Assignee: Teresa Kang
   Resolution: Fixed

> Ensure All Tests Pass with ORC format
> -
>
> Key: HUDI-1828
> URL: https://issues.apache.org/jira/browse/HUDI-1828
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: Teresa Kang
>Assignee: Teresa Kang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Run all tests with HoodieTableConfig.DEFAULT_BASE_FILE_FORMAT=ORC, ensure all 
> tests pass.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2180) Fix Compile Error For Spark3

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2180:

Fix Version/s: 0.9.0

> Fix Compile Error For Spark3
> 
>
> Key: HUDI-2180
> URL: https://issues.apache.org/jira/browse/HUDI-2180
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1828) Ensure All Tests Pass with ORC format

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1828:

Status: In Progress  (was: Open)

> Ensure All Tests Pass with ORC format
> -
>
> Key: HUDI-1828
> URL: https://issues.apache.org/jira/browse/HUDI-1828
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: Teresa Kang
>Priority: Major
>  Labels: pull-request-available
>
> Run all tests with HoodieTableConfig.DEFAULT_BASE_FILE_FORMAT=ORC, ensure all 
> tests pass.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1771) Propagate CDC format for hoodie

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390831#comment-17390831
 ] 

ASF GitHub Bot commented on HUDI-1771:
--

swuferhong closed pull request #3285:
URL: https://github.com/apache/hudi/pull/3285


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Propagate CDC format for hoodie
> ---
>
> Key: HUDI-1771
> URL: https://issues.apache.org/jira/browse/HUDI-1771
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Zheng yunhong
>Priority: Major
>  Labels: pull-request-available, sev:normal
> Fix For: 0.9.0
>
>
> As we discussed in the dev mailing list: 
> https://lists.apache.org/thread.html/r31b2d1404e4e043a5f875b78105ba6f9a801e78f265ad91242ad5eb2%40%3Cdev.hudi.apache.org%3E
> Keeping the change flags makes new use cases possible: using Hudi as the unified 
> storage format for the DWD and DWS layers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1771) Propagate CDC format for hoodie

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390832#comment-17390832
 ] 

ASF GitHub Bot commented on HUDI-1771:
--

swuferhong opened a new pull request #3285:
URL: https://github.com/apache/hudi/pull/3285


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   Propagate CDC format for hoodie.
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Propagate CDC format for hoodie
> ---
>
> Key: HUDI-1771
> URL: https://issues.apache.org/jira/browse/HUDI-1771
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Zheng yunhong
>Priority: Major
>  Labels: pull-request-available, sev:normal
> Fix For: 0.9.0
>
>
> As we discussed in the dev mailing list: 
> https://lists.apache.org/thread.html/r31b2d1404e4e043a5f875b78105ba6f9a801e78f265ad91242ad5eb2%40%3Cdev.hudi.apache.org%3E
> Keeping the change flags makes new use cases possible: using Hudi as the unified 
> storage format for the DWD and DWS layers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] swuferhong opened a new pull request #3285: [HUDI-1771] Propagate CDC format for hoodie

2021-07-30 Thread GitBox


swuferhong opened a new pull request #3285:
URL: https://github.com/apache/hudi/pull/3285


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   Propagate CDC format for hoodie.
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] swuferhong closed pull request #3285: [HUDI-1771] Propagate CDC format for hoodie

2021-07-30 Thread GitBox


swuferhong closed pull request #3285:
URL: https://github.com/apache/hudi/pull/3285


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (HUDI-1969) Support reading logs for MOR Hive rt table

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-1969.
-
Fix Version/s: 0.9.0
 Assignee: Danny Chen
   Resolution: Fixed

> Support reading logs for MOR Hive rt table
> --
>
> Key: HUDI-1969
> URL: https://issues.apache.org/jira/browse/HUDI-1969
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1969) Support reading logs for MOR Hive rt table

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1969:

Status: In Progress  (was: Open)

> Support reading logs for MOR Hive rt table
> --
>
> Key: HUDI-1969
> URL: https://issues.apache.org/jira/browse/HUDI-1969
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2168) AccessControlException for anonymous user

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2168:

Fix Version/s: 0.9.0

> AccessControlException for anonymous user
> -
>
> Key: HUDI-2168
> URL: https://issues.apache.org/jira/browse/HUDI-2168
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Testing
>Reporter: Vinay
>Assignee: Vinay
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Users are facing the following exception while executing test cases that depend 
> on starting the Hive service:
>  
> {code:java}
> Got exception: org.apache.hadoop.security.AccessControlException Permission 
> denied: user=anonymous, access=WRITE
> {code}
> This specifically happens at the time of clearing the Hive DB:
> {code:java}
> client.updateHiveSQL("drop database if exists " + 
> hiveSyncConfig.databaseName);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2144) Offline clustering(independent sparkJob) will cause insert action losing data

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-2144.
-
Fix Version/s: 0.9.0
   Resolution: Fixed

> Offline clustering(independent sparkJob) will cause insert action losing data
> -
>
> Key: HUDI-2144
> URL: https://issues.apache.org/jira/browse/HUDI-2144
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yue Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
> Attachments: image-2021-07-08-13-52-00-089.png
>
>
> For now we have two kinds of pipelines for Hudi using Spark:
>  # Streaming insert of data to specific partitions
>  # An offline clustering Spark 
> job (`org.apache.hudi.utilities.HoodieClusteringJob`) to optimize the size of the 
> files pipeline 1 created
> But here is a bug we met that will lose data.
> These steps reproduce the problem stably:
>  # Submit a Spark job to ingest data1 using insert mode.
>  # Schedule a clustering plan using 
> `org.apache.hudi.utilities.HoodieClusteringJob`.
>  # Submit a Spark job again to ingest data2 using insert mode (ensure that 
> a new file slice is created in the same file group, which means small-file 
> tuning for insert is working). Suppose this file group is called file group 1 
> and the new file slice is called file slice 2.
>  # Execute the clustering job planned in step 2.
>  # Query data1+data2; you will find some newly ingested data is lost compared with 
> common ingestion without clustering.
>  
>   !image-2021-07-08-13-52-00-089.png|width=922,height=728!
> Here is the root cause:
> When ingesting data using insert mode, Hudi will find small files and try to 
> append new data to them, aiming to tune the data file size.
> [https://github.com/apache/hudi/blob/650c4455c600b0346fed8b5b6aa4cc0bf3452e8c/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L149]
> tries to filter small files in clustering, but it only works when the user sets 
> `hoodie.clustering.inline` to true, which is not good enough when users use 
> offline clustering.
> I just raised a PR to fix it and tested it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2144) Offline clustering(independent sparkJob) will cause insert action losing data

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2144:

Status: In Progress  (was: Open)

> Offline clustering(independent sparkJob) will cause insert action losing data
> -
>
> Key: HUDI-2144
> URL: https://issues.apache.org/jira/browse/HUDI-2144
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yue Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-07-08-13-52-00-089.png
>
>
> For now we have two kinds of pipelines for Hudi using Spark:
>  # Streaming insert of data to specific partitions
>  # An offline clustering Spark 
> job (`org.apache.hudi.utilities.HoodieClusteringJob`) to optimize the size of the 
> files pipeline 1 created
> But here is a bug we met that will lose data.
> These steps reproduce the problem stably:
>  # Submit a Spark job to ingest data1 using insert mode.
>  # Schedule a clustering plan using 
> `org.apache.hudi.utilities.HoodieClusteringJob`.
>  # Submit a Spark job again to ingest data2 using insert mode (ensure that 
> a new file slice is created in the same file group, which means small-file 
> tuning for insert is working). Suppose this file group is called file group 1 
> and the new file slice is called file slice 2.
>  # Execute the clustering job planned in step 2.
>  # Query data1+data2; you will find some newly ingested data is lost compared with 
> common ingestion without clustering.
>  
>   !image-2021-07-08-13-52-00-089.png|width=922,height=728!
> Here is the root cause:
> When ingesting data using insert mode, Hudi will find small files and try to 
> append new data to them, aiming to tune the data file size.
> [https://github.com/apache/hudi/blob/650c4455c600b0346fed8b5b6aa4cc0bf3452e8c/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L149]
> tries to filter small files in clustering, but it only works when the user sets 
> `hoodie.clustering.inline` to true, which is not good enough when users use 
> offline clustering.
> I just raised a PR to fix it and tested it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2107) Support Read Log Only MOR Table For Spark

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-2107.
-
Resolution: Fixed

> Support Read Log Only MOR Table For Spark
> -
>
> Key: HUDI-2107
> URL: https://issues.apache.org/jira/browse/HUDI-2107
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently we cannot support reading a log-only MOR table (which is generated by 
> indexes like InMemoryIndex, HBaseIndex and FlinkIndex that support indexing 
> log files) for Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2107) Support Read Log Only MOR Table For Spark

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2107:

Status: In Progress  (was: Open)

> Support Read Log Only MOR Table For Spark
> -
>
> Key: HUDI-2107
> URL: https://issues.apache.org/jira/browse/HUDI-2107
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently we cannot support reading a log-only MOR table (which is generated by 
> indexes like InMemoryIndex, HBaseIndex and FlinkIndex that support indexing 
> log files) for Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2087) Support Append only in Flink stream

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2087:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Support Append only in Flink stream
> ---
>
> Key: HUDI-2087
> URL: https://issues.apache.org/jira/browse/HUDI-2087
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: yuzhaojing
>Assignee: yuzhaojing
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
> Attachments: image-2021-07-08-22-04-30-039.png, 
> image-2021-07-08-22-04-40-018.png
>
>
> It is necessary to support append mode in the Flink stream, as the data lake 
> should be able to write log-type data as Parquet with high performance, without 
> merging.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2099) hive lock which state is WATING should be released, otherwise this hive lock will be locked forever

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-2099.
-
Fix Version/s: (was: 0.8.0)
   0.9.0
   Resolution: Fixed

>  hive lock which state is WATING should be released,  otherwise this hive 
> lock will be locked forever
> -
>
> Key: HUDI-2099
> URL: https://issues.apache.org/jira/browse/HUDI-2099
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Affects Versions: 0.8.0
> Environment: spark3.1.1
> hive3.1.1
> hadoop3.1.1
>Reporter: tao meng
>Assignee: tao meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> When we fail to acquire the hive lock and the lock state is WAITING, we should 
> release this WAITING lock; otherwise this hive lock will stay locked forever.
> Test steps:
> Use a hive lock to control concurrent writes for hudi; let's call this lock 
> hive_lock.
> Start three writers writing the hudi table with hive_lock concurrently; one 
> of the writers will fail to acquire the hive lock due to competition.
> *Exception in thread "main" org.apache.hudi.exception.HoodieLockException: 
> Unable to acquire lock, lock object LockResponse(lockid:76, state:WAITING)*
>  
> Start another writer to write the hudi table using the same hive_lock; we then 
> find that hive_lock is locked forever and there is no way to acquire it.
> *Exception in thread "main" org.apache.hudi.exception.HoodieLockException: 
> Unable to acquire lock, lock object LockResponse(lockid:87, state:WAITING)*
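
A minimal sketch of the fix idea against the Hive metastore lock API (illustrative only; the actual Hudi lock provider code may differ): if the lock request does not come back ACQUIRED, unlock it before failing, so a WAITING lock is not left behind.

{code:java}
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.LockRequest;
import org.apache.hadoop.hive.metastore.api.LockResponse;
import org.apache.hadoop.hive.metastore.api.LockState;
import org.apache.thrift.TException;

public final class HiveLockSketch {
  // Illustrative sketch: acquire a metastore lock and clean it up if it did not
  // reach ACQUIRED state, so a lock stuck in WAITING is not registered forever.
  static LockResponse acquireOrCleanUp(IMetaStoreClient client, LockRequest request) throws TException {
    LockResponse resp = client.lock(request);
    if (resp.getState() != LockState.ACQUIRED) {
      client.unlock(resp.getLockid()); // release the WAITING lock before giving up
      // the real code would throw org.apache.hudi.exception.HoodieLockException here
      throw new IllegalStateException("Unable to acquire lock, lock object " + resp);
    }
    return resp;
  }
}
{code}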



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2099) hive lock which state is WATING should be released, otherwise this hive lock will be locked forever

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2099:

Status: In Progress  (was: Open)

>  hive lock which state is WATING should be released,  otherwise this hive 
> lock will be locked forever
> -
>
> Key: HUDI-2099
> URL: https://issues.apache.org/jira/browse/HUDI-2099
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Affects Versions: 0.8.0
> Environment: spark3.1.1
> hive3.1.1
> hadoop3.1.1
>Reporter: tao meng
>Assignee: tao meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> When we fail to acquire the hive lock and the lock state is WAITING, we should 
> release this WAITING lock; otherwise this hive lock will stay locked forever.
> Test steps:
> Use a hive lock to control concurrent writes for hudi; let's call this lock 
> hive_lock.
> Start three writers writing the hudi table with hive_lock concurrently; one 
> of the writers will fail to acquire the hive lock due to competition.
> *Exception in thread "main" org.apache.hudi.exception.HoodieLockException: 
> Unable to acquire lock, lock object LockResponse(lockid:76, state:WAITING)*
>  
> Start another writer to write the hudi table using the same hive_lock; we then 
> find that hive_lock is locked forever and there is no way to acquire it.
> *Exception in thread "main" org.apache.hudi.exception.HoodieLockException: 
> Unable to acquire lock, lock object LockResponse(lockid:87, state:WAITING)*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql

2021-07-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390829#comment-17390829
 ] 

ASF GitHub Bot commented on HUDI-2208:
--

nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r680212711



##
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala
##
@@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
       .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
       .toBoolean
 
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL  // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." +
+            s" Please disable $INSERT_DROP_DUPS_OPT_KEY and try again.")
+        // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table.
+        case (_, true, true, _) if !isPartitionedTable => BULK_INSERT_OPERATION_OPT_VAL
+        // insert overwrite partition
+        case (_, _, true, _) if isPartitionedTable => INSERT_OVERWRITE_OPERATION_OPT_VAL
+        // insert overwrite table
+        case (_, _, true, _) if !isPartitionedTable => INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL
+        // if the table has primaryKey and the dropDuplicate has disable, use the upsert operation
+        case (true, false, false, false) => UPSERT_OPERATION_OPT_VAL
+        // if enableBulkInsert is true and the table is non-primaryKeyed, use the bulk insert operation
+        case (false, true, _, _) => BULK_INSERT_OPERATION_OPT_VAL
+        // for the rest case, use the insert operation
+        case (_, _, _, _) => INSERT_OPERATION_OPT_VAL

Review comment:
   Here is my thought on choosing the right operation. Having too many case 
   statements complicates things and is error prone. As I mentioned earlier, we 
   should try to do any valid conversions in HoodieSparkSqlWriter; only the ones 
   that apply just to sql dml should be kept here. 
   Anyway, here is one simplified approach, ignoring the primary vs non-primary 
   key table distinction for now. We can come back to that once we have consensus 
   on this. 
   
   We need just two configs: 
   hoodie.sql.enable.bulk_insert (default false)
   hoodie.sql.overwrite.entire.table (default true)
   
   From the sql syntax, two commands are allowed, "INSERT INTO" and 
   "INSERT OVERWRITE", and these need to map to 4 operations on the hudi end 
   (insert, bulk_insert, insert_overwrite and insert_overwrite_table):
   
   "INSERT" with no other configs set -> insert operation
   "INSERT" with bulk insert enabled -> bulk_insert
   "INSERT OVERWRITE" with no other configs set -> insert_overwrite_table operation
   "INSERT OVERWRITE" with hoodie.sql.overwrite.entire.table = false -> insert_overwrite operation
   "INSERT OVERWRITE" with bulk_insert enabled -> bulk_insert; pass the right save mode to HoodieSparkSqlWriter
   "INSERT OVERWRITE" with bulk_insert enabled and hoodie.sql.overwrite.entire.table = false -> bulk_insert; pass the right save mode to HoodieSparkSqlWriter
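
Restating the mapping above as code, a rough Scala sketch; the config key names and 
string operation values are taken from this comment rather than Hudi's actual 
constants, and the primary-key handling is deliberately left out:

// Sketch of the proposed mapping. Config keys and operation strings mirror the
// comment above, not DataSourceWriteOptions; primary-key handling is omitted.
object SqlOperationMapping {
  val EnableBulkInsertKey = "hoodie.sql.enable.bulk_insert"         // default false
  val OverwriteEntireTableKey = "hoodie.sql.overwrite.entire.table" // default true

  def chooseOperation(isOverwrite: Boolean, parameters: Map[String, String]): String = {
    val enableBulkInsert = parameters.getOrElse(EnableBulkInsertKey, "false").toBoolean
    val overwriteEntireTable = parameters.getOrElse(OverwriteEntireTableKey, "true").toBoolean

    (isOverwrite, enableBulkInsert, overwriteEntireTable) match {
      case (false, false, _)    => "insert"                 // INSERT, no extra configs
      case (false, true, _)     => "bulk_insert"            // INSERT with bulk insert enabled
      case (true, true, _)      => "bulk_insert"            // INSERT OVERWRITE with bulk insert; save mode decides table vs partition
      case (true, false, true)  => "insert_overwrite_table" // INSERT OVERWRITE, default
      case (true, false, false) => "insert_overwrite"       // INSERT OVERWRITE of matching partitions
    }
  }
}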
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [SQL] Support Bulk Insert For Spark Sql
> ---
>
> Key: HUDI-2208
> URL: https://issues.apache.org/jira/browse/HUDI-2208
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Blocker
>  Labels: pull-request-available, release-blocker
>
> Support the bulk insert for spark sql



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan commented on a change in pull request #3328: [HUDI-2208] Support Bulk Insert For Spark Sql

2021-07-30 Thread GitBox


nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r680212711



##
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala
##
@@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
       .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
       .toBoolean
 
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL  // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." +
+            s" Please disable $INSERT_DROP_DUPS_OPT_KEY and try again.")
+        // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table.
+        case (_, true, true, _) if !isPartitionedTable => BULK_INSERT_OPERATION_OPT_VAL
+        // insert overwrite partition
+        case (_, _, true, _) if isPartitionedTable => INSERT_OVERWRITE_OPERATION_OPT_VAL
+        // insert overwrite table
+        case (_, _, true, _) if !isPartitionedTable => INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL
+        // if the table has primaryKey and the dropDuplicate has disable, use the upsert operation
+        case (true, false, false, false) => UPSERT_OPERATION_OPT_VAL
+        // if enableBulkInsert is true and the table is non-primaryKeyed, use the bulk insert operation
+        case (false, true, _, _) => BULK_INSERT_OPERATION_OPT_VAL
+        // for the rest case, use the insert operation
+        case (_, _, _, _) => INSERT_OPERATION_OPT_VAL

Review comment:
   Here is my thought on choosing the right operation. Having too many case 
   statements complicates things and is error prone. As I mentioned earlier, we 
   should try to do any valid conversions in HoodieSparkSqlWriter; only the ones 
   that apply just to sql dml should be kept here. 
   Anyway, here is one simplified approach, ignoring the primary vs non-primary 
   key table distinction for now. We can come back to that once we have consensus 
   on this. 
   
   We need just two configs: 
   hoodie.sql.enable.bulk_insert (default false)
   hoodie.sql.overwrite.entire.table (default true)
   
   From the sql syntax, two commands are allowed, "INSERT INTO" and 
   "INSERT OVERWRITE", and these need to map to 4 operations on the hudi end 
   (insert, bulk_insert, insert_overwrite and insert_overwrite_table):
   
   "INSERT" with no other configs set -> insert operation
   "INSERT" with bulk insert enabled -> bulk_insert
   "INSERT OVERWRITE" with no other configs set -> insert_overwrite_table operation
   "INSERT OVERWRITE" with hoodie.sql.overwrite.entire.table = false -> insert_overwrite operation
   "INSERT OVERWRITE" with bulk_insert enabled -> bulk_insert; pass the right save mode to HoodieSparkSqlWriter
   "INSERT OVERWRITE" with bulk_insert enabled and hoodie.sql.overwrite.entire.table = false -> bulk_insert; pass the right save mode to HoodieSparkSqlWriter
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Comment Edited] (HUDI-1842) [SQL] Spark Sql Support For The Exists Hoodie Table

2021-07-30 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390817#comment-17390817
 ] 

sivabalan narayanan edited comment on HUDI-1842 at 7/30/21, 11:57 PM:
--

I was just playing around; just dumping my findings here. My intention: in 
case we upgrade/update the hoodie.properties with the appropriate entries, 
what does it take to start using the table in spark-sql?

 

I tried creating a table via spark shell as per quick start. And then executed 
this command via spark-sql

create table hudi_cow1 (begin_lat double, begin_lon double, driver string, 
end_lat double, end_lon double, fare double, partitionpath string, rider 
string, ts bigint, uuid string) using hudi options(primaryKey = 'uuid', 
precombineField = 'ts') partitioned by (partitionpath) location 
'file:///tmp/hudi_cow/';

table name has to match the table name as per hoodie.properties. 

Note: I created this table with latest master and so it had all the required 
properties required for sql even though it was created with spark ds. 

 

After this, I tried to insert records via spark-sql

insert into hudi_cow1 values(1.0, 2.0, "driver_1", 3.0, 4.0, 100.0, "rider_1", 
12345, "ajsdfih23498q405qtahgkfsg", "americas/united_states/san_francisco/");

 

I see that for record key and partition path, respective field names are 
prefixed to col values for meta fields. 

Result of select command. 

// showing 2 rows. 1 row was inserted via spark-shell and another one(2nd row) 
is inserted via spark-sql. 
{code:java}
20210730180218  20210730180218_1_8  ef9f4d56-12e0-4266-91ad-c4bca0580db6
americas/united_states/san_francisco
14e81925-2479-4a57-a932-42d1078fe988-0_1-27-28_20210730180218.parque0.1856488085068272
  0.9694586417848392  driver-213  0.38186367037201974 
0.25252652214479043 33.92216483948643   rider-213   1627136598584   
ef9f4d56-12e0-4266-91ad-c4bca0580db6americas/united_states/san_francisco

20210730190704  20210730190704_0_1001   uuid:ajsdfih23498q405qtahgkfsg  
partitionpath=americas%2Funited_states%2Fsan_francisco%2F   
9a350a54-bb5d-4aba-bf5e-bbcc665c4449-0_0-66-3383_20210730190704.parquet 1.0 
2.0 driver_13.0 4.0 100.0   rider_1 
1234ajsdfih23498q405qtahgkfsg   americas/united_states/san_francisco/
{code}
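
A rough Scala sketch of the formatting visible in the second row above (field name 
prefixed to the record key, partition path written as field=value and URL-encoded), 
presumably coming from a ComplexKeyGenerator-style key generator; this only mimics 
the observed output and is not the actual Hudi key generator implementation:

{code:scala}
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Mimics the observed formatting; not the actual Hudi key generator code.
def recordKey(field: String, value: String): String =
  s"$field:$value" // e.g. "uuid:ajsdfih23498q405qtahgkfsg"

def partitionPath(field: String, value: String): String =
  s"$field=${URLEncoder.encode(value, StandardCharsets.UTF_8.name())}"
  // e.g. "partitionpath=americas%2Funited_states%2Fsan_francisco%2F"
{code}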
 

 

 


was (Author: shivnarayan):
I was just playing around; just dumping my findings here. My intention: in 
case we update the hoodie.properties with the appropriate entries, what does 
it take to start using the table in spark-sql?

 

I tried creating a table via spark shell as per quick start. And then executed 
this command via spark-sql

create table hudi_cow1 (begin_lat double, begin_lon double, driver string, 
end_lat double, end_lon double, fare double, partitionpath string, rider 
string, ts bigint, uuid string) using hudi options(primaryKey = 'uuid', 
precombineField = 'ts') partitioned by (partitionpath) location 
'file:///tmp/hudi_cow/';

table name has to match the table name as per hoodie.properties. 

Note: I created this table with latest master and so it had all the required 
properties required for sql even though it was created with spark ds. 

 

After this, I tried to insert records

insert into hudi_cow1 values(1.0, 2.0, "driver_1", 3.0, 4.0, 100.0, "rider_1", 
12345, "ajsdfih23498q405qtahgkfsg", "americas/united_states/san_francisco/");

 

I see that for record key and partition path, respective field names are 
prefixed to col values for meta fields. 

Result of select command. 

// showing 2 rows. 1 row was inserted via spark-shell and another one(2nd row) 
is inserted via spark-sql. 
{code:java}
20210730180218  20210730180218_1_8  ef9f4d56-12e0-4266-91ad-c4bca0580db6
americas/united_states/san_francisco
14e81925-2479-4a57-a932-42d1078fe988-0_1-27-28_20210730180218.parque0.1856488085068272
  0.9694586417848392  driver-213  0.38186367037201974 
0.25252652214479043 33.92216483948643   rider-213   1627136598584   
ef9f4d56-12e0-4266-91ad-c4bca0580db6americas/united_states/san_francisco

20210730190704  20210730190704_0_1001   uuid:ajsdfih23498q405qtahgkfsg  
partitionpath=americas%2Funited_states%2Fsan_francisco%2F   
9a350a54-bb5d-4aba-bf5e-bbcc665c4449-0_0-66-3383_20210730190704.parquet 1.0 
2.0 driver_13.0 4.0 100.0   rider_1 
1234ajsdfih23498q405qtahgkfsg   americas/united_states/san_francisco/
{code}
 

 

 

> [SQL] Spark Sql Support For The Exists Hoodie Table
> ---
>
> Key: HUDI-1842
> URL: https://issues.apache.org/jira/browse/HUDI-1842
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: pengzhiwei
>Priority: Blocker
>  Labels: release-blocker
> Fix For: 0.9.0
>
>
> In order to support spark sql for 

[jira] [Comment Edited] (HUDI-1842) [SQL] Spark Sql Support For The Exists Hoodie Table

2021-07-30 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390817#comment-17390817
 ] 

sivabalan narayanan edited comment on HUDI-1842 at 7/30/21, 11:52 PM:
--

I was just playing around; just dumping my findings here. My intention: in 
case we update the hoodie.properties with the appropriate entries, what does 
it take to start using the table in spark-sql?

 

I tried creating a table via spark shell as per quick start. And then executed 
this command via spark-sql

create table hudi_cow1 (begin_lat double, begin_lon double, driver string, 
end_lat double, end_lon double, fare double, partitionpath string, rider 
string, ts bigint, uuid string) using hudi options(primaryKey = 'uuid', 
precombineField = 'ts') partitioned by (partitionpath) location 
'file:///tmp/hudi_cow/';

table name has to match the table name as per hoodie.properties. 

Note: I created this table with latest master and so it had all the required 
properties required for sql even though it was created with spark ds. 

 

After this, I tried to insert records

insert into hudi_cow1 values(1.0, 2.0, "driver_1", 3.0, 4.0, 100.0, "rider_1", 
12345, "ajsdfih23498q405qtahgkfsg", "americas/united_states/san_francisco/");

 

I see that for record key and partition path, respective field names are 
prefixed to col values for meta fields. 

Result of select command. 

// showing 2 rows. 1 row was inserted via spark-shell and another one(2nd row) 
is inserted via spark-sql. 
{code:java}
20210730180218  20210730180218_1_8  ef9f4d56-12e0-4266-91ad-c4bca0580db6
americas/united_states/san_francisco
14e81925-2479-4a57-a932-42d1078fe988-0_1-27-28_20210730180218.parque0.1856488085068272
  0.9694586417848392  driver-213  0.38186367037201974 
0.25252652214479043 33.92216483948643   rider-213   1627136598584   
ef9f4d56-12e0-4266-91ad-c4bca0580db6americas/united_states/san_francisco

20210730190704  20210730190704_0_1001   uuid:ajsdfih23498q405qtahgkfsg  
partitionpath=americas%2Funited_states%2Fsan_francisco%2F   
9a350a54-bb5d-4aba-bf5e-bbcc665c4449-0_0-66-3383_20210730190704.parquet 1.0 
2.0 driver_13.0 4.0 100.0   rider_1 
1234ajsdfih23498q405qtahgkfsg   americas/united_states/san_francisco/
{code}
 

 

 


was (Author: shivnarayan):
I was just playing around. Just dumping my findings here. 

I tried creating a table via spark shell as per quick start. And then executed 
this command via spark-sql

create table hudi_cow1 (begin_lat double, begin_lon double, driver string, 
end_lat double, end_lon double, fare double, partitionpath string, rider 
string, ts bigint, uuid string) using hudi options(primaryKey = 'uuid', 
precombineField = 'ts') partitioned by (partitionpath) location 
'file:///tmp/hudi_cow/';

table name has to match the table name as per hoodie.properties. 

Note: I created this table with latest master and so it had all the required 
properties required for sql even though it was created with spark ds. 

 

After this, I tried to insert records

insert into hudi_cow1 values(1.0, 2.0, "driver_1", 3.0, 4.0, 100.0, "rider_1", 
12345, "ajsdfih23498q405qtahgkfsg", "americas/united_states/san_francisco/");

 

I see that for record key and partition path, respective field names are 
prefixed to col values for meta fields. 

Result of select command. 

// showing 2 rows. 1 row was inserted via spark-shell and another one(2nd row) 
is inserted via spark-sql. 
{code:java}
20210730180218  20210730180218_1_8  ef9f4d56-12e0-4266-91ad-c4bca0580db6
americas/united_states/san_francisco
14e81925-2479-4a57-a932-42d1078fe988-0_1-27-28_20210730180218.parque0.1856488085068272
  0.9694586417848392  driver-213  0.38186367037201974 
0.25252652214479043 33.92216483948643   rider-213   1627136598584   
ef9f4d56-12e0-4266-91ad-c4bca0580db6americas/united_states/san_francisco

20210730190704  20210730190704_0_1001   uuid:ajsdfih23498q405qtahgkfsg  
partitionpath=americas%2Funited_states%2Fsan_francisco%2F   
9a350a54-bb5d-4aba-bf5e-bbcc665c4449-0_0-66-3383_20210730190704.parquet 1.0 
2.0 driver_13.0 4.0 100.0   rider_1 
1234ajsdfih23498q405qtahgkfsg   americas/united_states/san_francisco/
{code}
 

 

 

> [SQL] Spark Sql Support For The Exists Hoodie Table
> ---
>
> Key: HUDI-1842
> URL: https://issues.apache.org/jira/browse/HUDI-1842
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: pengzhiwei
>Priority: Blocker
>  Labels: release-blocker
> Fix For: 0.9.0
>
>
> In order to support spark sql for hoodie, we persist some table properties to 
> the hoodie.properties. e.g. primaryKey, preCombineField, partition columns.  
> For the exists hoodie 

[jira] [Comment Edited] (HUDI-1842) [SQL] Spark Sql Support For The Exists Hoodie Table

2021-07-30 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390817#comment-17390817
 ] 

sivabalan narayanan edited comment on HUDI-1842 at 7/30/21, 11:48 PM:
--

I was just playing around. Just dumping my findings here. 

I tried creating a table via spark shell as per quick start. And then executed 
this command via spark-sql

create table hudi_cow1 (begin_lat double, begin_lon double, driver string, 
end_lat double, end_lon double, fare double, partitionpath string, rider 
string, ts bigint, uuid string) using hudi options(primaryKey = 'uuid', 
precombineField = 'ts') partitioned by (partitionpath) location 
'file:///tmp/hudi_cow/';

table name has to match the table name as per hoodie.properties. 

Note: I created this table with latest master and so it had all the required 
properties required for sql even though it was created with spark ds. 

 

After this, I tried to insert records

insert into hudi_cow1 values(1.0, 2.0, "driver_1", 3.0, 4.0, 100.0, "rider_1", 
12345, "ajsdfih23498q405qtahgkfsg", "americas/united_states/san_francisco/");

 

I see that for record key and partition path, respective field names are 
prefixed to col values for meta fields. 

Result of select command. 

// showing 2 rows. 1 row was inserted via spark-shell and another one(2nd row) 
is inserted via spark-sql. 
{code:java}
20210730180218  20210730180218_1_8  ef9f4d56-12e0-4266-91ad-c4bca0580db6
americas/united_states/san_francisco
14e81925-2479-4a57-a932-42d1078fe988-0_1-27-28_20210730180218.parque0.1856488085068272
  0.9694586417848392  driver-213  0.38186367037201974 
0.25252652214479043 33.92216483948643   rider-213   1627136598584   
ef9f4d56-12e0-4266-91ad-c4bca0580db6americas/united_states/san_francisco

20210730190704  20210730190704_0_1001   uuid:ajsdfih23498q405qtahgkfsg  
partitionpath=americas%2Funited_states%2Fsan_francisco%2F   
9a350a54-bb5d-4aba-bf5e-bbcc665c4449-0_0-66-3383_20210730190704.parquet 1.0 
2.0 driver_13.0 4.0 100.0   rider_1 
1234ajsdfih23498q405qtahgkfsg   americas/united_states/san_francisco/
{code}
 

 

 


was (Author: shivnarayan):
I was just playing around. Just dumping my findings here. 

I tried creating a table via spark shell as per quick start. And then executed 
this command via spark-sql

create table hudi_cow1 (begin_lat double, begin_lon double, driver string, 
end_lat double, end_lon double, fare double, partitionpath string, rider 
string, ts bigint, uuid string) using hudi options(primaryKey = 'uuid', 
precombineField = 'ts') partitioned by (partitionpath) location 
'file:///tmp/hudi_cow/';

table name has to match the table name as per hoodie.properties. 

Note: I created this table with latest master and so it had all the required 
properties required for sql even though it was created with spark ds. 

 

After this, I tried to insert records

insert into hudi_cow1 values(1.0, 2.0, "driver_1", 3.0, 4.0, 100.0, "rider_1", 
12345, "ajsdfih23498q405qtahgkfsg", "americas/united_states/san_francisco/");

 

I see that for record key and partition path, respective field names are 
prefixed to col values. 

Result of select command. 

// showing 2 rows. 1 row was inserted via spark-shell and another one(2nd row) 
is inserted via spark-sql. 
{code:java}
20210730180218  20210730180218_1_8  ef9f4d56-12e0-4266-91ad-c4bca0580db6
americas/united_states/san_francisco
14e81925-2479-4a57-a932-42d1078fe988-0_1-27-28_20210730180218.parque0.1856488085068272
  0.9694586417848392  driver-213  0.38186367037201974 
0.25252652214479043 33.92216483948643   rider-213   1627136598584   
ef9f4d56-12e0-4266-91ad-c4bca0580db6americas/united_states/san_francisco

20210730190704  20210730190704_0_1001   uuid:ajsdfih23498q405qtahgkfsg  
partitionpath=americas%2Funited_states%2Fsan_francisco%2F   
9a350a54-bb5d-4aba-bf5e-bbcc665c4449-0_0-66-3383_20210730190704.parquet 1.0 
2.0 driver_13.0 4.0 100.0   rider_1 
1234ajsdfih23498q405qtahgkfsg   americas/united_states/san_francisco/
{code}
 

 

 

> [SQL] Spark Sql Support For The Exists Hoodie Table
> ---
>
> Key: HUDI-1842
> URL: https://issues.apache.org/jira/browse/HUDI-1842
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: pengzhiwei
>Priority: Blocker
>  Labels: release-blocker
> Fix For: 0.9.0
>
>
> In order to support spark sql for hoodie, we persist some table properties to 
> hoodie.properties, e.g. primaryKey, preCombineField, partition columns. For 
> existing hoodie tables, these properties are missing, so we need to do some 
> work in UpgradeDowngrade to support spark sql for those existing tables.
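
A minimal sketch of the idea in the description, persisting the SQL-relevant options 
into hoodie.properties during upgrade; the property key names and the plain 
java.util.Properties round-trip are illustrative assumptions rather than the actual 
UpgradeDowngrade code:

{code:scala}
import java.io.{FileInputStream, FileOutputStream}
import java.util.Properties

// Illustrative only: key names may not match Hudi's real HoodieTableConfig keys.
def addSqlProperties(propsPath: String,
                     primaryKey: String,
                     preCombineField: String,
                     partitionColumns: Seq[String]): Unit = {
  val props = new Properties()
  val in = new FileInputStream(propsPath)
  try props.load(in) finally in.close()

  // Only fill in what is missing, so tables that already carry these keys are untouched.
  props.putIfAbsent("hoodie.table.recordkey.fields", primaryKey)
  props.putIfAbsent("hoodie.table.precombine.field", preCombineField)
  props.putIfAbsent("hoodie.table.partition.fields", partitionColumns.mkString(","))

  val out = new FileOutputStream(propsPath)
  try props.store(out, "table properties updated for spark sql support") finally out.close()
}
{code}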



--
This message was sent 

[jira] [Resolved] (HUDI-2115) FileSlices in the filegroup is not descending by timestamp

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-2115.
-
Fix Version/s: 0.9.0
   Resolution: Fixed

> FileSlices in the filegroup is not descending by timestamp
> --
>
> Key: HUDI-2115
> URL: https://issues.apache.org/jira/browse/HUDI-2115
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: XiaoyuGeng
>Assignee: XiaoyuGeng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

