[GitHub] [hudi] n3nash edited a comment on pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas

2021-01-11 Thread GitBox


n3nash edited a comment on pull request #2424:
URL: https://github.com/apache/hudi/pull/2424#issuecomment-758475039


   @pratyakshsharma in that case, can you review this PR? @prashantwason I had missed pushing some local changes; can you take another pass? I think it should address all your comments.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua edited a comment on pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism

2021-01-11 Thread GitBox


yanghua edited a comment on pull request #2433:
URL: https://github.com/apache/hudi/pull/2433#issuecomment-758474842


   > The file check of each task is useless, because even if a task of the source has no data for some time interval, the checkpoint can still trigger normally. So all tasks checkpointing successfully does not mean there is data. (I have solved this in my #step1 PR though.)

   That's true. We could remove the file check logic, while the implementation of multiple parallelism is still reasonable.
   
   > There is no need to checkpoint the write status in KeyedWriteProcessOperator, because we cannot start a new instant if the last instant fails; a more proper/simple way is to retry the commit action several times and trigger a failover if it still fails.

   Sounds reasonable, agreed.
   
   > BTW, IMO, we should finish RFC-24 first, as fast as possible; it solves many bugs and has many improvements. After that I would add a compatible pipeline, this PR could apply there, and I can help to review.

   I personally think that a better form of community participation is:

   1) Control the granularity of changes;
   2) Make each submission a complete function point, so that the existing behavior of the code does not change.
   
   I would actually like to know your idea for the compatible version of `OperatorCoordinator`. What is the design? Will it be better than the current one? Will it be better than the design in this PR?

   Right now, all of this is opaque.
   
   Your Step 1 is a large-scale refactoring, and merging that code would make this client immediately unavailable. We are currently in a critical period before the 0.7 release (and what if we do not have the capacity to merge steps 2 and 3 in the short term?). Why can't we optimize it step by step?

   In fact, the first and second things that need optimizing are the File Assigner. SF Express has already implemented it and is ready to provide a PR. If we improve on the existing basis, the risks and changes are controllable, right? I think the correct order is to provide a more "elegant implementation" based on OperatorCoordinator for the higher version at the end.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on a change in pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas

2021-01-11 Thread GitBox


n3nash commented on a change in pull request #2424:
URL: https://github.com/apache/hudi/pull/2424#discussion_r71734



##
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##
@@ -292,53 +284,57 @@ public static GenericRecord stitchRecords(GenericRecord 
left, GenericRecord righ
 return result;
   }
 
-  /**
-   * Given a avro record with a given schema, rewrites it into the new schema 
while setting fields only from the old
-   * schema.
-   */
-  public static GenericRecord rewriteRecord(GenericRecord record, Schema 
newSchema) {
-return rewrite(record, getCombinedFieldsToWrite(record.getSchema(), 
newSchema), newSchema);
-  }
-
   /**
* Given a avro record with a given schema, rewrites it into the new schema 
while setting fields only from the new
* schema.
+   * NOTE: Here, the assumption is that you cannot go from an evolved schema 
(schema with (N) fields)
+   * to an older schema (schema with (N-1) fields). All fields present in the 
older record schema MUST be present in the
+   * new schema and the default/existing values are carried over.
+   * This particular method does the following things :
+   * a) Create a new empty GenericRecord with the new schema.
+   * b) Set default values for all fields of this transformed schema in the 
new GenericRecord or copy over the data
+   * from the old schema to the new schema
+   * c) hoodie_metadata_fields have a special treatment. This is done because 
for code generated AVRO classes
+   * (only HoodieMetadataRecord), the avro record is a SpecificBaseRecord type 
instead of a GenericRecord.
+   * SpecificBaseRecord throws null pointer exception for record.get(name) if 
name is not present in the schema of the
+   * record (which happens when converting a SpecificBaseRecord without 
hoodie_metadata_fields to a new record with.
*/
-  public static GenericRecord 
rewriteRecordWithOnlyNewSchemaFields(GenericRecord record, Schema newSchema) {
-return rewrite(record, new LinkedHashSet<>(newSchema.getFields()), 
newSchema);
-  }
-
-  private static GenericRecord rewrite(GenericRecord record, 
LinkedHashSet fieldsToWrite, Schema newSchema) {
+  public static GenericRecord rewriteRecord(GenericRecord oldRecord, Schema 
newSchema) {
 GenericRecord newRecord = new GenericData.Record(newSchema);
-for (Schema.Field f : fieldsToWrite) {
-  if (record.get(f.name()) == null) {
+boolean isSpecificRecord = oldRecord instanceof SpecificRecordBase;
+for (Schema.Field f : newSchema.getFields()) {
+  if (!isMetadataField(f.name()) && oldRecord.get(f.name()) == null) {
+// if not metadata field, set defaults
 if (f.defaultVal() instanceof JsonProperties.Null) {
   newRecord.put(f.name(), null);
 } else {
   newRecord.put(f.name(), f.defaultVal());
 }
-  } else {
-newRecord.put(f.name(), record.get(f.name()));
+  } else if (!isMetadataField(f.name())) {
+// if not metadata field, copy old value
+newRecord.put(f.name(), oldRecord.get(f.name()));
+  } else if (!isSpecificRecord) {
+// if not specific record, copy value for hoodie metadata fields as 
well

Review comment:
   We don't need to. I have updated the comments; please take a read and let me know if they are clear.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas

2021-01-11 Thread GitBox


n3nash commented on pull request #2424:
URL: https://github.com/apache/hudi/pull/2424#issuecomment-758475039


   @pratyakshsharma in that case, can you review this PR?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism

2021-01-11 Thread GitBox


yanghua commented on pull request #2433:
URL: https://github.com/apache/hudi/pull/2433#issuecomment-758474842


   > The file check of each task is useless, because even if a task of the source has no data for some time interval, the checkpoint can still trigger normally. So all tasks checkpointing successfully does not mean there is data. (I have solved this in my #step1 PR though.)

   That's true. We could remove the file check logic, while the implementation of multiple parallelism is still reasonable.
   
   > There is no need to checkpoint the write status in KeyedWriteProcessOperator, because we cannot start a new instant if the last instant fails; a more proper/simple way is to retry the commit action several times and trigger a failover if it still fails.

   Sounds reasonable, agreed.
   
   > BTW, IMO, we should finish RFC-24 first, as fast as possible; it solves many bugs and has many improvements. After that I would add a compatible pipeline, this PR could apply there, and I can help to review.

   I personally think that a better form of community participation is:

   1) Control the granularity of changes;
   2) Make each submission a complete function point, so that the existing behavior of the code does not change.
   
   I actually want to know whether you have a plan for the compatible version of `OperatorCoordinator`. What is the design? Will it be better than the current one? Will it be better than the design in this PR?

   Right now, all of this is opaque.
   
   Your Step 1 is a large-scale refactoring, and merging that code would make this client immediately unavailable. We are currently in a critical period before the 0.7 release (and what if we do not have the capacity to merge steps 2 and 3 in the short term?). Why can't we optimize it step by step?

   In fact, the first and second things that need optimizing are the File Assigner. SF Express has already implemented it and is ready to provide a PR. If we improve on the existing basis, the risks and changes are controllable, right? I think the correct order is to provide a more "elegant implementation" based on OperatorCoordinator for the higher version at the end.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pratyakshsharma commented on pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas

2021-01-11 Thread GitBox


pratyakshsharma commented on pull request #2424:
URL: https://github.com/apache/hudi/pull/2424#issuecomment-758461542


   @n3nash In my previous org, we were dealing with a similar scenario where fields were getting deleted from a few tables in production. Yes, the parquet-avro reader will throw an exception in the scenario you mentioned. We were actually using a schema registry to create and store an uber schema, so that every field is present in the final schema before actually writing the parquet files. We created the uber schema at the start of DeltaStreamer and used the same schema for the ingestion.

   I guess all this is beyond the scope of this PR. We can initiate a separate discussion to support deletion of fields from the schema. :)
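
   For illustration, a minimal sketch of the uber-schema idea described above (hypothetical class and method names, assuming Avro 1.9+ where `Schema.Field` accepts an `Object` default; the real setup used a schema registry rather than an in-process merge):

   ```java
   import java.util.ArrayList;
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;

   import org.apache.avro.Schema;

   // Build an "uber" schema containing every field ever seen, so records written
   // with an older (smaller) schema remain readable. For fields present in both
   // schemas, the definition from the newer schema wins.
   public class UberSchemaSketch {
     public static Schema mergeSchemas(Schema older, Schema newer) {
       Map<String, Schema.Field> merged = new LinkedHashMap<>();
       for (Schema.Field f : older.getFields()) {
         merged.put(f.name(), f);
       }
       for (Schema.Field f : newer.getFields()) {
         merged.put(f.name(), f); // newer definition wins
       }
       List<Schema.Field> fields = new ArrayList<>();
       for (Schema.Field f : merged.values()) {
         // Field instances cannot be reused across schemas, so copy each one.
         fields.add(new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()));
       }
       Schema uber = Schema.createRecord(newer.getName(), newer.getDoc(), newer.getNamespace(), false);
       uber.setFields(fields);
       return uber;
     }
   }
   ```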



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] loukey-lj opened a new pull request #2434: [HUDI-1511] InstantGenerateOperator support multiple parallelism

2021-01-11 Thread GitBox


loukey-lj opened a new pull request #2434:
URL: https://github.com/apache/hudi/pull/2434


   InstantGenerateOperator supports multiple parallelism.
   When the InstantGenerateOperator subtask count is greater than 1, we can designate subtask 0 as the main subtask; only the main subtask creates a new instant.
   The prerequisite for creating a new instant is that at least one subtask received data in the current checkpoint. Every subtask creates a tmp file
   whose name is made up of the checkpoint id, the subtask index, and the number of received records.
   The main subtask checks every subtask's file and parses it to decide whether it should create a new instant; a rough sketch of this handshake follows below.
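
   A minimal sketch of that file-based handshake, with a hypothetical file layout and names (not the actual operator code): each subtask writes an empty marker file named `<checkpointId>_<subtaskIndex>_<recordCount>`, and the main subtask scans the directory to decide whether any subtask saw data.

   ```java
   import java.io.IOException;

   import org.apache.hadoop.fs.FileStatus;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   // Hypothetical sketch of per-subtask marker files under a checkpoint directory,
   // inspected by the "main" subtask (index 0) before starting a new instant.
   public class InstantMarkerSketch {

     // Called by every subtask at checkpoint time: the file name carries the information.
     static void writeMarker(FileSystem fs, Path dir, long checkpointId, int subtaskIndex,
                             long recordCount) throws IOException {
       Path marker = new Path(dir, checkpointId + "_" + subtaskIndex + "_" + recordCount);
       fs.create(marker, true).close(); // empty file; overwrite if it already exists
     }

     // Called only by the main subtask: create a new instant only if some subtask received data.
     static boolean shouldCreateInstant(FileSystem fs, Path dir, long checkpointId) throws IOException {
       for (FileStatus status : fs.listStatus(dir)) {
         String[] parts = status.getPath().getName().split("_");
         if (parts.length == 3
             && Long.parseLong(parts[0]) == checkpointId
             && Long.parseLong(parts[2]) > 0) {
           return true; // at least one subtask saw records in this checkpoint
         }
       }
       return false;
     }
   }
   ```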



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas

2021-01-11 Thread GitBox


n3nash commented on pull request #2424:
URL: https://github.com/apache/hudi/pull/2424#issuecomment-758449461


   @pratyakshsharma Do you have a use-case for deleting fields? What is the reason for supporting field deletion? Has the field-deletion case been tested for all types of operations, such as upserts? Generally, the parquet-avro reader will currently throw an exception when a smaller schema (a schema from which a field has been deleted) is used to read a parquet file written with a larger schema. Have you tested this scenario?
   If not, I suggest we revert this particular change and think of a more holistic way to support deletion of fields from the schema.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] loukey-lj closed pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism

2021-01-11 Thread GitBox


loukey-lj closed pull request #2433:
URL: https://github.com/apache/hudi/pull/2433


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar closed issue #2414: [SUPPORT]

2021-01-11 Thread GitBox


bvaradar closed issue #2414:
URL: https://github.com/apache/hudi/issues/2414


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar edited a comment on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2

2021-01-11 Thread GitBox


bvaradar edited a comment on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-758433327


   Hudi does not synchronize on partition path creation. Instead, each executor task that is about to write to a parquet file ensures the directory path exists by issuing an fs.mkdirs call. Added: https://issues.apache.org/jira/browse/HUDI-1523

   If mkdirs is a costly API, can you try this patch? It trades off the mkdirs call for a getFileStatus() call:
   ```
   diff --git a/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java b/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
   index d148b1b8..11b3cb49 100644
   --- a/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
   +++ b/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
   @@ -105,7 +105,9 @@ public abstract class HoodieWriteHandle extends H
      public Path makeNewPath(String partitionPath) {
        Path path = FSUtils.getPartitionPath(config.getBasePath(), partitionPath);
        try {
   -      fs.mkdirs(path); // create a new partition as needed.
   +      if (!fs.exists(path)) {
   +        fs.mkdirs(path); // create a new partition as needed.
   +      }
        } catch (IOException e) {
          throw new HoodieIOException("Failed to make dir " + path, e);
        }
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2

2021-01-11 Thread GitBox


bvaradar commented on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-758433327


   Hudi does not synchronize on partition path creation. Instead, each executor task that is about to write to a parquet file ensures the directory path exists by issuing an fs.mkdirs call. Added: https://issues.apache.org/jira/browse/HUDI-1523

   If mkdirs is a costly API, can you try this patch? It trades off the mkdirs call for a getFileStatus() call:
   ```
   diff --git a/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java b/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
   index d148b1b8..11b3cb49 100644
   --- a/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
   +++ b/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
   @@ -105,7 +105,9 @@ public abstract class HoodieWriteHandle extends H
      public Path makeNewPath(String partitionPath) {
        Path path = FSUtils.getPartitionPath(config.getBasePath(), partitionPath);
        try {
   -      fs.mkdirs(path); // create a new partition as needed.
   +      if (!fs.exists(path)) {
   +        fs.mkdirs(path); // create a new partition as needed.
   +      }
        } catch (IOException e) {
          throw new HoodieIOException("Failed to make dir " + path, e);
        }
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1523) Avoid excessive mkdir calls when creating new files

2021-01-11 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1523:


 Summary: Avoid excessive mkdir calls when creating new files
 Key: HUDI-1523
 URL: https://issues.apache.org/jira/browse/HUDI-1523
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Writer Core
Reporter: Balaji Varadarajan
 Fix For: 0.8.0


https://github.com/apache/hudi/issues/2423



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] Karl-WangSK commented on pull request #2260: [HUDI-1381] Schedule compaction based on time elapsed

2021-01-11 Thread GitBox


Karl-WangSK commented on pull request #2260:
URL: https://github.com/apache/hudi/pull/2260#issuecomment-758424635


   @wangxianghu 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #2432: [SUPPORT] write hudi data failed when using Deltastreamer

2021-01-11 Thread GitBox


bvaradar commented on issue #2432:
URL: https://github.com/apache/hudi/issues/2432#issuecomment-758403084


   @quitozang : Binding to port 0 should ensure that the OS assigns a random free port, so I am not sure why you are seeing the error. You can work around it by setting hoodie.embed.timeline.server.port to a designated port (a number between 1025 and 65535).
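
   For reference, a tiny sketch of that workaround; the property key is the one named above, and 26754 is just an example port in the 1025-65535 range:

   ```java
   import java.util.Properties;

   // Pin the embedded timeline server to a fixed port instead of binding to port 0.
   public class TimelineServerPortWorkaround {
     public static void main(String[] args) {
       Properties hoodieProps = new Properties();
       hoodieProps.setProperty("hoodie.embed.timeline.server.port", "26754"); // example port
       // pass these properties into the writer / DeltaStreamer configuration as usual
       System.out.println(hoodieProps);
     }
   }
   ```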



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism

2021-01-11 Thread GitBox


danny0405 commented on pull request #2433:
URL: https://github.com/apache/hudi/pull/2433#issuecomment-758372866


   The file check of each task is useless, because even if a task of the source has no data for some time interval, the checkpoint can still trigger normally. So all tasks checkpointing successfully does not mean there is data. (I have solved this in my #step1 PR though.)

   There is no need to checkpoint the write status in KeyedWriteProcessOperator, because we cannot start a new instant if the last instant fails; a more proper/simple way is to retry the commit action several times and trigger a failover if it still fails.

   BTW, IMO, we should finish RFC-24 first, as fast as possible; it solves many bugs and has many improvements. After that I would add a compatible pipeline, this PR could apply there, and I can help to review.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] jtmzheng commented on issue #2408: [SUPPORT] OutOfMemory on upserting into MOR dataset

2021-01-11 Thread GitBox


jtmzheng commented on issue #2408:
URL: https://github.com/apache/hudi/issues/2408#issuecomment-758360941


   Thanks Udit! I'd tried setting `hoodie.commits.archival.batch` to 5 earlier today after going through the source code - that got my application back up and running again.

   The first bug definitely seems like the root cause; after turning on more verbose logging I found several 300mb commit files being loaded in for archival before the crash (re: the second bug, https://github.com/apache/hudi/blob/e3d3677b7e7899705b624925666317f0c074f7c7/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTimelineArchiveLog.java#L353 clears the list, which isn't the most intuitive). It seems like these large commit files were generated when I set `hoodie.cleaner.commits.retained` to 1.

   What is the trade-off in lowering `hoodie.keep.max.commits` and `hoodie.keep.min.commits`? I couldn't find much good documentation on the archival process/configs.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on a change in pull request #2412: [HUDI-1512] Fix spark 2 unit tests failure with Spark 3

2021-01-11 Thread GitBox


garyli1019 commented on a change in pull request #2412:
URL: https://github.com/apache/hudi/pull/2412#discussion_r555477972



##
File path: pom.xml
##
@@ -1361,6 +1363,7 @@
 
${fasterxml.spark3.version}
 
${fasterxml.spark3.version}
 
${fasterxml.spark3.version}
+true

Review comment:
   Do we need the vice versa as well: should Spark 2 skip the unit tests of Spark 3?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on pull request #2418: [HUDI-1266] Add unit test for validating replacecommit rollback

2021-01-11 Thread GitBox


lw309637554 commented on pull request #2418:
URL: https://github.com/apache/hudi/pull/2418#issuecomment-758337679


   LGTM 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on a change in pull request #2418: [HUDI-1266] Add unit test for validating replacecommit rollback

2021-01-11 Thread GitBox


lw309637554 commented on a change in pull request #2418:
URL: https://github.com/apache/hudi/pull/2418#discussion_r555456066



##
File path: 
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/rollback/HoodieClientRollbackTestBase.java
##
@@ -96,4 +99,61 @@ protected void 
twoUpsertCommitDataWithTwoPartitions(List firstPartiti
   assertEquals(1, secondPartitionCommit2FileSlices.size());
 }
   }
+
+  protected void insertOverwriteCommitDataWithTwoPartitions(List 
firstPartitionCommit2FileSlices,
+List 
secondPartitionCommit2FileSlices,
+HoodieWriteConfig 
cfg,
+boolean 
commitSecondInsertOverwrite) throws IOException {
+//just generate two partitions
+dataGen = new HoodieTestDataGenerator(new 
String[]{DEFAULT_FIRST_PARTITION_PATH, DEFAULT_SECOND_PARTITION_PATH});
+HoodieTestDataGenerator.writePartitionMetadata(fs, new 
String[]{DEFAULT_FIRST_PARTITION_PATH, DEFAULT_SECOND_PARTITION_PATH}, 
basePath);
+SparkRDDWriteClient client = getHoodieWriteClient(cfg);
+/**
+ * Write 1 (upsert)
+ */
+String newCommitTime = "001";
+List records = 
dataGen.generateInsertsContainsAllPartitions(newCommitTime, 2);

Review comment:
   OK, makes sense.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (HUDI-1520) add configure for spark sql overwrite use replace

2021-01-11 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei resolved HUDI-1520.
-
Resolution: Fixed

> add configure for spark sql overwrite use replace
> -
>
> Key: HUDI-1520
> URL: https://issues.apache.org/jira/browse/HUDI-1520
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] yanghua commented on pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism

2021-01-11 Thread GitBox


yanghua commented on pull request #2433:
URL: https://github.com/apache/hudi/pull/2433#issuecomment-758327239


   @danny0405 wdyt about this optimization?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1511) InstantGenerateOperator support multiple parallelism

2021-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1511:
-
Labels: pull-request-available  (was: )

> InstantGenerateOperator support multiple parallelism
> 
>
> Key: HUDI-1511
> URL: https://issues.apache.org/jira/browse/HUDI-1511
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: wangxianghu
>Assignee: loukey_j
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] yanghua commented on pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism

2021-01-11 Thread GitBox


yanghua commented on pull request #2433:
URL: https://github.com/apache/hudi/pull/2433#issuecomment-758321935


   @loukey-lj thanks for your contribution! Can you please:
   
   1) Fix the Travis issue? It's red now;
   2) Update RFC-13 to describe your optimization.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 commented on issue #2408: [SUPPORT] OutOfMemory on upserting into MOR dataset

2021-01-11 Thread GitBox


umehrot2 commented on issue #2408:
URL: https://github.com/apache/hudi/issues/2408#issuecomment-758321313


   For now, I would suggest archiving at smaller intervals. Maybe try out something like:
   - `hoodie.keep.max.commits`: 10
   - `hoodie.keep.min.commits`: 10



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 commented on issue #2408: [SUPPORT] OutOfMemory on upserting into MOR dataset

2021-01-11 Thread GitBox


umehrot2 commented on issue #2408:
URL: https://github.com/apache/hudi/issues/2408#issuecomment-758320870


   I took a deeper look at this. For you, this seems to be happening in the archival code path:
   ```
at 
org.apache.hudi.table.HoodieTimelineArchiveLog.writeToFile(HoodieTimelineArchiveLog.java:309)
at 
org.apache.hudi.table.HoodieTimelineArchiveLog.archive(HoodieTimelineArchiveLog.java:282)
at 
org.apache.hudi.table.HoodieTimelineArchiveLog.archiveIfRequired(HoodieTimelineArchiveLog.java:133)
at 
org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:381)
   ```
   
   This is in `HoodieTimelineArchiveLog`, where it needs to write log files containing commit records, similar to how log files are written for MOR tables. However, in this code I noticed a couple of issues:
   - The default maximum log block size of 256 MB defined [here](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L51) is not used by this class; it is only used when writing MOR log blocks. As a result, there is no real control over the size of the block that can end up being written, which can potentially overflow `ByteArrayOutputStream`, whose maximum size is `Integer.MAX_VALUE - 8`. That is what seems to be happening in this scenario, due to an integer overflow along that code path inside `ByteArrayOutputStream`. So we need to apply the maximum block size concept here as well.
   - In addition, I see a bug in the code [here](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTimelineArchiveLog.java#L302): even after flushing the records to a file once the batch size of 10 (the default) is reached, it does not clear the list and just goes on accumulating records. This is logically wrong as well (duplication), apart from the fact that it keeps increasing the size of the log file blocks being written; see the sketch below.
   
   I will open a jira to track this issue.
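
   A minimal, self-contained sketch of the batching behaviour described in the second bullet (the generic types and the flush callback are hypothetical, not the actual `HoodieTimelineArchiveLog` code); the key point is clearing the buffer after each flush so no block grows unbounded:

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.function.Consumer;

   // Illustrative batched-flush loop: without buffer.clear(), each flush would
   // re-write all previously accumulated records and the block size would keep growing.
   public class BatchedArchiveSketch {
     static <T> void archiveInBatches(List<T> commits, int batchSize, Consumer<List<T>> flushToLogBlock) {
       List<T> buffer = new ArrayList<>();
       for (T commit : commits) {
         buffer.add(commit);
         if (buffer.size() >= batchSize) {
           flushToLogBlock.accept(buffer);
           buffer.clear(); // the step the current code is missing
         }
       }
       if (!buffer.isEmpty()) {
         flushToLogBlock.accept(buffer); // flush the remainder
       }
     }
   }
   ```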



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-1502] MOR rollback and restore support for metadata sync (#2421)

2021-01-11 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new e3d3677  [HUDI-1502] MOR rollback and restore support for metadata 
sync (#2421)
e3d3677 is described below

commit e3d3677b7e7899705b624925666317f0c074f7c7
Author: Sivabalan Narayanan 
AuthorDate: Mon Jan 11 16:23:13 2021 -0500

[HUDI-1502] MOR rollback and restore support for metadata sync (#2421)

- Adds a field to RollbackMetadata that captures the log files written for rollback blocks
- Adds a field to RollbackMetadata that captures new log files written by unsynced deltacommits

Co-authored-by: Vinoth Chandar 
---
 .../org/apache/hudi/io/HoodieAppendHandle.java |   1 +
 .../AbstractMarkerBasedRollbackStrategy.java   |  26 +++--
 .../BaseMergeOnReadRollbackActionExecutor.java |   3 +-
 .../hudi/table/action/rollback/RollbackUtils.java  |   6 +-
 .../rollback/ListingBasedRollbackHelper.java   |  21 +++-
 .../rollback/SparkMarkerBasedRollbackStrategy.java |  14 +++
 .../hudi/metadata/TestHoodieBackedMetadata.java|  43 ---
 .../rollback/TestMarkerBasedRollbackStrategy.java  | 130 +++--
 .../src/main/avro/HoodieRollbackMetadata.avsc  |  16 ++-
 .../org/apache/hudi/common/HoodieRollbackStat.java |  20 +++-
 .../java/org/apache/hudi/common/fs/FSUtils.java|  13 ++-
 .../table/timeline/TimelineMetadataUtils.java  |  28 ++---
 .../hudi/metadata/HoodieTableMetadataUtil.java |  34 --
 .../hudi/common/table/TestTimelineUtils.java   |  27 ++---
 .../table/view/TestIncrementalFSViewSync.java  |   2 +-
 15 files changed, 268 insertions(+), 116 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
index 3faac2e..c6ea7ba 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
@@ -164,6 +164,7 @@ public class HoodieAppendHandle extends
 // Since the actual log file written to can be different based on when 
rollover happens, we use the
 // base file to denote some log appends happened on a slice. 
writeToken will still fence concurrent
 // writers.
+// https://issues.apache.org/jira/browse/HUDI-1517
 createMarkerFile(partitionPath, 
FSUtils.makeDataFileName(baseInstantTime, writeToken, fileId, 
hoodieTable.getBaseFileExtension()));
 
 this.writer = createLogWriter(fileSlice, baseInstantTime);
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/AbstractMarkerBasedRollbackStrategy.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/AbstractMarkerBasedRollbackStrategy.java
index cb6fff9..cc596ba 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/AbstractMarkerBasedRollbackStrategy.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/AbstractMarkerBasedRollbackStrategy.java
@@ -53,9 +53,9 @@ public abstract class AbstractMarkerBasedRollbackStrategy table, 
HoodieEngineContext context, HoodieWriteConfig config, String instantTime) {
 this.table = table;
@@ -90,6 +90,7 @@ public abstract class AbstractMarkerBasedRollbackStrategy writtenLogFileSizeMap = 
getWrittenLogFileSizeMap(partitionPath, baseCommitTime, fileId);
 
 HoodieLogFormat.Writer writer = null;
 try {
@@ -121,17 +122,26 @@ public abstract class 
AbstractMarkerBasedRollbackStrategy filesToNumBlocksRollback = Collections.emptyMap();
-if (config.useFileListingMetadata()) {
-  // When metadata is enabled, the information of files appended to is 
required
-  filesToNumBlocksRollback = Collections.singletonMap(
+// the information of files appended to is required for metadata sync
+Map filesToNumBlocksRollback = Collections.singletonMap(
   
table.getMetaClient().getFs().getFileStatus(Objects.requireNonNull(writer).getLogFile().getPath()),
   1L);
-}
 
 return HoodieRollbackStat.newBuilder()
 .withPartitionPath(partitionPath)
 .withRollbackBlockAppendResults(filesToNumBlocksRollback)
-.build();
+.withWrittenLogFileSizeMap(writtenLogFileSizeMap).build();
+  }
+
+  /**
+   * Returns written log file size map for the respective baseCommitTime to 
assist in metadata table syncing.
+   * @param partitionPath partition path of interest
+   * @param baseCommitTime base commit time of interest
+   * @param fileId fileId of interest
+   * @return Map
+   * @throws IOException
+   */
+  protected Map getWrittenLogFileSizeMap(String 
partitionPath, String 

[GitHub] [hudi] vinothchandar merged pull request #2421: [HUDI-1502] MOR rollback and restore support for metadata sync

2021-01-11 Thread GitBox


vinothchandar merged pull request #2421:
URL: https://github.com/apache/hudi/pull/2421


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2412: [HUDI-1512] Fix spark 2 unit tests failure with Spark 3

2021-01-11 Thread GitBox


codecov-io edited a comment on pull request #2412:
URL: https://github.com/apache/hudi/pull/2412#issuecomment-755726635







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2412: [HUDI-1512] Fix spark 2 unit tests failure with Spark 3

2021-01-11 Thread GitBox


codecov-io edited a comment on pull request #2412:
URL: https://github.com/apache/hudi/pull/2412#issuecomment-755726635


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2412?src=pr=h1) Report
   > Merging 
[#2412](https://codecov.io/gh/apache/hudi/pull/2412?src=pr=desc) (9b9a5c9) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/698694a1571cdcc9848fc79aa34c8cbbf9662bc4?el=desc)
 (698694a) will **decrease** coverage by `40.54%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2412/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2412?src=pr=tree)
   
   ```diff
   @@              Coverage Diff               @@
   ##             master    #2412        +/-   ##
   ==============================================
   - Coverage     50.23%    9.68%     -40.55%
   + Complexity     2985       48       -2937
   ==============================================
     Files           410       53        -357
     Lines         18398     1930      -16468
     Branches       1884      230       -1654
   ==============================================
   - Hits           9242      187       -9055
   + Misses         8398     1730       -6668
   + Partials        758       13        -745
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <ø> (-59.97%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2412?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 

[GitHub] [hudi] codecov-io edited a comment on pull request #2421: [HUDI-1502] MOR rollback and restore support for metadata sync

2021-01-11 Thread GitBox


codecov-io edited a comment on pull request #2421:
URL: https://github.com/apache/hudi/pull/2421#issuecomment-757112911


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2421?src=pr=h1) Report
   > Merging 
[#2421](https://codecov.io/gh/apache/hudi/pull/2421?src=pr=desc) (c2647c3) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/c151147819abb6b4c418138d9b401f374fc021a0?el=desc)
 (c151147) will **decrease** coverage by `0.35%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2421/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2421?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2421      +/-   ##
   ============================================
   - Coverage     10.04%    9.68%     -0.36%
     Complexity       48       48
   ============================================
     Files            52       53         +1
     Lines          1852     1930        +78
     Branches        223      230         +7
   ============================================
   + Hits            186      187         +1
   - Misses         1653     1730        +77
     Partials         13       13
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `9.68% <ø> (-0.36%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2421?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2421/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=)
 | `34.10% <0.00%> (-1.44%)` | `11.00% <0.00%> (ø%)` | |
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2421/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00% <0.00%> (ø%)` | |
   | 
[...org/apache/hudi/utilities/HoodieClusteringJob.java](https://codecov.io/gh/apache/hudi/pull/2421/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZUNsdXN0ZXJpbmdKb2IuamF2YQ==)
 | `0.00% <0.00%> (ø)` | `0.00% <0.00%> (?%)` | |
   | 
[.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2421/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==)
 | `83.62% <0.00%> (+0.14%)` | `28.00% <0.00%> (ø%)` | |
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on issue #2346: [SUPPORT]The rt view query returns a wrong result with predicate push down.

2021-01-11 Thread GitBox


satishkotha commented on issue #2346:
URL: https://github.com/apache/hudi/issues/2346#issuecomment-758169722


   @sumihehe Did you get a chance to look at the above? It would be helpful if you could provide more information.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pratyakshsharma commented on pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas

2021-01-11 Thread GitBox


pratyakshsharma commented on pull request #2424:
URL: https://github.com/apache/hudi/pull/2424#issuecomment-758162231


   @n3nash Just a high-level thought before going through the changes thoroughly: how about keeping the old changes as well and introducing a config, `setDefaultValueWithSchemaEvolutionByDeletingFields`, to support schema evolution in case of deletion of a field? By default we could keep it false to avoid the degradation pointed out by @prashantwason. Thoughts?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pratyakshsharma commented on pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas

2021-01-11 Thread GitBox


pratyakshsharma commented on pull request #2424:
URL: https://github.com/apache/hudi/pull/2424#issuecomment-758154222


   > @n3nash what is the commit being reverted?
   
   
https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1509) Major performance degradation due to rewriting records with default values

2021-01-11 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262865#comment-17262865
 ] 

Pratyaksh Sharma commented on HUDI-1509:


[~nishith29] The PR was a generic one where point #2 mentioned by you was also 
supported. To answer your other question, I did not notice any such 
degradation. 

> Major performance degradation due to rewriting records with default values
> --
>
> Key: HUDI-1509
> URL: https://issues.apache.org/jira/browse/HUDI-1509
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.6.0, 0.6.1, 0.7.0
>Reporter: Prashant Wason
>Assignee: Nishith Agarwal
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> During the in-house testing for the 0.5x to 0.6x release upgrade, I detected
> a performance degradation for writes into HUDI. I have traced the issue to
> the changes in the following commit:
> [[HUDI-727]: Copy default values of fields if not present when rewriting 
> incoming record with new 
> schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253]
> I wrote a unit test to reduce the scope of testing as follows:
> # Take an existing parquet file from production dataset (size=690MB, 
> #records=960K)
> # Read all the records from this parquet into a JavaRDD
> # Time the call HoodieWriteClient.bulkInsertPrepped(). 
> (bulkInsertParallelism=1)
> The above scenario is directly taken from our production pipelines where each 
> executor will ingest about a million record creating a single parquet file in 
> a COW dataset. This is bulk insert only dataset.
> The time to complete the bulk insert prepped *decreased from 680 seconds to
> 380 seconds* when I reverted the above commit.
> Schema details: This HUDI dataset uses a large schema with 51 fields in the 
> record.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] prashantwason commented on a change in pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas

2021-01-11 Thread GitBox


prashantwason commented on a change in pull request #2424:
URL: https://github.com/apache/hudi/pull/2424#discussion_r555258349



##
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##
@@ -292,53 +284,57 @@ public static GenericRecord stitchRecords(GenericRecord 
left, GenericRecord righ
 return result;
   }
 
-  /**
-   * Given a avro record with a given schema, rewrites it into the new schema 
while setting fields only from the old
-   * schema.
-   */
-  public static GenericRecord rewriteRecord(GenericRecord record, Schema 
newSchema) {
-return rewrite(record, getCombinedFieldsToWrite(record.getSchema(), 
newSchema), newSchema);
-  }
-
   /**
* Given a avro record with a given schema, rewrites it into the new schema 
while setting fields only from the new
* schema.
+   * NOTE: Here, the assumption is that you cannot go from an evolved schema 
(schema with (N) fields)
+   * to an older schema (schema with (N-1) fields). All fields present in the 
older record schema MUST be present in the
+   * new schema and the default/existing values are carried over.
+   * This particular method does the following things :
+   * a) Create a new empty GenericRecord with the new schema.
+   * b) Set default values for all fields of this transformed schema in the 
new GenericRecord or copy over the data
+   * from the old schema to the new schema
+   * c) hoodie_metadata_fields have a special treatment. This is done because 
for code generated AVRO classes
+   * (only HoodieMetadataRecord), the avro record is a SpecificBaseRecord type 
instead of a GenericRecord.
+   * SpecificBaseRecord throws null pointer exception for record.get(name) if 
name is not present in the schema of the
+   * record (which happens when converting a SpecificBaseRecord without 
hoodie_metadata_fields to a new record with.
*/
-  public static GenericRecord 
rewriteRecordWithOnlyNewSchemaFields(GenericRecord record, Schema newSchema) {
-return rewrite(record, new LinkedHashSet<>(newSchema.getFields()), 
newSchema);
-  }
-
-  private static GenericRecord rewrite(GenericRecord record, 
LinkedHashSet fieldsToWrite, Schema newSchema) {
+  public static GenericRecord rewriteRecord(GenericRecord oldRecord, Schema 
newSchema) {
 GenericRecord newRecord = new GenericData.Record(newSchema);
-for (Schema.Field f : fieldsToWrite) {
-  if (record.get(f.name()) == null) {
+boolean isSpecificRecord = oldRecord instanceof SpecificRecordBase;
+for (Schema.Field f : newSchema.getFields()) {
+  if (!isMetadataField(f.name()) && oldRecord.get(f.name()) == null) {
+// if not metadata field, set defaults
 if (f.defaultVal() instanceof JsonProperties.Null) {
   newRecord.put(f.name(), null);
 } else {
   newRecord.put(f.name(), f.defaultVal());
 }
-  } else {
-newRecord.put(f.name(), record.get(f.name()));
+  } else if (!isMetadataField(f.name())) {
+// if not metadata field, copy old value
+newRecord.put(f.name(), oldRecord.get(f.name()));
+  } else if (!isSpecificRecord) {
+// if not specific record, copy value for hoodie metadata fields as 
well

Review comment:
   This seems counter-intuitive to the comment on the method.

   If SpecificRecord.get() throws a NullPointerException when the field is not there, won't we want to populate the metadata fields for it?

##
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##
@@ -292,53 +284,57 @@ public static GenericRecord stitchRecords(GenericRecord 
left, GenericRecord righ
 return result;
   }
 
-  /**
-   * Given a avro record with a given schema, rewrites it into the new schema 
while setting fields only from the old
-   * schema.
-   */
-  public static GenericRecord rewriteRecord(GenericRecord record, Schema 
newSchema) {
-return rewrite(record, getCombinedFieldsToWrite(record.getSchema(), 
newSchema), newSchema);
-  }
-
   /**
* Given a avro record with a given schema, rewrites it into the new schema 
while setting fields only from the new
* schema.
+   * NOTE: Here, the assumption is that you cannot go from an evolved schema 
(schema with (N) fields)
+   * to an older schema (schema with (N-1) fields). All fields present in the 
older record schema MUST be present in the
+   * new schema and the default/existing values are carried over.
+   * This particular method does the following things :
+   * a) Create a new empty GenericRecord with the new schema.
+   * b) Set default values for all fields of this transformed schema in the 
new GenericRecord or copy over the data
+   * from the old schema to the new schema
+   * c) hoodie_metadata_fields have a special treatment. This is done because 
for code generated AVRO classes
+   * (only HoodieMetadataRecord), the avro record is a SpecificBaseRecord type 
instead of a GenericRecord.
+   * SpecificBaseRecord throws null 

[GitHub] [hudi] satishkotha commented on a change in pull request #2418: [HUDI-1266] Add unit test for validating replacecommit rollback

2021-01-11 Thread GitBox


satishkotha commented on a change in pull request #2418:
URL: https://github.com/apache/hudi/pull/2418#discussion_r555258959



##
File path: 
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/rollback/HoodieClientRollbackTestBase.java
##
@@ -96,4 +99,61 @@ protected void 
twoUpsertCommitDataWithTwoPartitions(List firstPartiti
   assertEquals(1, secondPartitionCommit2FileSlices.size());
 }
   }
+
+  protected void insertOverwriteCommitDataWithTwoPartitions(List 
firstPartitionCommit2FileSlices,

Review comment:
   Yes, I want to add a test for MOR too, but because of time constraints I just added one for COW. I plan to add the MOR test later on.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on a change in pull request #2418: [HUDI-1266] Add unit test for validating replacecommit rollback

2021-01-11 Thread GitBox


satishkotha commented on a change in pull request #2418:
URL: https://github.com/apache/hudi/pull/2418#discussion_r555258581



##
File path: 
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/rollback/HoodieClientRollbackTestBase.java
##
@@ -96,4 +99,61 @@ protected void 
twoUpsertCommitDataWithTwoPartitions(List firstPartiti
   assertEquals(1, secondPartitionCommit2FileSlices.size());
 }
   }
+
+  protected void insertOverwriteCommitDataWithTwoPartitions(List 
firstPartitionCommit2FileSlices,
+List 
secondPartitionCommit2FileSlices,
+HoodieWriteConfig 
cfg,
+boolean 
commitSecondInsertOverwrite) throws IOException {
+//just generate two partitions
+dataGen = new HoodieTestDataGenerator(new 
String[]{DEFAULT_FIRST_PARTITION_PATH, DEFAULT_SECOND_PARTITION_PATH});
+HoodieTestDataGenerator.writePartitionMetadata(fs, new 
String[]{DEFAULT_FIRST_PARTITION_PATH, DEFAULT_SECOND_PARTITION_PATH}, 
basePath);
+SparkRDDWriteClient client = getHoodieWriteClient(cfg);
+/**
+ * Write 1 (upsert)
+ */
+String newCommitTime = "001";
+List records = 
dataGen.generateInsertsContainsAllPartitions(newCommitTime, 2);

Review comment:
   We can only reuse two lines because some of the generated state (the List) is needed for subsequent operations, so I'm not inclined to add another method for this. Let me know if you have a strong opinion.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (HUDI-1291) integration of replace with consolidated metadata

2021-01-11 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish closed HUDI-1291.

Resolution: Fixed

done as part of HUDI-1276

> integration of replace with consolidated metadata
> -
>
> Key: HUDI-1291
> URL: https://issues.apache.org/jira/browse/HUDI-1291
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Major
> Fix For: 0.7.0
>
>
> Consolidated metadata supports two operations:
> 1) add files
> 2) remove files
> If a file group is replaced, we don't want to immediately remove its files from 
> metadata (to support restore/rollback), so we may have to change consolidated 
> metadata to track additional information for replaced fileIds.
> Today, replaced fileIds are tracked as part of the .replacecommit file. We can 
> continue this for the short term, but it may be more efficient to combine it 
> with consolidated metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] prashantwason commented on a change in pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas

2021-01-11 Thread GitBox


prashantwason commented on a change in pull request #2424:
URL: https://github.com/apache/hudi/pull/2424#discussion_r555221595



##
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##
@@ -292,53 +284,57 @@ public static GenericRecord stitchRecords(GenericRecord 
left, GenericRecord righ
 return result;
   }
 
-  /**
-   * Given a avro record with a given schema, rewrites it into the new schema 
while setting fields only from the old
-   * schema.
-   */
-  public static GenericRecord rewriteRecord(GenericRecord record, Schema 
newSchema) {
-return rewrite(record, getCombinedFieldsToWrite(record.getSchema(), 
newSchema), newSchema);
-  }
-
   /**
* Given a avro record with a given schema, rewrites it into the new schema 
while setting fields only from the new
* schema.
+   * NOTE: Here, the assumption is that you cannot go from an evolved schema 
(schema with (N) fields)
+   * to an older schema (schema with (N-1) fields). All fields present in the 
older record schema MUST be present in the
+   * new schema and the default/existing values are carried over.
+   * This particular method does the following things :
+   * a) Create a new empty GenericRecord with the new schema.
+   * b) Set default values for all fields of this transformed schema in the 
new GenericRecord or copy over the data
+   * from the old schema to the new schema
+   * c) hoodie_metadata_fields have a special treatment. This is done because 
for code generated AVRO classes
+   * (only HoodieMetadataRecord), the avro record is a SpecificBaseRecord type 
instead of a GenericRecord.
+   * SpecificBaseRecord throws null pointer exception for record.get(name) if 
name is not present in the schema of the
+   * record (which happens when converting a SpecificBaseRecord without 
hoodie_metadata_fields to a new record with.
*/
-  public static GenericRecord 
rewriteRecordWithOnlyNewSchemaFields(GenericRecord record, Schema newSchema) {
-return rewrite(record, new LinkedHashSet<>(newSchema.getFields()), 
newSchema);
-  }
-
-  private static GenericRecord rewrite(GenericRecord record, 
LinkedHashSet fieldsToWrite, Schema newSchema) {
+  public static GenericRecord rewriteRecord(GenericRecord oldRecord, Schema 
newSchema) {
 GenericRecord newRecord = new GenericData.Record(newSchema);
-for (Schema.Field f : fieldsToWrite) {
-  if (record.get(f.name()) == null) {
+boolean isSpecificRecord = oldRecord instanceof SpecificRecordBase;
+for (Schema.Field f : newSchema.getFields()) {
+  if (!isMetadataField(f.name()) && oldRecord.get(f.name()) == null) {

Review comment:
   Maybe create a local variable holding the value of oldRecord.get(f.name()), 
as this is a hash lookup within GenericRecord and the value is used in many 
places in this function.
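
   As an illustration of this suggestion (an editorial sketch, not the PR's 
committed code; the method shape, null-schema guard, and default-value handling 
here are assumptions):

   ```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class RewriteSketch {
  // Hoist the oldRecord.get(f.name()) hash lookup into a local variable so it
  // is evaluated once per field instead of at every use site.
  public static GenericRecord rewrite(GenericRecord oldRecord, Schema newSchema) {
    GenericRecord newRecord = new GenericData.Record(newSchema);
    for (Schema.Field f : newSchema.getFields()) {
      Object oldValue = oldRecord.getSchema().getField(f.name()) == null
          ? null
          : oldRecord.get(f.name());              // single lookup, reused below
      if (oldValue != null) {
        newRecord.put(f.name(), oldValue);        // copy over the existing data
      } else if (f.defaultVal() != null) {
        newRecord.put(f.name(), f.defaultVal());  // fall back to the schema default
      }
    }
    return newRecord;
  }
}
   ```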





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason commented on a change in pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas

2021-01-11 Thread GitBox


prashantwason commented on a change in pull request #2424:
URL: https://github.com/apache/hudi/pull/2424#discussion_r555219601



##
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##
@@ -292,53 +284,57 @@ public static GenericRecord stitchRecords(GenericRecord 
left, GenericRecord righ
 return result;
   }
 
-  /**
-   * Given a avro record with a given schema, rewrites it into the new schema 
while setting fields only from the old
-   * schema.
-   */
-  public static GenericRecord rewriteRecord(GenericRecord record, Schema 
newSchema) {
-return rewrite(record, getCombinedFieldsToWrite(record.getSchema(), 
newSchema), newSchema);
-  }
-
   /**
* Given a avro record with a given schema, rewrites it into the new schema 
while setting fields only from the new
* schema.
+   * NOTE: Here, the assumption is that you cannot go from an evolved schema 
(schema with (N) fields)
+   * to an older schema (schema with (N-1) fields). All fields present in the 
older record schema MUST be present in the
+   * new schema and the default/existing values are carried over.
+   * This particular method does the following things :
+   * a) Create a new empty GenericRecord with the new schema.
+   * b) Set default values for all fields of this transformed schema in the 
new GenericRecord or copy over the data

Review comment:
   Better to reword it the other way: copy over the data from the old schema 
or, if absent, set the default value of the field.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zhedoubushishi commented on a change in pull request #2412: [HUDI-1512] Fix spark 2 unit tests failure with Spark 3

2021-01-11 Thread GitBox


zhedoubushishi commented on a change in pull request #2412:
URL: https://github.com/apache/hudi/pull/2412#discussion_r555214109



##
File path: pom.xml
##
@@ -110,9 +110,10 @@
 2.4.4
 3.0.0
 1.8.2
-2.11.12
+2.11.12

Review comment:
   Makes sense to me. For consistency, I converted ```scala-11``` back to 
```scala11```.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #2421: [HUDI-1502] MOR rollback and restore support for metadata sync

2021-01-11 Thread GitBox


vinothchandar commented on pull request #2421:
URL: https://github.com/apache/hudi/pull/2421#issuecomment-758098600


   @nsivabalan pushed some small fixes. Please land once CI passes



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #2421: [HUDI-1502] MOR rollback and restore support for metadata sync

2021-01-11 Thread GitBox


vinothchandar commented on a change in pull request #2421:
URL: https://github.com/apache/hudi/pull/2421#discussion_r555199075



##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##
@@ -262,18 +264,33 @@ private static void 
processRollbackMetadata(HoodieRollbackMetadata rollbackMetad
 partitionToDeletedFiles.get(partition).addAll(deletedFiles);
   }
 
-  if (!pm.getAppendFiles().isEmpty()) {
+  if (hasRollbackLogFiles) {
 if (!partitionToAppendedFiles.containsKey(partition)) {
   partitionToAppendedFiles.put(partition, new HashMap<>());
 }
 
 // Extract appended file name from the absolute paths saved in 
getAppendFiles()
-pm.getAppendFiles().forEach((path, size) -> {
+pm.getRollbackLogFiles().forEach((path, size) -> {
   partitionToAppendedFiles.get(partition).merge(new 
Path(path).getName(), size, (oldSize, newSizeCopy) -> {
 return size + oldSize;

Review comment:
   We should change it here too: pick the largest size instead of summing.
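
   For illustration, a minimal standalone sketch of the suggested change, 
picking the largest reported size in the merge instead of summing (the file 
name below is a made-up placeholder):

   ```java
import java.util.HashMap;
import java.util.Map;

public class LogFileSizeMergeSketch {
  public static void main(String[] args) {
    Map<String, Long> appendedFiles = new HashMap<>();
    // Two rollback entries report sizes for the same log file; picking the
    // largest observed size instead of summing keeps the recorded size correct.
    appendedFiles.merge(".f1_20210111.log.1", 1024L, Math::max);
    appendedFiles.merge(".f1_20210111.log.1", 4096L, Math::max);
    System.out.println(appendedFiles); // {.f1_20210111.log.1=4096}
  }
}
   ```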

##
File path: hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
##
@@ -424,10 +425,29 @@ public static boolean isLogFile(Path logPath) {
*/
   public static Stream getAllLogFiles(FileSystem fs, Path 
partitionPath, final String fileId,
   final String logFileExtension, final String baseCommitTime) throws 
IOException {
-return Arrays
-.stream(fs.listStatus(partitionPath,
-path -> path.getName().startsWith("." + fileId) && 
path.getName().contains(logFileExtension)))
-.map(HoodieLogFile::new).filter(s -> 
s.getBaseCommitTime().equals(baseCommitTime));
+try {
+  return Arrays
+  .stream(fs.listStatus(partitionPath,
+  path -> path.getName().startsWith("." + fileId) && 
path.getName().contains(logFileExtension)))
+  .map(HoodieLogFile::new).filter(s -> 
s.getBaseCommitTime().equals(baseCommitTime));
+} catch (FileNotFoundException e) {
+  return Stream.builder().build();
+}
+  }
+
+  /**
+   * Get all the log files for the passed in FileId in the partition path.
+   */
+  public static Stream getAllLogFiles(FileSystem fs, Path 
partitionPath,

Review comment:
   Is this used? We need to check.

##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/rollback/ListingBasedRollbackHelper.java
##
@@ -116,6 +118,12 @@ public ListingBasedRollbackHelper(HoodieTableMetaClient 
metaClient, HoodieWriteC
   .withDeletedFileResults(filesToDeletedStatus).build());
 }
 case APPEND_ROLLBACK_BLOCK: {
+  // collect all log files that is supposed to be deleted with this 
rollback
+  Map writtenLogFileSizeMap = 
FSUtils.getAllLogFiles(metaClient.getFs(),

Review comment:
   It's done below anyway, so all good.

##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/rollback/ListingBasedRollbackHelper.java
##
@@ -116,6 +118,12 @@ public ListingBasedRollbackHelper(HoodieTableMetaClient 
metaClient, HoodieWriteC
   .withDeletedFileResults(filesToDeletedStatus).build());
 }
 case APPEND_ROLLBACK_BLOCK: {
+  // collect all log files that is supposed to be deleted with this 
rollback
+  Map writtenLogFileSizeMap = 
FSUtils.getAllLogFiles(metaClient.getFs(),

Review comment:
   I think we should guard the option variables here, 
`rollbackRequest.getFileId()` and `rollbackRequest.getLatestBaseInstant()`, 
with an `isPresent()` check.
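
   A minimal sketch of the requested guard, using `java.util.Optional` purely 
for illustration (the concrete Hudi types and request fields may differ):

   ```java
import java.util.Optional;

public class OptionGuardSketch {
  // Illustrates the suggested isPresent() guard. The java.util.Optional types
  // and the printed messages are assumptions of this sketch, not Hudi's API.
  static void collectLogFiles(Optional<String> fileId, Optional<String> latestBaseInstant) {
    if (fileId.isPresent() && latestBaseInstant.isPresent()) {
      System.out.println("listing log files for fileId=" + fileId.get()
          + ", baseInstant=" + latestBaseInstant.get());
    } else {
      // nothing to list for this rollback request; skip instead of risking an NPE
      System.out.println("skipping request without fileId/baseInstant");
    }
  }

  public static void main(String[] args) {
    collectLogFiles(Optional.of("file-001"), Optional.of("20210111192901"));
    collectLogFiles(Optional.empty(), Optional.of("20210111192901"));
  }
}
   ```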





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar merged pull request #2428: [HUDI-1520] add configure for spark sql overwrite use INSERT_OVERWRIT_TABLE

2021-01-11 Thread GitBox


vinothchandar merged pull request #2428:
URL: https://github.com/apache/hudi/pull/2428


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-1520] add configure for spark sql overwrite use INSERT_OVERWRITE_TABLE (#2428)

2021-01-11 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new de42adc  [HUDI-1520] add configure for spark sql overwrite use 
INSERT_OVERWRITE_TABLE (#2428)
de42adc is described below

commit de42adc2302528541145a714078cc6d10cbb8d9a
Author: lw0090 
AuthorDate: Tue Jan 12 01:07:47 2021 +0800

[HUDI-1520] add configure for spark sql overwrite use 
INSERT_OVERWRITE_TABLE (#2428)
---
 .../main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala   | 13 ++---
 .../org/apache/hudi/functional/TestCOWDataSource.scala  |  3 ++-
 .../org/apache/hudi/functional/TestMORDataSource.scala  |  1 -
 3 files changed, 8 insertions(+), 9 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index 472f450..4e9caa5 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -94,13 +94,6 @@ private[hudi] object HoodieSparkSqlWriter {
   operation = WriteOperationType.INSERT
 }
 
-// If the mode is Overwrite, can set operation to INSERT_OVERWRITE_TABLE.
-// Then in DataSourceUtils.doWriteOperation will use 
client.insertOverwriteTable to overwrite
-// the table. This will replace the old fs.delete(tablepath) mode.
-if (mode == SaveMode.Overwrite && operation !=  
WriteOperationType.INSERT_OVERWRITE_TABLE) {
-  operation = WriteOperationType.INSERT_OVERWRITE_TABLE
-}
-
 val jsc = new JavaSparkContext(sparkContext)
 val basePath = new Path(path.get)
 val instantTime = HoodieActiveTimeline.createNewInstantTime()
@@ -340,6 +333,12 @@ private[hudi] object HoodieSparkSqlWriter {
 if (operation != WriteOperationType.DELETE) {
   if (mode == SaveMode.ErrorIfExists && tableExists) {
 throw new HoodieException(s"hoodie table at $tablePath already 
exists.")
+  } else if (mode == SaveMode.Overwrite && tableExists && operation !=  
WriteOperationType.INSERT_OVERWRITE_TABLE) {
+// When user set operation as INSERT_OVERWRITE_TABLE,
+// overwrite will use INSERT_OVERWRITE_TABLE operator in 
doWriteOperation
+log.warn(s"hoodie table at $tablePath already exists. Deleting 
existing data & overwriting with new data.")
+fs.delete(tablePath, true)
+tableExists = false
   }
 } else {
   // Delete Operation only supports Append mode
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
index 730b7d2..b15a7d4 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
@@ -202,7 +202,7 @@ class TestCOWDataSource extends HoodieClientTestBase {
 val inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2, 2))
 inputDF2.write.format("org.apache.hudi")
   .options(commonOpts)
-  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
DataSourceWriteOptions.INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL)
   .mode(SaveMode.Overwrite)
   .save(basePath)
 
@@ -229,6 +229,7 @@ class TestCOWDataSource extends HoodieClientTestBase {
 val inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2, 2))
 inputDF2.write.format("org.apache.hudi")
   .options(commonOpts)
+  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
DataSourceWriteOptions.INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL)
   .mode(SaveMode.Overwrite)
   .save(basePath)
 
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
index 121957e..1ea6ceb 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
@@ -278,7 +278,6 @@ class TestMORDataSource extends HoodieClientTestBase {
 val inputDF5: Dataset[Row] = 
spark.read.json(spark.sparkContext.parallelize(records5, 2))
 inputDF5.write.format("org.apache.hudi")
   .options(commonOpts)
-  .option("hoodie.compact.inline", "true")
   .mode(SaveMode.Append)
   .save(basePath)
 val commit5Time = 

[GitHub] [hudi] vinothchandar commented on a change in pull request #2428: [HUDI-1520] add configure for spark sql overwrite use INSERT_OVERWRIT_TABLE

2021-01-11 Thread GitBox


vinothchandar commented on a change in pull request #2428:
URL: https://github.com/apache/hudi/pull/2428#discussion_r555203377



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##
@@ -340,6 +333,12 @@ private[hudi] object HoodieSparkSqlWriter {
 if (operation != WriteOperationType.DELETE) {
   if (mode == SaveMode.ErrorIfExists && tableExists) {
 throw new HoodieException(s"hoodie table at $tablePath already 
exists.")
+  } else if (mode == SaveMode.Overwrite && tableExists && operation !=  
WriteOperationType.INSERT_OVERWRITE_TABLE) {

Review comment:
   This is in line with 
https://github.com/apache/hudi/blob/b6d363c0d4a6906ef1c9ea54c0263c6b5915e0b6/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L301
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #2428: [HUDI-1520] add configure for spark sql overwrite use INSERT_OVERWRIT_TABLE

2021-01-11 Thread GitBox


vinothchandar commented on a change in pull request #2428:
URL: https://github.com/apache/hudi/pull/2428#discussion_r555202292



##
File path: 
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
##
@@ -278,7 +278,6 @@ class TestMORDataSource extends HoodieClientTestBase {
 val inputDF5: Dataset[Row] = 
spark.read.json(spark.sparkContext.parallelize(records5, 2))
 inputDF5.write.format("org.apache.hudi")
   .options(commonOpts)
-  .option("hoodie.compact.inline", "true")

Review comment:
   This is great. Thanks for digging in, both!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #2428: [HUDI-1520] add configure for spark sql overwrite use INSERT_OVERWRIT_TABLE

2021-01-11 Thread GitBox


vinothchandar commented on a change in pull request #2428:
URL: https://github.com/apache/hudi/pull/2428#discussion_r555202292



##
File path: 
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
##
@@ -278,7 +278,6 @@ class TestMORDataSource extends HoodieClientTestBase {
 val inputDF5: Dataset[Row] = 
spark.read.json(spark.sparkContext.parallelize(records5, 2))
 inputDF5.write.format("org.apache.hudi")
   .options(commonOpts)
-  .option("hoodie.compact.inline", "true")

Review comment:
   This is great. Thanks to you both for digging in, @lw309637554!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pranotishanbhag removed a comment on issue #2414: [SUPPORT]

2021-01-11 Thread GitBox


pranotishanbhag removed a comment on issue #2414:
URL: https://github.com/apache/hudi/issues/2414#issuecomment-758042202


   Hi,
   
   I tried copy_on_write with insert mode for a 4.6 TB dataset, which is failing 
with lost nodes (I previously tried bulk_insert, which worked fine). I tried to 
tweak the executor memory and also changed the config 
"hoodie.copyonwrite.insert.split.size" to 100k. It still failed. Later I tried 
to add the config "hoodie.insert.shuffle.parallelism" and set it to a higher 
value, and I still see failures. Can you please tell me what is wrong here?
   
   Thanks,
   Pranoti



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pranotishanbhag commented on issue #2414: [SUPPORT]

2021-01-11 Thread GitBox


pranotishanbhag commented on issue #2414:
URL: https://github.com/apache/hudi/issues/2414#issuecomment-758042202


   Hi,
   
   I tried copy_on_write with insert mode for a 4.6 TB dataset, which is failing 
with lost nodes (I previously tried bulk_insert, which worked fine). I tried to 
tweak the executor memory and also changed the config 
"hoodie.copyonwrite.insert.split.size" to 100k. It still failed. Later I tried 
to add the config "hoodie.insert.shuffle.parallelism" and set it to a higher 
value, and I still see failures. Can you please tell me what is wrong here?
   
   Thanks,
   Pranoti
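
   For context, a minimal sketch of how the two configs mentioned above can be 
passed on a Spark datasource write; the paths, table name, and parallelism 
value are placeholders, not the reporter's actual job:

   ```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class InsertTuningSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("insert-tuning-sketch")
        .master("local[2]")
        .getOrCreate();

    // Placeholder input; the reported job reads a much larger dataset.
    Dataset<Row> df = spark.read().parquet("file:///tmp/source_data");

    df.write().format("hudi")
        .option("hoodie.table.name", "example_table")
        .option("hoodie.datasource.write.operation", "insert")
        .option("hoodie.copyonwrite.insert.split.size", "100000")  // records per inserted file group
        .option("hoodie.insert.shuffle.parallelism", "1500")       // placeholder parallelism
        .mode(SaveMode.Append)
        .save("file:///tmp/hudi_example_table");

    spark.stop();
  }
}
   ```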



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] loukey-lj opened a new pull request #2433: Hudi 1511

2021-01-11 Thread GitBox


loukey-lj opened a new pull request #2433:
URL: https://github.com/apache/hudi/pull/2433


   InstantGenerateOperator supports multiple parallelism.
   When the InstantGenerateOperator subtask count is greater than 1, we can 
designate subtask 0 as the main subtask; only the main subtask creates a new 
instant.
   The prerequisite for creating a new instant is that at least one subtask 
received data in the current checkpoint. Every subtask will create a tmp file 
whose name is made up of the checkpoint id, the subtask index, and the number 
of records received.
   The main subtask will check and parse every subtask's file to decide whether 
a new instant should be created.
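
   A hypothetical sketch of the decision described above; the 
`<checkpointId>_<subtaskIndex>_<recordCount>` file-name format and the helper 
shown here are illustrative assumptions, not the PR's exact code:

   ```java
import java.util.Arrays;
import java.util.List;

public class InstantDecisionSketch {
  // Parses marker file names of the form "<checkpointId>_<subtaskIndex>_<recordCount>"
  // and reports whether any subtask received data in the checkpoint.
  static boolean shouldCreateInstant(List<String> markerFileNames) {
    for (String name : markerFileNames) {
      String[] parts = name.split("_");
      long recordCount = Long.parseLong(parts[2]);
      if (recordCount > 0) {
        return true; // at least one subtask saw data, so a new instant is needed
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(shouldCreateInstant(Arrays.asList("15_0_0", "15_1_42"))); // true
    System.out.println(shouldCreateInstant(Arrays.asList("16_0_0", "16_1_0")));  // false
  }
}
   ```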



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on a change in pull request #2378: [HUDI-1491] Support partition pruning for MOR snapshot query

2021-01-11 Thread GitBox


garyli1019 commented on a change in pull request #2378:
URL: https://github.com/apache/hudi/pull/2378#discussion_r555064333



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/MergeOnReadSnapshotRelation.scala
##
@@ -108,7 +111,7 @@ class MergeOnReadSnapshotRelation(val sqlContext: 
SQLContext,
   dataSchema = tableStructSchema,
   partitionSchema = StructType(Nil),

Review comment:
   Hi @yui2010, what does your dataset look like? Does it have a `dt` column in 
the dataset? The partitioning I am referring to is this: when you 
`spark.read.format('hudi').load(basePath)` and your dataset folder structure 
looks like `basePath/dt=20201010`, Spark is able to append a `dt` column to 
your dataset. When you then do something like `df.filter(dt=20201010)`, Spark 
will go to that partition and read only its files. How does your workflow load 
the data and pass the partition information to Spark?
   In order to get more information about this implementation, would you write 
a test to demo the partition pruning?
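
   For readers following the thread, a minimal sketch of the partition-derived 
`dt` column being described; the base path, glob, and partition value are 
placeholder assumptions:

   ```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionPruningSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("partition-pruning-sketch")
        .master("local[2]")
        .getOrCreate();

    // Placeholder base path laid out as basePath/dt=20201010/..., so Spark can
    // derive a `dt` partition column from the folder structure.
    String basePath = "file:///tmp/hudi_example_table";

    Dataset<Row> df = spark.read().format("hudi").load(basePath + "/*");

    // With the partition column present, this filter lets Spark prune down to
    // the matching dt=... folder instead of scanning every partition.
    df.filter("dt = '20201010'").show(false);

    spark.stop();
  }
}
   ```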





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yui2010 commented on a change in pull request #2427: [HUDI-1519] Improve minKey/maxKey compute in HoodieHFileWriter

2021-01-11 Thread GitBox


yui2010 commented on a change in pull request #2427:
URL: https://github.com/apache/hudi/pull/2427#discussion_r555027450



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileWriter.java
##
@@ -121,17 +121,10 @@ public void writeAvro(String recordKey, IndexedRecord 
object) throws IOException
 
 if (hfileConfig.useBloomFilter()) {
   hfileConfig.getBloomFilter().add(recordKey);
-  if (minRecordKey != null) {
-minRecordKey = minRecordKey.compareTo(recordKey) <= 0 ? minRecordKey : 
recordKey;
-  } else {
+  if (minRecordKey == null) {

Review comment:
   Hi @vinothchandar, thanks for reviewing. It is not computing the min/max; it 
only uses the first recordKey and the last recordKey as min/max 
(HoodieSortedMergeHandle/BaseSparkCommitActionExecutor already order the input 
records by recordKey). It's similar to how HBase stores the keyRange 
(firstKey/lastKey):
   
[https://github.com/apache/hbase/blob/rel/1.2.3/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L292](https://github.com/apache/hbase/blob/rel/1.2.3/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L292)
   Actually, we can also get the min/max recordKey natively from the HFile 
through `HFileReaderV3#getFirstKey()` and `HFileReaderV3#getLastRowKey()` in 
the load-on-open section.
   
   I also think we can keep the current implementation (put min/max in the 
FileInfo map); maybe we will add more properties later, for example a 
recordCount so we can choose seekTo or loadAll in 
`HoodieHFileReader#filterRowKeys`.
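
   As a standalone illustration of the first/last-key bookkeeping described 
above (assuming sorted input, as the comment states; this is not the writer's 
actual code):

   ```java
public class FirstLastKeySketch {
  private String minRecordKey;
  private String maxRecordKey;

  // Record keys are assumed to arrive already sorted, so the first key seen is
  // the min and the last key seen is the max; no per-record comparison needed.
  void onRecordKey(String recordKey) {
    if (minRecordKey == null) {
      minRecordKey = recordKey;   // first key of sorted input == min
    }
    maxRecordKey = recordKey;     // last key written so far == max
  }

  public static void main(String[] args) {
    FirstLastKeySketch sketch = new FirstLastKeySketch();
    for (String key : new String[]{"key001", "key002", "key003"}) {
      sketch.onRecordKey(key);
    }
    System.out.println(sketch.minRecordKey + " .. " + sketch.maxRecordKey);
  }
}
   ```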






This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yui2010 commented on a change in pull request #2427: [HUDI-1519] Improve minKey/maxKey compute in HoodieHFileWriter

2021-01-11 Thread GitBox


yui2010 commented on a change in pull request #2427:
URL: https://github.com/apache/hudi/pull/2427#discussion_r555027450



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileWriter.java
##
@@ -121,17 +121,10 @@ public void writeAvro(String recordKey, IndexedRecord 
object) throws IOException
 
 if (hfileConfig.useBloomFilter()) {
   hfileConfig.getBloomFilter().add(recordKey);
-  if (minRecordKey != null) {
-minRecordKey = minRecordKey.compareTo(recordKey) <= 0 ? minRecordKey : 
recordKey;
-  } else {
+  if (minRecordKey == null) {

Review comment:
   Hi @vinothchandar, thanks for reviewing. It is not computing the min/max; it 
only uses the first recordKey and the last recordKey as min/max 
(HoodieSortedMergeHandle/BaseSparkCommitActionExecutor already order the input 
records by recordKey). It's similar to how HBase stores the keyRange 
(firstKey/lastKey):
   
[https://github.com/apache/hbase/blob/rel/1.2.3/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L292](https://github.com/apache/hbase/blob/rel/1.2.3/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L292)
   Actually, we can also get the min/max recordKey natively from the HFile 
through `HFileReaderV3#getFirstKey()` and `HFileReaderV3#getLastRowKey()` in 
the load-on-open section.
   
   I also think we can keep the current implementation (put min/max in the 
FileInfo map); maybe we will add more properties later, for example a 
recordCount so we can choose seekTo or loadAll in 
`HoodieHFileReader#filterRowKeys`.






This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io commented on pull request #2431: [HUDI-2431]translate the api partitionBy to hoodie.datasource.write.partitionpath.field

2021-01-11 Thread GitBox


codecov-io commented on pull request #2431:
URL: https://github.com/apache/hudi/pull/2431#issuecomment-757929313


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2431?src=pr=h1) Report
   > Merging 
[#2431](https://codecov.io/gh/apache/hudi/pull/2431?src=pr=desc) (fa597aa) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/7ce3ac778eb475bf23ffa31243dc0843ec7d089a?el=desc)
 (7ce3ac7) will **decrease** coverage by `41.08%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2431/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2431?src=pr=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2431       +/-   ##
   =============================================
   - Coverage     50.76%    9.68%    -41.09%
   + Complexity     3063       48      -3015
   =============================================
     Files           419       53       -366
     Lines         18777     1930     -16847
     Branches       1918      230      -1688
   =============================================
   - Hits           9533      187      -9346
   + Misses         8468     1730      -6738
   + Partials        776       13       -763
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <ø> (-59.75%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2431?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 

[GitHub] [hudi] quitozang opened a new issue #2432: [SUPPORT] write hudi data failed when using Deltastreamer

2021-01-11 Thread GitBox


quitozang opened a new issue #2432:
URL: https://github.com/apache/hudi/issues/2432


   When I write Hudi data using DeltaStreamer, I sometimes get the error 
below.
   
   
   **Environment Description**
   
   * Hudi version : 0.6.0
   
   * Spark version : 2.4.4
   
   * Hive version : 2.1.1
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   2021-01-11 19:29:01,966 [pool-19-thread-1] INFO  io.javalin.Javalin  - 
Starting Javalin ...
   2021-01-11 19:29:01,975 [pool-19-thread-1] ERROR io.javalin.Javalin  - 
Failed to start Javalin
   2021-01-11 19:29:01,975 [pool-19-thread-1] ERROR 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  - Shutting down 
delta-sync due to exception
   java.lang.RuntimeException: Port already in use. Make sure no other process 
is using port 0 and try again.
   at io.javalin.Javalin.start(Javalin.java:157)
   at io.javalin.Javalin.start(Javalin.java:119)
   at 
org.apache.hudi.timeline.service.TimelineService.startService(TimelineService.java:106)
   at 
org.apache.hudi.client.embedded.EmbeddedTimelineService.startServer(EmbeddedTimelineService.java:74)
   at 
org.apache.hudi.client.AbstractHoodieClient.startEmbeddedServerView(AbstractHoodieClient.java:105)
   at 
org.apache.hudi.client.AbstractHoodieClient.(AbstractHoodieClient.java:72)
   at 
org.apache.hudi.client.AbstractHoodieWriteClient.(AbstractHoodieWriteClient.java:81)
   at 
org.apache.hudi.client.HoodieWriteClient.(HoodieWriteClient.java:121)
   at 
org.apache.hudi.client.HoodieWriteClient.(HoodieWriteClient.java:108)
   at 
org.apache.hudi.client.HoodieWriteClient.(HoodieWriteClient.java:104)
   at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.setupWriteClientAddColumn(DeltaSync.java:673)
   at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:309)
   at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:609)
   at 
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
   at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   at java.base/java.lang.Thread.run(Thread.java:834)
   Caused by: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:0
   at 
org.apache.hudi.org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:346)
   at 
org.apache.hudi.org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:308)
   at 
org.apache.hudi.org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
   at 
org.apache.hudi.org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:236)
   at 
org.apache.hudi.org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
   at 
org.apache.hudi.org.eclipse.jetty.server.Server.doStart(Server.java:394)
   at 
org.apache.hudi.org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
   at 
io.javalin.core.util.JettyServerUtil.initialize(JettyServerUtil.kt:110)
   at io.javalin.Javalin.start(Javalin.java:140)
   ... 16 more
   Caused by: java.net.BindException: Address already in use
   at java.base/sun.nio.ch.Net.bind0(Native Method)
   at java.base/sun.nio.ch.Net.bind(Net.java:461)
   at java.base/sun.nio.ch.Net.bind(Net.java:453)
   at 
java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
   at 
java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
   at 
org.apache.hudi.org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:342)
   ... 24 more
   2021-01-11 19:29:01,976 [pool-19-thread-1] INFO  
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  - Delta Sync 
shutdown. Error ?true
   2021-01-11 19:29:01,978 [Monitor Thread] ERROR 
org.apache.hudi.async.AbstractAsyncService  - Monitor noticed one or more 
threads failed. Requesting graceful shutdown of other threads
   java.util.concurrent.ExecutionException: 
org.apache.hudi.exception.HoodieException: java.lang.RuntimeException: Port 
already in use. Make sure no other process is using port 0 and try again.
   at io.javalin.Javalin.start(Javalin.java:157)
   at io.javalin.Javalin.start(Javalin.java:119)
   at 
org.apache.hudi.timeline.service.TimelineService.startService(TimelineService.java:106)
   at 
org.apache.hudi.client.embedded.EmbeddedTimelineService.startServer(EmbeddedTimelineService.java:74)
   
   
   
   



[GitHub] [hudi] teeyog opened a new pull request #2431: translate the api partitionBy to `hoodie.dat…

2021-01-11 Thread GitBox


teeyog opened a new pull request #2431:
URL: https://github.com/apache/hudi/pull/2431


   …asource.write.partitionpath.field`
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas

2021-01-11 Thread GitBox


codecov-io edited a comment on pull request #2424:
URL: https://github.com/apache/hudi/pull/2424#issuecomment-757403445


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2424?src=pr=h1) Report
   > Merging 
[#2424](https://codecov.io/gh/apache/hudi/pull/2424?src=pr=desc) (37126a3) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/368c1a8f5c36d06ed49706b4afde4a83073a9011?el=desc)
 (368c1a8) will **decrease** coverage by `40.84%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2424/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2424?src=pr=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2424       +/-   ##
   =============================================
   - Coverage     50.53%    9.68%    -40.85%
   + Complexity     3032       48      -2984
   =============================================
     Files           417       53       -364
     Lines         18727     1930     -16797
     Branches       1917      230      -1687
   =============================================
   - Hits           9463      187      -9276
   + Misses         8489     1730      -6759
   + Partials        775       13       -762
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <ø> (-59.73%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2424?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | `0.00% <0.00%> 

[GitHub] [hudi] codecov-io edited a comment on pull request #2430: [HUDI-1522] Remove the single parallelism operator from the Flink writer

2021-01-11 Thread GitBox


codecov-io edited a comment on pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#issuecomment-757736411


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=h1) Report
   > Merging 
[#2430](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=desc) (7961488) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/7ce3ac778eb475bf23ffa31243dc0843ec7d089a?el=desc)
 (7ce3ac7) will **increase** coverage by `0.81%`.
   > The diff coverage is `59.79%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2430/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master     #2430      +/-   ##
   ============================================
   + Coverage     50.76%    51.58%    +0.81%
   - Complexity     3063      3109       +46
   ============================================
     Files           419       418        -1
     Lines         18777     18975      +198
     Branches       1918      1931       +13
   ============================================
   + Hits           9533      9789      +256
   + Misses         8468      8389       -79
   - Partials        776       797       +21
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.26% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `52.05% <33.33%> (-0.03%)` | `0.00 <2.00> (ø)` | |
   | hudiflink | `53.98% <59.95%> (+43.78%)` | `0.00 <56.00> (ø)` | |
   | hudihadoopmr | `33.06% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `66.07% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisync | `48.61% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `66.84% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.48% <ø> (+0.05%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvYXZyby9Ib29kaWVBdnJvVXRpbHMuamF2YQ==)
 | `57.07% <ø> (-0.21%)` | `41.00 <0.00> (ø)` | |
   | 
[...main/java/org/apache/hudi/HoodieFlinkStreamer.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9Ib29kaWVGbGlua1N0cmVhbWVyLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[.../org/apache/hudi/operator/StreamWriteOperator.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZU9wZXJhdG9yLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...ache/hudi/operator/StreamWriteOperatorFactory.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZU9wZXJhdG9yRmFjdG9yeS5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...i/common/model/OverwriteWithLatestAvroPayload.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL092ZXJ3cml0ZVdpdGhMYXRlc3RBdnJvUGF5bG9hZC5qYXZh)
 | `60.00% <33.33%> (-4.71%)` | `10.00 <2.00> (ø)` | |
   | 
[...c/main/java/org/apache/hudi/util/StreamerUtil.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS91dGlsL1N0cmVhbWVyVXRpbC5qYXZh)
 | `35.89% <47.36%> (+24.26%)` | `9.00 <8.00> (+6.00)` | |
   | 
[.../hudi/operator/StreamWriteOperatorCoordinator.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZU9wZXJhdG9yQ29vcmRpbmF0b3IuamF2YQ==)
 | `63.52% <63.52%> (ø)` | `25.00 <25.00> (?)` | |
   | 
[.../org/apache/hudi/operator/StreamWriteFunction.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZUZ1bmN0aW9uLmphdmE=)
 | `68.67% <68.67%> (ø)` | `14.00 <14.00> (?)` | |
   | 
[...n/java/org/apache/hudi/operator/HoodieOptions.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9Ib29kaWVPcHRpb25zLmphdmE=)
 | `74.35% <74.35%> (ø)` | `3.00 <3.00> (?)` | |
   | 

[GitHub] [hudi] codecov-io edited a comment on pull request #2430: [HUDI-1522] Remove the single parallelism operator from the Flink writer

2021-01-11 Thread GitBox


codecov-io edited a comment on pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#issuecomment-757736411


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=h1) Report
   > Merging 
[#2430](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=desc) (7961488) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/7ce3ac778eb475bf23ffa31243dc0843ec7d089a?el=desc)
 (7ce3ac7) will **decrease** coverage by `6.54%`.
   > The diff coverage is `59.79%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2430/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master     #2430      +/-   ##
   ============================================
   - Coverage     50.76%    44.22%    -6.55%
   + Complexity     3063      2265      -798
   ============================================
     Files           419       329       -90
     Lines         18777     14395     -4382
     Branches       1918      1379      -539
   ============================================
   - Hits           9533      6366     -3167
   + Misses         8468      7559      -909
   + Partials        776       470      -306
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.26% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `52.05% <33.33%> (-0.03%)` | `0.00 <2.00> (ø)` | |
   | hudiflink | `53.98% <59.95%> (+43.78%)` | `0.00 <56.00> (ø)` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <ø> (-59.75%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvYXZyby9Ib29kaWVBdnJvVXRpbHMuamF2YQ==)
 | `57.07% <ø> (-0.21%)` | `41.00 <0.00> (ø)` | |
   | 
[...main/java/org/apache/hudi/HoodieFlinkStreamer.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9Ib29kaWVGbGlua1N0cmVhbWVyLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[.../org/apache/hudi/operator/StreamWriteOperator.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZU9wZXJhdG9yLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...ache/hudi/operator/StreamWriteOperatorFactory.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZU9wZXJhdG9yRmFjdG9yeS5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...i/common/model/OverwriteWithLatestAvroPayload.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL092ZXJ3cml0ZVdpdGhMYXRlc3RBdnJvUGF5bG9hZC5qYXZh)
 | `60.00% <33.33%> (-4.71%)` | `10.00 <2.00> (ø)` | |
   | 
[...c/main/java/org/apache/hudi/util/StreamerUtil.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS91dGlsL1N0cmVhbWVyVXRpbC5qYXZh)
 | `35.89% <47.36%> (+24.26%)` | `9.00 <8.00> (+6.00)` | |
   | 
[.../hudi/operator/StreamWriteOperatorCoordinator.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZU9wZXJhdG9yQ29vcmRpbmF0b3IuamF2YQ==)
 | `63.52% <63.52%> (ø)` | `25.00 <25.00> (?)` | |
   | 
[.../org/apache/hudi/operator/StreamWriteFunction.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZUZ1bmN0aW9uLmphdmE=)
 | `68.67% <68.67%> (ø)` | `14.00 <14.00> (?)` | |
   | 
[...n/java/org/apache/hudi/operator/HoodieOptions.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9Ib29kaWVPcHRpb25zLmphdmE=)
 | `74.35% <74.35%> (ø)` | `3.00 <3.00> (?)` | |
   | 
[...he/hudi/operator/event/BatchWriteSuccessEvent.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9ldmVudC9CYXRjaFdyaXRlU3VjY2Vzc0V2ZW50LmphdmE=)
 | `100.00% <100.00%> (ø)` | `4.00 <4.00> (?)` | |
   | ... and 

[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Remove the single parallelism operator from the Flink writer

2021-01-11 Thread GitBox


danny0405 commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r554904669



##
File path: hudi-flink/src/main/java/org/apache/hudi/HoodieFlinkStreamer.java
##
@@ -160,6 +156,19 @@ public static void main(String[] args) throws Exception {
 + "(using the CLI parameter \"--props\") can also be passed command 
line using this parameter.")
 public List configs = new ArrayList<>();
 
+@Parameter(names = {"--record-key-field"}, description = "Record key 
field. Value to be used as the `recordKey` component of `HoodieKey`.\n"
++ "Actual value will be obtained by invoking .toString() on the field 
value. Nested fields can be specified using "
++ "the dot notation eg: `a.b.c`. By default `uuid`.")
+public String recordKeyField = "uuid";
+
+@Parameter(names = {"--partition-path-field"}, description = "Partition 
path field. Value to be used at \n"
++ "the `partitionPath` component of `HoodieKey`. Actual value obtained 
by invoking .toString(). By default `partitionpath`.")
+public String partitionPathField = "partitionpath";
+
+@Parameter(names = {"--partition-path-field"}, description = "Key 
generator class, that implements will extract the key out of incoming record.\n"

Review comment:
   Oops, thanks for the reminder.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io commented on pull request #2430: [HUDI-1522] Remove the single parallelism operator from the Flink writer

2021-01-11 Thread GitBox


codecov-io commented on pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#issuecomment-757736411


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=h1) Report
   > Merging 
[#2430](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=desc) (7961488) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/7ce3ac778eb475bf23ffa31243dc0843ec7d089a?el=desc)
 (7ce3ac7) will **decrease** coverage by `41.08%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2430/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2430       +/-   ##
   =============================================
   - Coverage     50.76%    9.68%    -41.09%
   + Complexity     3063       48      -3015
   =============================================
     Files           419       53       -366
     Lines         18777     1930     -16847
     Branches       1918      230      -1688
   =============================================
   - Hits           9533      187      -9346
   + Misses         8468     1730      -6738
   + Partials        776       13       -763
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <ø> (-59.75%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | `0.00% <0.00%>