[GitHub] [hudi] n3nash edited a comment on pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas
n3nash edited a comment on pull request #2424: URL: https://github.com/apache/hudi/pull/2424#issuecomment-758475039 @pratyakshsharma in that case, can you review this PR? @prashantwason I had missed pushing some local changes; can you take another pass? I think it should address all your comments. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yanghua edited a comment on pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism
yanghua edited a comment on pull request #2433: URL: https://github.com/apache/hudi/pull/2433#issuecomment-758474842 > The file check of each task is useless because even if a task of the source has no data for some time interval, the checkpoint can still trigger normally. So all tasks checkpointing successfully does not mean there is data. (I have solved this in my #step1 PR though.) That's true. We could remove the file check logic, while the implementation of multiple parallelism is still reasonable. > There is no need to checkpoint the write status in KeyedWriteProcessOperator, because we cannot start a new instant if the last instant fails; the more proper/simple way is to retry the commit actions several times and trigger failover if it still fails. Sounds reasonable, agreed. > BTW, IMO, we should finish RFC-24 first as fast as possible; it solves many bugs and has many improvements. After that I would add a compatible pipeline, this PR could apply there, and I can help to review. I personally think that a better form of community participation is: 1) control the granularity of changes; 2) make each submission a complete function point, so that the working behavior of the code does not change. I actually want to know whether you have an idea for the compatible version of `OperatorCoordinator`. What is the design? Will it be better than the current one? Will it be better than this PR's design? All of this is opaque. Your Step 1 is a large-scale refactoring, and once merged it will make this client immediately unavailable. We are currently in a critical period before the 0.7 release (what if we do not have the energy to merge steps 2 and 3 in the short term?). Why can't we optimize it step by step? In fact, the first and second things you need to optimize are the File Assigner. SF Express has already implemented it and is ready to provide a PR. If we improve on the existing basis, the risks and changes are controllable, right? I think that providing a more "elegant implementation" of OperatorCoordinator for the higher version in the end is the correct order. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] n3nash commented on a change in pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas
n3nash commented on a change in pull request #2424: URL: https://github.com/apache/hudi/pull/2424#discussion_r71734 ## File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java ## @@ -292,53 +284,57 @@ public static GenericRecord stitchRecords(GenericRecord left, GenericRecord righ return result; } - /** - * Given a avro record with a given schema, rewrites it into the new schema while setting fields only from the old - * schema. - */ - public static GenericRecord rewriteRecord(GenericRecord record, Schema newSchema) { -return rewrite(record, getCombinedFieldsToWrite(record.getSchema(), newSchema), newSchema); - } - /** * Given a avro record with a given schema, rewrites it into the new schema while setting fields only from the new * schema. + * NOTE: Here, the assumption is that you cannot go from an evolved schema (schema with (N) fields) + * to an older schema (schema with (N-1) fields). All fields present in the older record schema MUST be present in the + * new schema and the default/existing values are carried over. + * This particular method does the following things : + * a) Create a new empty GenericRecord with the new schema. + * b) Set default values for all fields of this transformed schema in the new GenericRecord or copy over the data + * from the old schema to the new schema + * c) hoodie_metadata_fields have a special treatment. This is done because for code generated AVRO classes + * (only HoodieMetadataRecord), the avro record is a SpecificBaseRecord type instead of a GenericRecord. + * SpecificBaseRecord throws null pointer exception for record.get(name) if name is not present in the schema of the + * record (which happens when converting a SpecificBaseRecord without hoodie_metadata_fields to a new record with. */ - public static GenericRecord rewriteRecordWithOnlyNewSchemaFields(GenericRecord record, Schema newSchema) { -return rewrite(record, new LinkedHashSet<>(newSchema.getFields()), newSchema); - } - - private static GenericRecord rewrite(GenericRecord record, LinkedHashSet fieldsToWrite, Schema newSchema) { + public static GenericRecord rewriteRecord(GenericRecord oldRecord, Schema newSchema) { GenericRecord newRecord = new GenericData.Record(newSchema); -for (Schema.Field f : fieldsToWrite) { - if (record.get(f.name()) == null) { +boolean isSpecificRecord = oldRecord instanceof SpecificRecordBase; +for (Schema.Field f : newSchema.getFields()) { + if (!isMetadataField(f.name()) && oldRecord.get(f.name()) == null) { +// if not metadata field, set defaults if (f.defaultVal() instanceof JsonProperties.Null) { newRecord.put(f.name(), null); } else { newRecord.put(f.name(), f.defaultVal()); } - } else { -newRecord.put(f.name(), record.get(f.name())); + } else if (!isMetadataField(f.name())) { +// if not metadata field, copy old value +newRecord.put(f.name(), oldRecord.get(f.name())); + } else if (!isSpecificRecord) { +// if not specific record, copy value for hoodie metadata fields as well Review comment: We don't need to, updated the comments, please take a read and let me know if it's clear This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] n3nash commented on pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas
n3nash commented on pull request #2424: URL: https://github.com/apache/hudi/pull/2424#issuecomment-758475039 @pratyakshsharma in that case, can you review this PR? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yanghua commented on pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism
yanghua commented on pull request #2433: URL: https://github.com/apache/hudi/pull/2433#issuecomment-758474842 > The file check of each task is useless because even if a task of the source has no data for some time interval, the checkpoint can still trigger normally. So all tasks checkpointing successfully does not mean there is data. (I have solved this in my #step1 PR though.) That's true. We could remove the file check logic, while the implementation of multiple parallelism is still reasonable. > There is no need to checkpoint the write status in KeyedWriteProcessOperator, because we cannot start a new instant if the last instant fails; the more proper/simple way is to retry the commit actions several times and trigger failover if it still fails. Sounds reasonable, agreed. > BTW, IMO, we should finish RFC-24 first as fast as possible; it solves many bugs and has many improvements. After that I would add a compatible pipeline, this PR could apply there, and I can help to review. I personally think that a better form of community participation is: 1) control the granularity of changes; 2) make each submission a complete function point, so that the working behavior of the code does not change. I actually want to know whether you have an idea for the compatible version of `OperatorCoordinator`. What is the design? Will it be better than the current one? Will it be better than this PR's design? All of this is opaque. Your Step 1 is a large-scale refactoring, and once merged it will make this client immediately unavailable. We are currently in a critical period before the 0.7 release (what if we do not have the energy to merge steps 2 and 3 in the short term?). Why can't we optimize it step by step? In fact, the first and second things you need to optimize are the File Assigner. SF Express has already implemented it and is ready to provide a PR. If we improve on the existing basis, the risks and changes are controllable, right? I think that providing a more "elegant implementation" of OperatorCoordinator for the higher version in the end is the correct order. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pratyakshsharma commented on pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas
pratyakshsharma commented on pull request #2424: URL: https://github.com/apache/hudi/pull/2424#issuecomment-758461542 @n3nash In my previous org, we were dealing with a similar scenario where fields were getting deleted from a few tables in production. Yes, the parquet-avro reader will throw an exception in the scenario you mentioned. We were actually using the schema registry to create and store an uber schema, so that every field is present in the final schema before actually writing to parquet files. We created the uber schema at the start of DeltaStreamer and used the same schema for the ingestion. I guess all this is beyond the scope of this PR. We can initiate a separate discussion to support deletion of fields from the schema. :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
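A minimal sketch of the uber-schema idea described above, assuming the merge is simply a union of the field sets of the previously registered schema and the latest incoming one; this is illustrative only (the class and method names are made up, not the actual schema-registry integration):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.avro.Schema;

// Hypothetical sketch: union the fields of the latest schema with those of the previously
// registered schema, so a field deleted upstream is still present (with its default) when
// writing parquet files. A real integration would also validate types and defaults.
public class UberSchemaBuilder {
  public static Schema merge(Schema previous, Schema latest) {
    Map<String, Schema.Field> fieldsByName = new LinkedHashMap<>();
    for (Schema.Field f : latest.getFields()) {
      // Avro fields cannot be reused across schemas, so copy each one
      fieldsByName.put(f.name(), new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()));
    }
    for (Schema.Field f : previous.getFields()) {
      // keep fields that were deleted in the latest schema
      fieldsByName.putIfAbsent(f.name(), new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()));
    }
    Schema merged = Schema.createRecord(latest.getName(), latest.getDoc(), latest.getNamespace(), false);
    merged.setFields(new ArrayList<>(fieldsByName.values()));
    return merged;
  }
}
```

Presumably the merged schema would then be registered back, so every ingestion run sees the same superset of fields.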
[GitHub] [hudi] loukey-lj opened a new pull request #2434: [HUDI-1511] InstantGenerateOperator support multiple parallelism
loukey-lj opened a new pull request #2434: URL: https://github.com/apache/hudi/pull/2434 InstantGenerateOperator supports multiple parallelism. When the InstantGenerateOperator subtask count is greater than 1, we can designate subtask 0 as the main subtask, and only the main subtask creates a new instant. The prerequisite for creating a new instant is that some subtask received data during the current checkpoint. Every subtask creates a tmp file whose name is made up of the checkpoint id, the subtask index, and the number of records received. The main subtask checks every subtask's file and parses it to decide whether a new instant should be created. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
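A rough sketch of that coordination scheme, assuming a shared temp directory and a "checkpointId_subtaskIndex_recordCount" marker-file name; these details are assumptions for illustration, not the PR's actual implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Illustrative sketch: every subtask writes a marker file for the checkpoint, and only
// subtask 0 (the "main" subtask) scans the markers to decide whether any data arrived,
// i.e. whether a new instant needs to be created.
public class InstantCoordinationSketch {

  // each subtask records how many elements it saw during this checkpoint
  static Path writeMarker(Path tmpDir, long checkpointId, int subtaskIndex, long recordCount) throws IOException {
    Path marker = tmpDir.resolve(checkpointId + "_" + subtaskIndex + "_" + recordCount);
    return Files.createFile(marker);
  }

  // run by the main subtask only: a new instant is needed iff at least one subtask
  // reported a non-zero record count for this checkpoint
  static boolean shouldCreateNewInstant(Path tmpDir, long checkpointId) throws IOException {
    try (Stream<Path> files = Files.list(tmpDir)) {
      return files.map(p -> p.getFileName().toString())
          .filter(name -> name.startsWith(checkpointId + "_"))
          .anyMatch(name -> Long.parseLong(name.substring(name.lastIndexOf('_') + 1)) > 0);
    }
  }
}
```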
[GitHub] [hudi] n3nash commented on pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas
n3nash commented on pull request #2424: URL: https://github.com/apache/hudi/pull/2424#issuecomment-758449461 @pratyakshsharma Do you have a use case for deleting fields? What is the reason for supporting field deletion? Has the field-deletion case been tested for all types of operations, such as upserts? Generally, the parquet-avro reader will currently throw an exception when a smaller schema (a schema from which a field has been deleted) is used to read a parquet file written with a larger schema. Have you tested this scenario? If not, I suggest we revert this particular change and think of a more holistic way to support deletion of fields from the schema. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] loukey-lj closed pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism
loukey-lj closed pull request #2433: URL: https://github.com/apache/hudi/pull/2433 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] bvaradar closed issue #2414: [SUPPORT]
bvaradar closed issue #2414: URL: https://github.com/apache/hudi/issues/2414 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] bvaradar edited a comment on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2
bvaradar edited a comment on issue #2423: URL: https://github.com/apache/hudi/issues/2423#issuecomment-758433327 Hudi does not synchronize on partition path creation. Instead, each executor task that is about to write to a parquet file ensures the directory path exists by issuing an fs.mkdirs call. Added: https://issues.apache.org/jira/browse/HUDI-1523 If mkdirs is a costly API, can you try this patch? It trades off the mkdirs call against a getFileStatus() call:

```
diff --git a/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java b/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
index d148b1b8..11b3cb49 100644
--- a/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
+++ b/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
@@ -105,7 +105,9 @@ public abstract class HoodieWriteHandle extends H
   public Path makeNewPath(String partitionPath) {
     Path path = FSUtils.getPartitionPath(config.getBasePath(), partitionPath);
     try {
-      fs.mkdirs(path); // create a new partition as needed.
+      if (!fs.exists(path)) {
+        fs.mkdirs(path); // create a new partition as needed.
+      }
     } catch (IOException e) {
       throw new HoodieIOException("Failed to make dir " + path, e);
     }
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] bvaradar commented on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2
bvaradar commented on issue #2423: URL: https://github.com/apache/hudi/issues/2423#issuecomment-758433327 Hudi does not synchronize on partition path creation. Instead, each executor task that is about to write to a parquet file ensures the directory path exists by issuing an fs.mkdirs call. Added: https://issues.apache.org/jira/browse/HUDI-1523 If mkdirs is a costly API, can you try this patch? It trades off the mkdirs call against a getFileStatus() call:

```
diff --git a/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java b/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
index d148b1b8..11b3cb49 100644
--- a/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
+++ b/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
@@ -105,7 +105,9 @@ public abstract class HoodieWriteHandle extends H
   public Path makeNewPath(String partitionPath) {
     Path path = FSUtils.getPartitionPath(config.getBasePath(), partitionPath);
     try {
-      fs.mkdirs(path); // create a new partition as needed.
+      if (!fs.exists(path)) {
+        fs.mkdirs(path); // create a new partition as needed.
+      }
     } catch (IOException e) {
       throw new HoodieIOException("Failed to make dir " + path, e);
     }
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-1523) Avoid excessive mkdir calls when creating new files
Balaji Varadarajan created HUDI-1523: Summary: Avoid excessive mkdir calls when creating new files Key: HUDI-1523 URL: https://issues.apache.org/jira/browse/HUDI-1523 Project: Apache Hudi Issue Type: Improvement Components: Writer Core Reporter: Balaji Varadarajan Fix For: 0.8.0 https://github.com/apache/hudi/issues/2423 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] Karl-WangSK commented on pull request #2260: [HUDI-1381] Schedule compaction based on time elapsed
Karl-WangSK commented on pull request #2260: URL: https://github.com/apache/hudi/pull/2260#issuecomment-758424635 @wangxianghu This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] bvaradar commented on issue #2432: [SUPPORT] write hudi data failed when using Deltastreamer
bvaradar commented on issue #2432: URL: https://github.com/apache/hudi/issues/2432#issuecomment-758403084 @quitozang: Binding to port 0 should ensure that the OS assigns a random free port. I am not sure why you are seeing the error. You can work around this by setting hoodie.embed.timeline.server.port to a designated port (a number between 1025 and 65535). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
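For instance (illustrative only; the value shown is an arbitrary free port, not a recommended one), the override could be added to the writer's hoodie properties, e.g. the properties file passed to DeltaStreamer:

```
# illustrative value only: pin the embedded timeline server to a designated free port
hoodie.embed.timeline.server.port=26754
```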
[GitHub] [hudi] danny0405 commented on pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism
danny0405 commented on pull request #2433: URL: https://github.com/apache/hudi/pull/2433#issuecomment-758372866 The file check of each task is useless because even if a task of the source has no data for some time interval, the checkpoint can still trigger normally. So all tasks checkpointing successfully does not mean there is data. (I have solved this in my #step1 PR though.) There is no need to checkpoint the write status in KeyedWriteProcessOperator, because we cannot start a new instant if the last instant fails; the more proper/simple way is to retry the commit actions several times and trigger failover if it still fails. BTW, IMO, we should finish RFC-24 first as fast as possible; it solves many bugs and has many improvements. After that I would add a compatible pipeline, this PR could apply there, and I can help to review. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
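A generic sketch of the retry-then-failover idea mentioned there (not tied to the actual Hudi/Flink APIs; the class and method names are made up):

```java
// Illustrative only: retry the commit action a bounded number of times and rethrow the
// last failure if all attempts fail, so the framework can trigger a failover.
public class CommitRetrySketch {
  public static void commitWithRetry(Runnable commitAction, int maxAttempts) {
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        commitAction.run();
        return; // commit succeeded
      } catch (RuntimeException e) {
        last = e; // remember the failure and try again
      }
    }
    // all retries exhausted: surface the failure instead of starting a new instant
    throw new RuntimeException("commit failed after " + maxAttempts + " attempts", last);
  }
}
```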
[GitHub] [hudi] jtmzheng commented on issue #2408: [SUPPORT] OutOfMemory on upserting into MOR dataset
jtmzheng commented on issue #2408: URL: https://github.com/apache/hudi/issues/2408#issuecomment-758360941 Thanks Udit! I'd tried setting `hoodie.commits.archival.batch` to 5 earlier today after going through the source code; that got my application back up and running again. The first bug definitely seems like the root cause: after turning on more verbose logging I found several 300 MB commit files being loaded in for archival before the crash (re: the second bug, https://github.com/apache/hudi/blob/e3d3677b7e7899705b624925666317f0c074f7c7/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTimelineArchiveLog.java#L353 clears the list, which isn't the most intuitive). It seems like these large commit files were generated when I set `hoodie.cleaner.commits.retained` to 1. What is the trade-off in lowering `hoodie.keep.max.commits` and `hoodie.keep.min.commits`? I couldn't find much good documentation on the archival process/configs. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] garyli1019 commented on a change in pull request #2412: [HUDI-1512] Fix spark 2 unit tests failure with Spark 3
garyli1019 commented on a change in pull request #2412: URL: https://github.com/apache/hudi/pull/2412#discussion_r555477972 ## File path: pom.xml ## @@ -1361,6 +1363,7 @@ ${fasterxml.spark3.version} ${fasterxml.spark3.version} ${fasterxml.spark3.version} +true Review comment: Do we need the vice versa as well? Should the Spark 2 build skip the unit tests of Spark 3? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] lw309637554 commented on pull request #2418: [HUDI-1266] Add unit test for validating replacecommit rollback
lw309637554 commented on pull request #2418: URL: https://github.com/apache/hudi/pull/2418#issuecomment-758337679 LGTM This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] lw309637554 commented on a change in pull request #2418: [HUDI-1266] Add unit test for validating replacecommit rollback
lw309637554 commented on a change in pull request #2418: URL: https://github.com/apache/hudi/pull/2418#discussion_r555456066 ## File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/rollback/HoodieClientRollbackTestBase.java ## @@ -96,4 +99,61 @@ protected void twoUpsertCommitDataWithTwoPartitions(List firstPartiti assertEquals(1, secondPartitionCommit2FileSlices.size()); } } + + protected void insertOverwriteCommitDataWithTwoPartitions(List firstPartitionCommit2FileSlices, +List secondPartitionCommit2FileSlices, +HoodieWriteConfig cfg, +boolean commitSecondInsertOverwrite) throws IOException { +//just generate two partitions +dataGen = new HoodieTestDataGenerator(new String[]{DEFAULT_FIRST_PARTITION_PATH, DEFAULT_SECOND_PARTITION_PATH}); +HoodieTestDataGenerator.writePartitionMetadata(fs, new String[]{DEFAULT_FIRST_PARTITION_PATH, DEFAULT_SECOND_PARTITION_PATH}, basePath); +SparkRDDWriteClient client = getHoodieWriteClient(cfg); +/** + * Write 1 (upsert) + */ +String newCommitTime = "001"; +List records = dataGen.generateInsertsContainsAllPartitions(newCommitTime, 2); Review comment: OK, makes sense. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Resolved] (HUDI-1520) add configure for spark sql overwrite use replace
[ https://issues.apache.org/jira/browse/HUDI-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei resolved HUDI-1520. - Resolution: Fixed > add configure for spark sql overwrite use replace > - > > Key: HUDI-1520 > URL: https://issues.apache.org/jira/browse/HUDI-1520 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: liwei >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] yanghua commented on pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism
yanghua commented on pull request #2433: URL: https://github.com/apache/hudi/pull/2433#issuecomment-758327239 @danny0405 wdyt about this optimization? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-1511) InstantGenerateOperator support multiple parallelism
[ https://issues.apache.org/jira/browse/HUDI-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1511: - Labels: pull-request-available (was: ) > InstantGenerateOperator support multiple parallelism > > > Key: HUDI-1511 > URL: https://issues.apache.org/jira/browse/HUDI-1511 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: wangxianghu >Assignee: loukey_j >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] yanghua commented on pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism
yanghua commented on pull request #2433: URL: https://github.com/apache/hudi/pull/2433#issuecomment-758321935 @loukey-lj thanks for your contribution! Can you please: 1) Fix the Travis issue? It's red now; 2) Update the RFC-13 and describe your optimization. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] umehrot2 commented on issue #2408: [SUPPORT] OutOfMemory on upserting into MOR dataset
umehrot2 commented on issue #2408: URL: https://github.com/apache/hudi/issues/2408#issuecomment-758321313 For now, I would suggest archiving at smaller intervals. Maybe try out something like: - `hoodie.keep.max.commits`: 10 - `hoodie.keep.min.commits`: 10 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] umehrot2 commented on issue #2408: [SUPPORT] OutOfMemory on upserting into MOR dataset
umehrot2 commented on issue #2408: URL: https://github.com/apache/hudi/issues/2408#issuecomment-758320870 I took a deeper look at this. For you this seems to be happening in the archival code path:

```
at org.apache.hudi.table.HoodieTimelineArchiveLog.writeToFile(HoodieTimelineArchiveLog.java:309)
at org.apache.hudi.table.HoodieTimelineArchiveLog.archive(HoodieTimelineArchiveLog.java:282)
at org.apache.hudi.table.HoodieTimelineArchiveLog.archiveIfRequired(HoodieTimelineArchiveLog.java:133)
at org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:381)
```

`HoodieTimelineArchiveLog` needs to write log files with commit records, similar to how log files are written for MOR tables. However, in this code I notice a couple of issues:
- The default maximum log block size of 256 MB defined [here](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L51) is not used by this class and is only applied in the MOR log block writing case. As a result, there is no real control over the block size it can end up writing, which can potentially overflow `ByteArrayOutputStream`, whose maximum size is `Integer.MAX_VALUE - 8`. That is what seems to be happening in this scenario, due to an integer overflow along that code path inside `ByteArrayOutputStream`. So we need to use the maximum block size concept here as well.
- In addition, I see a bug in the code [here](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTimelineArchiveLog.java#L302) where, even after flushing the records out to a file once the batch size of 10 (default) is reached, it is not clearing the list and just goes on accumulating records. This seems logically wrong (duplication), apart from the fact that it keeps increasing the size of the log file blocks being written.

I will open a JIRA to track this. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
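A minimal sketch of the batching behavior described in the second point, assuming the fix is simply to clear the buffer after each flush; this is not the actual HoodieTimelineArchiveLog code, and the class, method, and record types here are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: flush accumulated commit records in fixed-size batches and clear the
// buffer after every flush, so neither the in-memory list nor the log block built from it
// keeps growing, and no record is written twice.
public class BatchedArchiverSketch {
  private static final int ARCHIVAL_BATCH_SIZE = 10; // mirrors the default batch size mentioned above

  public static void archive(List<String> commitRecords) {
    List<String> buffer = new ArrayList<>();
    for (String record : commitRecords) {
      buffer.add(record);
      if (buffer.size() >= ARCHIVAL_BATCH_SIZE) {
        flush(buffer);
        buffer.clear(); // the missing step: without it, records are duplicated across blocks
      }
    }
    if (!buffer.isEmpty()) {
      flush(buffer); // write the remaining tail
    }
  }

  private static void flush(List<String> batch) {
    System.out.println("writing one archive log block with " + batch.size() + " records");
  }
}
```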
[hudi] branch master updated: [HUDI-1502] MOR rollback and restore support for metadata sync (#2421)
This is an automated email from the ASF dual-hosted git repository. vinoth pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new e3d3677 [HUDI-1502] MOR rollback and restore support for metadata sync (#2421) e3d3677 is described below commit e3d3677b7e7899705b624925666317f0c074f7c7 Author: Sivabalan Narayanan AuthorDate: Mon Jan 11 16:23:13 2021 -0500 [HUDI-1502] MOR rollback and restore support for metadata sync (#2421) - Adds field to RollbackMetadata that capture the logs written for rollback blocks - Adds field to RollbackMetadata that capture new logs files written by unsynced deltacommits Co-authored-by: Vinoth Chandar --- .../org/apache/hudi/io/HoodieAppendHandle.java | 1 + .../AbstractMarkerBasedRollbackStrategy.java | 26 +++-- .../BaseMergeOnReadRollbackActionExecutor.java | 3 +- .../hudi/table/action/rollback/RollbackUtils.java | 6 +- .../rollback/ListingBasedRollbackHelper.java | 21 +++- .../rollback/SparkMarkerBasedRollbackStrategy.java | 14 +++ .../hudi/metadata/TestHoodieBackedMetadata.java| 43 --- .../rollback/TestMarkerBasedRollbackStrategy.java | 130 +++-- .../src/main/avro/HoodieRollbackMetadata.avsc | 16 ++- .../org/apache/hudi/common/HoodieRollbackStat.java | 20 +++- .../java/org/apache/hudi/common/fs/FSUtils.java| 13 ++- .../table/timeline/TimelineMetadataUtils.java | 28 ++--- .../hudi/metadata/HoodieTableMetadataUtil.java | 34 -- .../hudi/common/table/TestTimelineUtils.java | 27 ++--- .../table/view/TestIncrementalFSViewSync.java | 2 +- 15 files changed, 268 insertions(+), 116 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java index 3faac2e..c6ea7ba 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java @@ -164,6 +164,7 @@ public class HoodieAppendHandle extends // Since the actual log file written to can be different based on when rollover happens, we use the // base file to denote some log appends happened on a slice. writeToken will still fence concurrent // writers. 
+// https://issues.apache.org/jira/browse/HUDI-1517 createMarkerFile(partitionPath, FSUtils.makeDataFileName(baseInstantTime, writeToken, fileId, hoodieTable.getBaseFileExtension())); this.writer = createLogWriter(fileSlice, baseInstantTime); diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/AbstractMarkerBasedRollbackStrategy.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/AbstractMarkerBasedRollbackStrategy.java index cb6fff9..cc596ba 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/AbstractMarkerBasedRollbackStrategy.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/AbstractMarkerBasedRollbackStrategy.java @@ -53,9 +53,9 @@ public abstract class AbstractMarkerBasedRollbackStrategy table, HoodieEngineContext context, HoodieWriteConfig config, String instantTime) { this.table = table; @@ -90,6 +90,7 @@ public abstract class AbstractMarkerBasedRollbackStrategy writtenLogFileSizeMap = getWrittenLogFileSizeMap(partitionPath, baseCommitTime, fileId); HoodieLogFormat.Writer writer = null; try { @@ -121,17 +122,26 @@ public abstract class AbstractMarkerBasedRollbackStrategy filesToNumBlocksRollback = Collections.emptyMap(); -if (config.useFileListingMetadata()) { - // When metadata is enabled, the information of files appended to is required - filesToNumBlocksRollback = Collections.singletonMap( +// the information of files appended to is required for metadata sync +Map filesToNumBlocksRollback = Collections.singletonMap( table.getMetaClient().getFs().getFileStatus(Objects.requireNonNull(writer).getLogFile().getPath()), 1L); -} return HoodieRollbackStat.newBuilder() .withPartitionPath(partitionPath) .withRollbackBlockAppendResults(filesToNumBlocksRollback) -.build(); +.withWrittenLogFileSizeMap(writtenLogFileSizeMap).build(); + } + + /** + * Returns written log file size map for the respective baseCommitTime to assist in metadata table syncing. + * @param partitionPath partition path of interest + * @param baseCommitTime base commit time of interest + * @param fileId fileId of interest + * @return Map + * @throws IOException + */ + protected Map getWrittenLogFileSizeMap(String partitionPath, String
[GitHub] [hudi] vinothchandar merged pull request #2421: [HUDI-1502] MOR rollback and restore support for metadata sync
vinothchandar merged pull request #2421: URL: https://github.com/apache/hudi/pull/2421 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codecov-io edited a comment on pull request #2412: [HUDI-1512] Fix spark 2 unit tests failure with Spark 3
codecov-io edited a comment on pull request #2412: URL: https://github.com/apache/hudi/pull/2412#issuecomment-755726635 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codecov-io edited a comment on pull request #2412: [HUDI-1512] Fix spark 2 unit tests failure with Spark 3
codecov-io edited a comment on pull request #2412: URL: https://github.com/apache/hudi/pull/2412#issuecomment-755726635 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2412?src=pr=h1) Report > Merging [#2412](https://codecov.io/gh/apache/hudi/pull/2412?src=pr=desc) (9b9a5c9) into [master](https://codecov.io/gh/apache/hudi/commit/698694a1571cdcc9848fc79aa34c8cbbf9662bc4?el=desc) (698694a) will **decrease** coverage by `40.54%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2412/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2412?src=pr=tree) ```diff @@ Coverage Diff @@ ## master #2412 +/- ## - Coverage 50.23% 9.68% -40.55% + Complexity 2985 48 -2937 Files 410 53 -357 Lines 183981930-16468 Branches 1884 230 -1654 - Hits 9242 187 -9055 + Misses 83981730 -6668 + Partials758 13 -745 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudicli | `?` | `?` | | | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudicommon | `?` | `?` | | | hudiflink | `?` | `?` | | | hudihadoopmr | `?` | `?` | | | hudisparkdatasource | `?` | `?` | | | hudisync | `?` | `?` | | | huditimelineservice | `?` | `?` | | | hudiutilities | `9.68% <ø> (-59.97%)` | `0.00 <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2412?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2412/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | | |
[GitHub] [hudi] codecov-io edited a comment on pull request #2421: [HUDI-1502] MOR rollback and restore support for metadata sync
codecov-io edited a comment on pull request #2421: URL: https://github.com/apache/hudi/pull/2421#issuecomment-757112911 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2421?src=pr=h1) Report > Merging [#2421](https://codecov.io/gh/apache/hudi/pull/2421?src=pr=desc) (c2647c3) into [master](https://codecov.io/gh/apache/hudi/commit/c151147819abb6b4c418138d9b401f374fc021a0?el=desc) (c151147) will **decrease** coverage by `0.35%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2421/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2421?src=pr=tree) ```diff @@ Coverage Diff @@ ## master #2421 +/- ## === - Coverage 10.04% 9.68% -0.36% Complexity 48 48 === Files52 53 +1 Lines 18521930 +78 Branches223 230 +7 === + Hits186 187 +1 - Misses 16531730 +77 Partials 13 13 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudiutilities | `9.68% <ø> (-0.36%)` | `0.00 <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2421?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2421/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=) | `34.10% <0.00%> (-1.44%)` | `11.00% <0.00%> (ø%)` | | | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2421/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `0.00% <0.00%> (ø)` | `0.00% <0.00%> (ø%)` | | | [...org/apache/hudi/utilities/HoodieClusteringJob.java](https://codecov.io/gh/apache/hudi/pull/2421/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZUNsdXN0ZXJpbmdKb2IuamF2YQ==) | `0.00% <0.00%> (ø)` | `0.00% <0.00%> (?%)` | | | [.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2421/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==) | `83.62% <0.00%> (+0.14%)` | `28.00% <0.00%> (ø%)` | | This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] satishkotha commented on issue #2346: [SUPPORT]The rt view query returns a wrong result with predicate push down.
satishkotha commented on issue #2346: URL: https://github.com/apache/hudi/issues/2346#issuecomment-758169722 @sumihehe Did you get a chance to look at the above? It would be helpful if you could provide more information. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pratyakshsharma commented on pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas
pratyakshsharma commented on pull request #2424: URL: https://github.com/apache/hudi/pull/2424#issuecomment-758162231 @n3nash Just a high-level thought before going through the changes thoroughly: how about keeping the old changes as well and introducing a config `setDefaultValueWithSchemaEvolutionByDeletingFields` to support schema evolution in the case of field deletion? By default we can keep it false to avoid the degradation pointed out by @prashantwason. Thoughts? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pratyakshsharma commented on pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas
pratyakshsharma commented on pull request #2424: URL: https://github.com/apache/hudi/pull/2424#issuecomment-758154222 > @n3nash what is the commit being reverted? https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-1509) Major performance degradation due to rewriting records with default values
[ https://issues.apache.org/jira/browse/HUDI-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262865#comment-17262865 ] Pratyaksh Sharma commented on HUDI-1509: [~nishith29] The PR was a generic one where point #2 mentioned by you was also supported. To answer your other question, I did not notice any such degradation. > Major performance degradation due to rewriting records with default values > -- > > Key: HUDI-1509 > URL: https://issues.apache.org/jira/browse/HUDI-1509 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.6.0, 0.6.1, 0.7.0 >Reporter: Prashant Wason >Assignee: Nishith Agarwal >Priority: Blocker > Labels: pull-request-available > Fix For: 0.7.0 > > > During the in-house testing for 0.5x to 0.6x release upgrade, I have detected > a performance degradation for writes into HUDI. I have traced the issue due > to the changes in the following commit > [[HUDI-727]: Copy default values of fields if not present when rewriting > incoming record with new > schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253] > I wrote a unit test to reduce the scope of testing as follows: > # Take an existing parquet file from production dataset (size=690MB, > #records=960K) > # Read all the records from this parquet into a JavaRDD > # Time the call HoodieWriteClient.bulkInsertPrepped(). > (bulkInsertParallelism=1) > The above scenario is directly taken from our production pipelines where each > executor will ingest about a million record creating a single parquet file in > a COW dataset. This is bulk insert only dataset. > The time to complete the bulk insert prepped *decreased from 680seconds to > 380seconds* when I reverted the above commit. > Schema details: This HUDI dataset uses a large schema with 51 fields in the > record. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] prashantwason commented on a change in pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas
prashantwason commented on a change in pull request #2424: URL: https://github.com/apache/hudi/pull/2424#discussion_r555258349 ## File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java ## @@ -292,53 +284,57 @@ public static GenericRecord stitchRecords(GenericRecord left, GenericRecord righ return result; } - /** - * Given a avro record with a given schema, rewrites it into the new schema while setting fields only from the old - * schema. - */ - public static GenericRecord rewriteRecord(GenericRecord record, Schema newSchema) { -return rewrite(record, getCombinedFieldsToWrite(record.getSchema(), newSchema), newSchema); - } - /** * Given a avro record with a given schema, rewrites it into the new schema while setting fields only from the new * schema. + * NOTE: Here, the assumption is that you cannot go from an evolved schema (schema with (N) fields) + * to an older schema (schema with (N-1) fields). All fields present in the older record schema MUST be present in the + * new schema and the default/existing values are carried over. + * This particular method does the following things : + * a) Create a new empty GenericRecord with the new schema. + * b) Set default values for all fields of this transformed schema in the new GenericRecord or copy over the data + * from the old schema to the new schema + * c) hoodie_metadata_fields have a special treatment. This is done because for code generated AVRO classes + * (only HoodieMetadataRecord), the avro record is a SpecificBaseRecord type instead of a GenericRecord. + * SpecificBaseRecord throws null pointer exception for record.get(name) if name is not present in the schema of the + * record (which happens when converting a SpecificBaseRecord without hoodie_metadata_fields to a new record with. */ - public static GenericRecord rewriteRecordWithOnlyNewSchemaFields(GenericRecord record, Schema newSchema) { -return rewrite(record, new LinkedHashSet<>(newSchema.getFields()), newSchema); - } - - private static GenericRecord rewrite(GenericRecord record, LinkedHashSet fieldsToWrite, Schema newSchema) { + public static GenericRecord rewriteRecord(GenericRecord oldRecord, Schema newSchema) { GenericRecord newRecord = new GenericData.Record(newSchema); -for (Schema.Field f : fieldsToWrite) { - if (record.get(f.name()) == null) { +boolean isSpecificRecord = oldRecord instanceof SpecificRecordBase; +for (Schema.Field f : newSchema.getFields()) { + if (!isMetadataField(f.name()) && oldRecord.get(f.name()) == null) { +// if not metadata field, set defaults if (f.defaultVal() instanceof JsonProperties.Null) { newRecord.put(f.name(), null); } else { newRecord.put(f.name(), f.defaultVal()); } - } else { -newRecord.put(f.name(), record.get(f.name())); + } else if (!isMetadataField(f.name())) { +// if not metadata field, copy old value +newRecord.put(f.name(), oldRecord.get(f.name())); + } else if (!isSpecificRecord) { +// if not specific record, copy value for hoodie metadata fields as well Review comment: This seems counter-intuitive to the comment in the method. If SpecificRecord.get() throws NULL exception if the field is not there, wont we want to populate the metadata fields for it? ## File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java ## @@ -292,53 +284,57 @@ public static GenericRecord stitchRecords(GenericRecord left, GenericRecord righ return result; } - /** - * Given a avro record with a given schema, rewrites it into the new schema while setting fields only from the old - * schema. 
- */ - public static GenericRecord rewriteRecord(GenericRecord record, Schema newSchema) { -return rewrite(record, getCombinedFieldsToWrite(record.getSchema(), newSchema), newSchema); - } - /** * Given a avro record with a given schema, rewrites it into the new schema while setting fields only from the new * schema. + * NOTE: Here, the assumption is that you cannot go from an evolved schema (schema with (N) fields) + * to an older schema (schema with (N-1) fields). All fields present in the older record schema MUST be present in the + * new schema and the default/existing values are carried over. + * This particular method does the following things : + * a) Create a new empty GenericRecord with the new schema. + * b) Set default values for all fields of this transformed schema in the new GenericRecord or copy over the data + * from the old schema to the new schema + * c) hoodie_metadata_fields have a special treatment. This is done because for code generated AVRO classes + * (only HoodieMetadataRecord), the avro record is a SpecificBaseRecord type instead of a GenericRecord. + * SpecificBaseRecord throws null
[GitHub] [hudi] satishkotha commented on a change in pull request #2418: [HUDI-1266] Add unit test for validating replacecommit rollback
satishkotha commented on a change in pull request #2418: URL: https://github.com/apache/hudi/pull/2418#discussion_r555258959 ## File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/rollback/HoodieClientRollbackTestBase.java ## @@ -96,4 +99,61 @@ protected void twoUpsertCommitDataWithTwoPartitions(List firstPartiti assertEquals(1, secondPartitionCommit2FileSlices.size()); } } + + protected void insertOverwriteCommitDataWithTwoPartitions(List firstPartitionCommit2FileSlices, Review comment: Yes, I want to add test for MOR too. But because of time constraints just added one for COW. I plan to add MOR test later on. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] satishkotha commented on a change in pull request #2418: [HUDI-1266] Add unit test for validating replacecommit rollback
satishkotha commented on a change in pull request #2418: URL: https://github.com/apache/hudi/pull/2418#discussion_r555258581 ## File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/rollback/HoodieClientRollbackTestBase.java ## @@ -96,4 +99,61 @@ protected void twoUpsertCommitDataWithTwoPartitions(List firstPartiti assertEquals(1, secondPartitionCommit2FileSlices.size()); } } + + protected void insertOverwriteCommitDataWithTwoPartitions(List firstPartitionCommit2FileSlices, +List secondPartitionCommit2FileSlices, +HoodieWriteConfig cfg, +boolean commitSecondInsertOverwrite) throws IOException { +//just generate two partitions +dataGen = new HoodieTestDataGenerator(new String[]{DEFAULT_FIRST_PARTITION_PATH, DEFAULT_SECOND_PARTITION_PATH}); +HoodieTestDataGenerator.writePartitionMetadata(fs, new String[]{DEFAULT_FIRST_PARTITION_PATH, DEFAULT_SECOND_PARTITION_PATH}, basePath); +SparkRDDWriteClient client = getHoodieWriteClient(cfg); +/** + * Write 1 (upsert) + */ +String newCommitTime = "001"; +List records = dataGen.generateInsertsContainsAllPartitions(newCommitTime, 2); Review comment: We can only reuse two lines because some of the state generated (List) is needed for subsequent operations. So i'm not inclined to add another method for this. let me know if you have strong opinion. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-1291) integration of replace with consolidated metadata
[ https://issues.apache.org/jira/browse/HUDI-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] satish closed HUDI-1291. Resolution: Fixed done as part of HUDI-1276 > integration of replace with consolidated metadata > - > > Key: HUDI-1291 > URL: https://issues.apache.org/jira/browse/HUDI-1291 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: satish >Assignee: satish >Priority: Major > Fix For: 0.7.0 > > > Consolidated metadata supports two operations: > 1) add files > 2) remove files > If file group is replaced, we don't want to immediately remove files from > metadata (to support restore/rollback). So we may have to change consolidated > metadata to track additional information for replaced fileIds > Today, Replaced FileIds are tracked as part of .replacecommit file. We can > continue this for short term, but may be more efficient to combine with > consolidated metadata -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] prashantwason commented on a change in pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas
prashantwason commented on a change in pull request #2424: URL: https://github.com/apache/hudi/pull/2424#discussion_r555221595 ## File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java ## @@ -292,53 +284,57 @@ public static GenericRecord stitchRecords(GenericRecord left, GenericRecord righ return result; } - /** - * Given a avro record with a given schema, rewrites it into the new schema while setting fields only from the old - * schema. - */ - public static GenericRecord rewriteRecord(GenericRecord record, Schema newSchema) { -return rewrite(record, getCombinedFieldsToWrite(record.getSchema(), newSchema), newSchema); - } - /** * Given a avro record with a given schema, rewrites it into the new schema while setting fields only from the new * schema. + * NOTE: Here, the assumption is that you cannot go from an evolved schema (schema with (N) fields) + * to an older schema (schema with (N-1) fields). All fields present in the older record schema MUST be present in the + * new schema and the default/existing values are carried over. + * This particular method does the following things : + * a) Create a new empty GenericRecord with the new schema. + * b) Set default values for all fields of this transformed schema in the new GenericRecord or copy over the data + * from the old schema to the new schema + * c) hoodie_metadata_fields have a special treatment. This is done because for code generated AVRO classes + * (only HoodieMetadataRecord), the avro record is a SpecificBaseRecord type instead of a GenericRecord. + * SpecificBaseRecord throws null pointer exception for record.get(name) if name is not present in the schema of the + * record (which happens when converting a SpecificBaseRecord without hoodie_metadata_fields to a new record with. */ - public static GenericRecord rewriteRecordWithOnlyNewSchemaFields(GenericRecord record, Schema newSchema) { -return rewrite(record, new LinkedHashSet<>(newSchema.getFields()), newSchema); - } - - private static GenericRecord rewrite(GenericRecord record, LinkedHashSet fieldsToWrite, Schema newSchema) { + public static GenericRecord rewriteRecord(GenericRecord oldRecord, Schema newSchema) { GenericRecord newRecord = new GenericData.Record(newSchema); -for (Schema.Field f : fieldsToWrite) { - if (record.get(f.name()) == null) { +boolean isSpecificRecord = oldRecord instanceof SpecificRecordBase; +for (Schema.Field f : newSchema.getFields()) { + if (!isMetadataField(f.name()) && oldRecord.get(f.name()) == null) { Review comment: Maybe create a local variable with value of oldRecord.get(f.name()) as this is a hash lookup within GenericRecord. This value is used at many places in this function. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
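A sketch of that suggestion (this is not the final HoodieAvroUtils code, and the metadata-field special casing from the diff above is omitted for brevity): read the field value once per iteration and branch on the cached result instead of repeating the hash lookup.

```java
import org.apache.avro.JsonProperties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class RewriteRecordSketch {
  // Illustrative variant of rewriteRecord: oldRecord.get(f.name()) is looked up once per
  // field and reused, instead of being called in every branch of the if/else chain.
  public static GenericRecord rewriteRecord(GenericRecord oldRecord, Schema newSchema) {
    GenericRecord newRecord = new GenericData.Record(newSchema);
    for (Schema.Field f : newSchema.getFields()) {
      Object oldValue = oldRecord.get(f.name()); // single hash lookup per field
      if (oldValue == null) {
        // absent in the old record: fall back to the field's default value
        newRecord.put(f.name(), f.defaultVal() instanceof JsonProperties.Null ? null : f.defaultVal());
      } else {
        // present in the old record: copy the cached value over
        newRecord.put(f.name(), oldValue);
      }
    }
    return newRecord;
  }
}
```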
[GitHub] [hudi] prashantwason commented on a change in pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas
prashantwason commented on a change in pull request #2424: URL: https://github.com/apache/hudi/pull/2424#discussion_r555219601 ## File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java ## @@ -292,53 +284,57 @@ public static GenericRecord stitchRecords(GenericRecord left, GenericRecord righ return result; } - /** - * Given a avro record with a given schema, rewrites it into the new schema while setting fields only from the old - * schema. - */ - public static GenericRecord rewriteRecord(GenericRecord record, Schema newSchema) { -return rewrite(record, getCombinedFieldsToWrite(record.getSchema(), newSchema), newSchema); - } - /** * Given a avro record with a given schema, rewrites it into the new schema while setting fields only from the new * schema. + * NOTE: Here, the assumption is that you cannot go from an evolved schema (schema with (N) fields) + * to an older schema (schema with (N-1) fields). All fields present in the older record schema MUST be present in the + * new schema and the default/existing values are carried over. + * This particular method does the following things : + * a) Create a new empty GenericRecord with the new schema. + * b) Set default values for all fields of this transformed schema in the new GenericRecord or copy over the data Review comment: Better to reword it the other way - copy over the data from old schema or (if absent) set default value of the field. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] zhedoubushishi commented on a change in pull request #2412: [HUDI-1512] Fix spark 2 unit tests failure with Spark 3
zhedoubushishi commented on a change in pull request #2412: URL: https://github.com/apache/hudi/pull/2412#discussion_r555214109 ## File path: pom.xml ## @@ -110,9 +110,10 @@ 2.4.4 3.0.0 1.8.2 -2.11.12 +2.11.12 Review comment: Makes sense to me. For consistency, I converted ```scala-11``` back to ```scala11```. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on pull request #2421: [HUDI-1502] MOR rollback and restore support for metadata sync
vinothchandar commented on pull request #2421: URL: https://github.com/apache/hudi/pull/2421#issuecomment-758098600 @nsivabalan pushed some small fixes. Please land once CI passes This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on a change in pull request #2421: [HUDI-1502] MOR rollback and restore support for metadata sync
vinothchandar commented on a change in pull request #2421: URL: https://github.com/apache/hudi/pull/2421#discussion_r555199075 ## File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java ## @@ -262,18 +264,33 @@ private static void processRollbackMetadata(HoodieRollbackMetadata rollbackMetad partitionToDeletedFiles.get(partition).addAll(deletedFiles); } - if (!pm.getAppendFiles().isEmpty()) { + if (hasRollbackLogFiles) { if (!partitionToAppendedFiles.containsKey(partition)) { partitionToAppendedFiles.put(partition, new HashMap<>()); } // Extract appended file name from the absolute paths saved in getAppendFiles() -pm.getAppendFiles().forEach((path, size) -> { +pm.getRollbackLogFiles().forEach((path, size) -> { partitionToAppendedFiles.get(partition).merge(new Path(path).getName(), size, (oldSize, newSizeCopy) -> { return size + oldSize; Review comment: we should change it here too. the picking of largest ## File path: hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java ## @@ -424,10 +425,29 @@ public static boolean isLogFile(Path logPath) { */ public static Stream getAllLogFiles(FileSystem fs, Path partitionPath, final String fileId, final String logFileExtension, final String baseCommitTime) throws IOException { -return Arrays -.stream(fs.listStatus(partitionPath, -path -> path.getName().startsWith("." + fileId) && path.getName().contains(logFileExtension))) -.map(HoodieLogFile::new).filter(s -> s.getBaseCommitTime().equals(baseCommitTime)); +try { + return Arrays + .stream(fs.listStatus(partitionPath, + path -> path.getName().startsWith("." + fileId) && path.getName().contains(logFileExtension))) + .map(HoodieLogFile::new).filter(s -> s.getBaseCommitTime().equals(baseCommitTime)); +} catch (FileNotFoundException e) { + return Stream.builder().build(); +} + } + + /** + * Get all the log files for the passed in FileId in the partition path. + */ + public static Stream getAllLogFiles(FileSystem fs, Path partitionPath, Review comment: is this used? need to check ## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/rollback/ListingBasedRollbackHelper.java ## @@ -116,6 +118,12 @@ public ListingBasedRollbackHelper(HoodieTableMetaClient metaClient, HoodieWriteC .withDeletedFileResults(filesToDeletedStatus).build()); } case APPEND_ROLLBACK_BLOCK: { + // collect all log files that is supposed to be deleted with this rollback + Map writtenLogFileSizeMap = FSUtils.getAllLogFiles(metaClient.getFs(), Review comment: its done below anyway. So all good ## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/rollback/ListingBasedRollbackHelper.java ## @@ -116,6 +118,12 @@ public ListingBasedRollbackHelper(HoodieTableMetaClient metaClient, HoodieWriteC .withDeletedFileResults(filesToDeletedStatus).build()); } case APPEND_ROLLBACK_BLOCK: { + // collect all log files that is supposed to be deleted with this rollback + Map writtenLogFileSizeMap = FSUtils.getAllLogFiles(metaClient.getFs(), Review comment: i think we should guard the option variables here. `rollbackRequest.getFileId()`, `rollbackRequest.getLatestBaseInstant()` with a `isPresent()` check This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
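For reference, a tiny sketch of the "pick the largest" merge being suggested for the appended-file sizes, using `Long::max` in place of the summing lambda shown in the diff; the file name and sizes below are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class RollbackLogFileSizeSketch {

  // When the same rollback log file is reported more than once, keep the largest
  // reported size instead of summing the sizes.
  static void recordAppendedFile(Map<String, Long> partitionAppendedFiles, String fileName, long size) {
    partitionAppendedFiles.merge(fileName, size, Long::max);
  }

  public static void main(String[] args) {
    Map<String, Long> appendedFiles = new HashMap<>();
    recordAppendedFile(appendedFiles, ".f1_20210111.log.1", 1024L);
    recordAppendedFile(appendedFiles, ".f1_20210111.log.1", 4096L);
    System.out.println(appendedFiles); // {.f1_20210111.log.1=4096}
  }
}
```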
[GitHub] [hudi] vinothchandar merged pull request #2428: [HUDI-1520] add configure for spark sql overwrite use INSERT_OVERWRIT_TABLE
vinothchandar merged pull request #2428: URL: https://github.com/apache/hudi/pull/2428 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated: [HUDI-1520] add configure for spark sql overwrite use INSERT_OVERWRITE_TABLE (#2428)
This is an automated email from the ASF dual-hosted git repository. vinoth pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new de42adc [HUDI-1520] add configure for spark sql overwrite use INSERT_OVERWRITE_TABLE (#2428) de42adc is described below commit de42adc2302528541145a714078cc6d10cbb8d9a Author: lw0090 AuthorDate: Tue Jan 12 01:07:47 2021 +0800 [HUDI-1520] add configure for spark sql overwrite use INSERT_OVERWRITE_TABLE (#2428) --- .../main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala | 13 ++--- .../org/apache/hudi/functional/TestCOWDataSource.scala | 3 ++- .../org/apache/hudi/functional/TestMORDataSource.scala | 1 - 3 files changed, 8 insertions(+), 9 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala index 472f450..4e9caa5 100644 --- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala +++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala @@ -94,13 +94,6 @@ private[hudi] object HoodieSparkSqlWriter { operation = WriteOperationType.INSERT } -// If the mode is Overwrite, can set operation to INSERT_OVERWRITE_TABLE. -// Then in DataSourceUtils.doWriteOperation will use client.insertOverwriteTable to overwrite -// the table. This will replace the old fs.delete(tablepath) mode. -if (mode == SaveMode.Overwrite && operation != WriteOperationType.INSERT_OVERWRITE_TABLE) { - operation = WriteOperationType.INSERT_OVERWRITE_TABLE -} - val jsc = new JavaSparkContext(sparkContext) val basePath = new Path(path.get) val instantTime = HoodieActiveTimeline.createNewInstantTime() @@ -340,6 +333,12 @@ private[hudi] object HoodieSparkSqlWriter { if (operation != WriteOperationType.DELETE) { if (mode == SaveMode.ErrorIfExists && tableExists) { throw new HoodieException(s"hoodie table at $tablePath already exists.") + } else if (mode == SaveMode.Overwrite && tableExists && operation != WriteOperationType.INSERT_OVERWRITE_TABLE) { +// When user set operation as INSERT_OVERWRITE_TABLE, +// overwrite will use INSERT_OVERWRITE_TABLE operator in doWriteOperation +log.warn(s"hoodie table at $tablePath already exists. 
Deleting existing data & overwriting with new data.") +fs.delete(tablePath, true) +tableExists = false } } else { // Delete Operation only supports Append mode diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala index 730b7d2..b15a7d4 100644 --- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala +++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala @@ -202,7 +202,7 @@ class TestCOWDataSource extends HoodieClientTestBase { val inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2, 2)) inputDF2.write.format("org.apache.hudi") .options(commonOpts) - .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL) + .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL) .mode(SaveMode.Overwrite) .save(basePath) @@ -229,6 +229,7 @@ class TestCOWDataSource extends HoodieClientTestBase { val inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2, 2)) inputDF2.write.format("org.apache.hudi") .options(commonOpts) + .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL) .mode(SaveMode.Overwrite) .save(basePath) diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala index 121957e..1ea6ceb 100644 --- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala +++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala @@ -278,7 +278,6 @@ class TestMORDataSource extends HoodieClientTestBase { val inputDF5: Dataset[Row] = spark.read.json(spark.sparkContext.parallelize(records5, 2)) inputDF5.write.format("org.apache.hudi") .options(commonOpts) - .option("hoodie.compact.inline", "true") .mode(SaveMode.Append) .save(basePath) val commit5Time =
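As a usage note on the commit above, here is a hedged sketch of how a job now opts into the insert-overwrite behavior explicitly. It assumes the record key, precombine field, table name, and similar options are carried in `hudiOpts`, and that the plain string values behind `OPERATION_OPT_KEY` / `INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL` are as shown:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class InsertOverwriteTableWriteSketch {

  public static void write(Dataset<Row> df, java.util.Map<String, String> hudiOpts, String basePath) {
    df.write().format("hudi")
        .options(hudiOpts)
        // without this option, SaveMode.Overwrite now deletes the existing table path
        // and rewrites it, per the commit above
        .option("hoodie.datasource.write.operation", "insert_overwrite_table")
        .mode(SaveMode.Overwrite)
        .save(basePath);
  }
}
```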
[GitHub] [hudi] vinothchandar commented on a change in pull request #2428: [HUDI-1520] add configure for spark sql overwrite use INSERT_OVERWRIT_TABLE
vinothchandar commented on a change in pull request #2428: URL: https://github.com/apache/hudi/pull/2428#discussion_r555203377 ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala ## @@ -340,6 +333,12 @@ private[hudi] object HoodieSparkSqlWriter { if (operation != WriteOperationType.DELETE) { if (mode == SaveMode.ErrorIfExists && tableExists) { throw new HoodieException(s"hoodie table at $tablePath already exists.") + } else if (mode == SaveMode.Overwrite && tableExists && operation != WriteOperationType.INSERT_OVERWRITE_TABLE) { Review comment: this is inline with https://github.com/apache/hudi/blob/b6d363c0d4a6906ef1c9ea54c0263c6b5915e0b6/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L301 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on a change in pull request #2428: [HUDI-1520] add configure for spark sql overwrite use INSERT_OVERWRIT_TABLE
vinothchandar commented on a change in pull request #2428: URL: https://github.com/apache/hudi/pull/2428#discussion_r555202292 ## File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala ## @@ -278,7 +278,6 @@ class TestMORDataSource extends HoodieClientTestBase { val inputDF5: Dataset[Row] = spark.read.json(spark.sparkContext.parallelize(records5, 2)) inputDF5.write.format("org.apache.hudi") .options(commonOpts) - .option("hoodie.compact.inline", "true") Review comment: This is great. Thanks for diggin in both @lw309637554 ! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pranotishanbhag removed a comment on issue #2414: [SUPPORT]
pranotishanbhag removed a comment on issue #2414: URL: https://github.com/apache/hudi/issues/2414#issuecomment-758042202 Hi, I tried copy_on_write with insert mode for a 4.6 TB dataset, which is failing with lost nodes (I previously tried bulk_insert, which worked fine). I tried to tweak the executor memory and also changed the config "hoodie.copyonwrite.insert.split.size" to 100k. It still failed. Later I tried to add the config "hoodie.insert.shuffle.parallelism" and set it to a higher value, and I still see failures. Can you please tell me what is wrong here? Thanks, Pranoti This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pranotishanbhag commented on issue #2414: [SUPPORT]
pranotishanbhag commented on issue #2414: URL: https://github.com/apache/hudi/issues/2414#issuecomment-758042202 Hi, I tried copy_on_write with insert mode for a 4.6 TB dataset, which is failing with lost nodes (I previously tried bulk_insert, which worked fine). I tried to tweak the executor memory and also changed the config "hoodie.copyonwrite.insert.split.size" to 100k. It still failed. Later I tried to add the config "hoodie.insert.shuffle.parallelism" and set it to a higher value, and I still see failures. Can you please tell me what is wrong here? Thanks, Pranoti This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
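Purely for illustration, this is where the two configs mentioned in the report would typically be passed through the datasource writer; the parallelism value is a placeholder, not a recommendation for this dataset:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class CowInsertTuningSketch {

  public static void write(Dataset<Row> df, java.util.Map<String, String> hudiOpts, String basePath) {
    df.write().format("hudi")
        .options(hudiOpts) // record key, precombine field, table name, etc.
        .option("hoodie.datasource.write.operation", "insert")
        .option("hoodie.copyonwrite.insert.split.size", "100000")   // the 100k tried in the report
        .option("hoodie.insert.shuffle.parallelism", "1500")        // placeholder value
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```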
[GitHub] [hudi] loukey-lj opened a new pull request #2433: Hudi 1511
loukey-lj opened a new pull request #2433: URL: https://github.com/apache/hudi/pull/2433 InstantGenerateOperator supports multiple parallelism. When the InstantGenerateOperator subtask count is greater than 1, we can make subtask 0 the main subtask; only the main subtask creates a new instant. The prerequisite for creating a new instant is that at least one subtask received data in the current checkpoint. Every subtask creates a tmp file whose name is made up of the checkpoint id, the subtask index, and the number of records received. The main subtask checks and parses every subtask's file to decide whether a new instant should be created. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
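A rough sketch of the coordination scheme this PR describes; the file-name layout and all names are illustrative, not the PR's actual code:

```java
import java.util.List;

public class InstantGenerateCoordinationSketch {

  // Each subtask records what it saw for a checkpoint in a tmp file named after the
  // checkpoint id, its subtask index, and the number of records it received, e.g. "5_3_120".
  static String markerFileName(long checkpointId, int subtaskIndex, long recordsReceived) {
    return checkpointId + "_" + subtaskIndex + "_" + recordsReceived;
  }

  // Subtask 0 (the "main" subtask) parses the markers for the current checkpoint and
  // only creates a new instant if at least one subtask actually received data.
  static boolean shouldCreateNewInstant(List<String> markerFileNames, long checkpointId) {
    return markerFileNames.stream()
        .map(name -> name.split("_"))
        .filter(parts -> Long.parseLong(parts[0]) == checkpointId)
        .anyMatch(parts -> Long.parseLong(parts[2]) > 0);
  }
}
```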
[GitHub] [hudi] garyli1019 commented on a change in pull request #2378: [HUDI-1491] Support partition pruning for MOR snapshot query
garyli1019 commented on a change in pull request #2378: URL: https://github.com/apache/hudi/pull/2378#discussion_r555064333 ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/MergeOnReadSnapshotRelation.scala ## @@ -108,7 +111,7 @@ class MergeOnReadSnapshotRelation(val sqlContext: SQLContext, dataSchema = tableStructSchema, partitionSchema = StructType(Nil), Review comment: hi @yui2010, what does your dataset look like? Does it have a `dt` column? The partitioning I am referring to is this: when you `spark.read.format('hudi').load(basePath)` and your dataset folder structure looks like `basePath/dt=20201010`, Spark is able to append a `dt` column to your dataset. When you then do something like `df.filter(dt=20201010)`, Spark will go only to that partition and read its files. What is your workflow for loading your data and passing the partition information to Spark? To get more insight into this implementation, would you write a test to demo the partition pruning? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
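To make the behavior described in that comment concrete, a small hedged example of the read path being discussed, assuming a folder layout like `basePath/dt=20201010/...`; the names are placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionPruningReadSketch {

  public static Dataset<Row> readOnePartition(SparkSession spark, String basePath) {
    Dataset<Row> df = spark.read().format("hudi").load(basePath);
    // if a `dt` partition column is available, this filter lets Spark prune the read
    // down to the dt=20201010 folder instead of scanning every partition
    return df.filter("dt = '20201010'");
  }
}
```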
[GitHub] [hudi] yui2010 commented on a change in pull request #2427: [HUDI-1519] Improve minKey/maxKey compute in HoodieHFileWriter
yui2010 commented on a change in pull request #2427: URL: https://github.com/apache/hudi/pull/2427#discussion_r555027450 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileWriter.java ## @@ -121,17 +121,10 @@ public void writeAvro(String recordKey, IndexedRecord object) throws IOException if (hfileConfig.useBloomFilter()) { hfileConfig.getBloomFilter().add(recordKey); - if (minRecordKey != null) { -minRecordKey = minRecordKey.compareTo(recordKey) <= 0 ? minRecordKey : recordKey; - } else { + if (minRecordKey == null) { Review comment: Hi @vinothchandar, thanks for reviewing. It is not computing the min/max; it simply uses the first recordKey and the last recordKey as the min/max (HoodieSortedMergeHandle/BaseSparkCommitActionExecutor already produce the input records in order by recordKey). It is similar to how HBase stores the keyRange (firstKey/lastKey), see [https://github.com/apache/hbase/blob/rel/1.2.3/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L292](https://github.com/apache/hbase/blob/rel/1.2.3/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L292). Actually, we can also get the min/max recordKey natively from the HFile through `HFileReaderV3#getFirstKey()` and `HFileReaderV3#getLastRowKey()` in the load-on-open section, but I think we can keep the current implementation (put min/max in the FileInfo map). Maybe we will add more properties later, for example a recordCount so we can choose between seekTo and loadAll in `HoodieHFileReader#filterRowKeys`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
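A small sketch of the first/last-key tracking this comment describes, assuming records reach the writer already sorted by recordKey as stated above; the class and method names are illustrative:

```java
public class HFileKeyRangeSketch {

  private String minRecordKey;
  private String maxRecordKey;

  // Because records arrive in sorted order, the first key written is the min and the
  // most recently written key is the max, so no per-record compareTo is needed.
  void trackKeyRange(String recordKey) {
    if (minRecordKey == null) {
      minRecordKey = recordKey;   // first key written
    }
    maxRecordKey = recordKey;     // last key written so far
  }
}
```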
[GitHub] [hudi] codecov-io commented on pull request #2431: [HUDI-2431]translate the api partitionBy to hoodie.datasource.write.partitionpath.field
codecov-io commented on pull request #2431: URL: https://github.com/apache/hudi/pull/2431#issuecomment-757929313 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2431?src=pr=h1) Report > Merging [#2431](https://codecov.io/gh/apache/hudi/pull/2431?src=pr=desc) (fa597aa) into [master](https://codecov.io/gh/apache/hudi/commit/7ce3ac778eb475bf23ffa31243dc0843ec7d089a?el=desc) (7ce3ac7) will **decrease** coverage by `41.08%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2431/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2431?src=pr=tree) ```diff @@ Coverage Diff @@ ## master #2431 +/- ## - Coverage 50.76% 9.68% -41.09% + Complexity 3063 48 -3015 Files 419 53 -366 Lines 187771930-16847 Branches 1918 230 -1688 - Hits 9533 187 -9346 + Misses 84681730 -6738 + Partials776 13 -763 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudicli | `?` | `?` | | | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudicommon | `?` | `?` | | | hudiflink | `?` | `?` | | | hudihadoopmr | `?` | `?` | | | hudisparkdatasource | `?` | `?` | | | hudisync | `?` | `?` | | | huditimelineservice | `?` | `?` | | | hudiutilities | `9.68% <ø> (-59.75%)` | `0.00 <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2431?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | 
`0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | | | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
[GitHub] [hudi] quitozang opened a new issue #2432: [SUPPORT] write hudi data failed when using Deltastreamer
quitozang opened a new issue #2432: URL: https://github.com/apache/hudi/issues/2432 When i write hudi data using DeltaStreamer, sometimes will get this error below **Environment Description** * Hudi version : 0.6.0 * Spark version : 2.4.4 * Hive version : 2.1.1 * Hadoop version : 3.2.1 * Storage (HDFS/S3/GCS..) : HDFS * Running on Docker? (yes/no) : no **Additional context** 2021-01-11 19:29:01,966 [pool-19-thread-1] INFO io.javalin.Javalin - Starting Javalin ... 2021-01-11 19:29:01,975 [pool-19-thread-1] ERROR io.javalin.Javalin - Failed to start Javalin 2021-01-11 19:29:01,975 [pool-19-thread-1] ERROR org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer - Shutting down delta-sync due to exception java.lang.RuntimeException: Port already in use. Make sure no other process is using port 0 and try again. at io.javalin.Javalin.start(Javalin.java:157) at io.javalin.Javalin.start(Javalin.java:119) at org.apache.hudi.timeline.service.TimelineService.startService(TimelineService.java:106) at org.apache.hudi.client.embedded.EmbeddedTimelineService.startServer(EmbeddedTimelineService.java:74) at org.apache.hudi.client.AbstractHoodieClient.startEmbeddedServerView(AbstractHoodieClient.java:105) at org.apache.hudi.client.AbstractHoodieClient.(AbstractHoodieClient.java:72) at org.apache.hudi.client.AbstractHoodieWriteClient.(AbstractHoodieWriteClient.java:81) at org.apache.hudi.client.HoodieWriteClient.(HoodieWriteClient.java:121) at org.apache.hudi.client.HoodieWriteClient.(HoodieWriteClient.java:108) at org.apache.hudi.client.HoodieWriteClient.(HoodieWriteClient.java:104) at org.apache.hudi.utilities.deltastreamer.DeltaSync.setupWriteClientAddColumn(DeltaSync.java:673) at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:309) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:609) at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) Caused by: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:0 at org.apache.hudi.org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:346) at org.apache.hudi.org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:308) at org.apache.hudi.org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80) at org.apache.hudi.org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:236) at org.apache.hudi.org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) at org.apache.hudi.org.eclipse.jetty.server.Server.doStart(Server.java:394) at org.apache.hudi.org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) at io.javalin.core.util.JettyServerUtil.initialize(JettyServerUtil.kt:110) at io.javalin.Javalin.start(Javalin.java:140) ... 
16 more Caused by: java.net.BindException: Address already in use at java.base/sun.nio.ch.Net.bind0(Native Method) at java.base/sun.nio.ch.Net.bind(Net.java:461) at java.base/sun.nio.ch.Net.bind(Net.java:453) at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227) at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80) at org.apache.hudi.org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:342) ... 24 more 2021-01-11 19:29:01,976 [pool-19-thread-1] INFO org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer - Delta Sync shutdown. Error ?true 2021-01-11 19:29:01,978 [Monitor Thread] ERROR org.apache.hudi.async.AbstractAsyncService - Monitor noticed one or more threads failed. Requesting graceful shutdown of other threads java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: java.lang.RuntimeException: Port already in use. Make sure no other process is using port 0 and try again. at io.javalin.Javalin.start(Javalin.java:157) at io.javalin.Javalin.start(Javalin.java:119) at org.apache.hudi.timeline.service.TimelineService.startService(TimelineService.java:106) at org.apache.hudi.client.embedded.EmbeddedTimelineService.startServer(EmbeddedTimelineService.java:74)
[GitHub] [hudi] teeyog opened a new pull request #2431: translate the api partitionBy to `hoodie.dat…
teeyog opened a new pull request #2431: URL: https://github.com/apache/hudi/pull/2431 …asource.write.partitionpath.field` ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
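Based only on the PR title, a hypothetical before/after view of what this change would mean for users; the option key comes from the title, while `dt` and the class name are made up for illustration:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class PartitionByTranslationSketch {

  // Today the partition path has to be spelled out as a Hudi option explicitly.
  public static void writeWithExplicitOption(Dataset<Row> df, String basePath) {
    df.write().format("hudi")
        .option("hoodie.datasource.write.partitionpath.field", "dt")
        .mode(SaveMode.Append)
        .save(basePath);
  }

  // With the change the title describes, partitionBy("dt") would be translated to
  // that same option under the hood.
  public static void writeWithPartitionBy(Dataset<Row> df, String basePath) {
    df.write().format("hudi")
        .partitionBy("dt")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```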
[GitHub] [hudi] codecov-io edited a comment on pull request #2424: [HUDI-1509]: Reverting LinkedHashSet changes to fix performance degradation for large schemas
codecov-io edited a comment on pull request #2424: URL: https://github.com/apache/hudi/pull/2424#issuecomment-757403445 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2424?src=pr=h1) Report > Merging [#2424](https://codecov.io/gh/apache/hudi/pull/2424?src=pr=desc) (37126a3) into [master](https://codecov.io/gh/apache/hudi/commit/368c1a8f5c36d06ed49706b4afde4a83073a9011?el=desc) (368c1a8) will **decrease** coverage by `40.84%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2424/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2424?src=pr=tree) ```diff @@ Coverage Diff @@ ## master #2424 +/- ## - Coverage 50.53% 9.68% -40.85% + Complexity 3032 48 -2984 Files 417 53 -364 Lines 187271930-16797 Branches 1917 230 -1687 - Hits 9463 187 -9276 + Misses 84891730 -6759 + Partials775 13 -762 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudicli | `?` | `?` | | | hudiclient | `?` | `?` | | | hudicommon | `?` | `?` | | | hudiflink | `?` | `?` | | | hudihadoopmr | `?` | `?` | | | hudisparkdatasource | `?` | `?` | | | hudisync | `?` | `?` | | | huditimelineservice | `?` | `?` | | | hudiutilities | `9.68% <ø> (-59.73%)` | `0.00 <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2424?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> 
(-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | | | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2424/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%>
[GitHub] [hudi] codecov-io edited a comment on pull request #2430: [HUDI-1522] Remove the single parallelism operator from the Flink writer
codecov-io edited a comment on pull request #2430: URL: https://github.com/apache/hudi/pull/2430#issuecomment-757736411 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=h1) Report > Merging [#2430](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=desc) (7961488) into [master](https://codecov.io/gh/apache/hudi/commit/7ce3ac778eb475bf23ffa31243dc0843ec7d089a?el=desc) (7ce3ac7) will **increase** coverage by `0.81%`. > The diff coverage is `59.79%`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2430/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=tree) ```diff @@ Coverage Diff @@ ## master#2430 +/- ## + Coverage 50.76% 51.58% +0.81% - Complexity 3063 3109 +46 Files 419 418 -1 Lines 1877718975 +198 Branches 1918 1931 +13 + Hits 9533 9789 +256 + Misses 8468 8389 -79 - Partials776 797 +21 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudicli | `37.26% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudicommon | `52.05% <33.33%> (-0.03%)` | `0.00 <2.00> (ø)` | | | hudiflink | `53.98% <59.95%> (+43.78%)` | `0.00 <56.00> (ø)` | | | hudihadoopmr | `33.06% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudisparkdatasource | `66.07% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudisync | `48.61% <ø> (ø)` | `0.00 <ø> (ø)` | | | huditimelineservice | `66.84% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudiutilities | `69.48% <ø> (+0.05%)` | `0.00 <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvYXZyby9Ib29kaWVBdnJvVXRpbHMuamF2YQ==) | `57.07% <ø> (-0.21%)` | `41.00 <0.00> (ø)` | | | [...main/java/org/apache/hudi/HoodieFlinkStreamer.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9Ib29kaWVGbGlua1N0cmVhbWVyLmphdmE=) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | | | [.../org/apache/hudi/operator/StreamWriteOperator.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZU9wZXJhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | | | [...ache/hudi/operator/StreamWriteOperatorFactory.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZU9wZXJhdG9yRmFjdG9yeS5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | | | [...i/common/model/OverwriteWithLatestAvroPayload.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL092ZXJ3cml0ZVdpdGhMYXRlc3RBdnJvUGF5bG9hZC5qYXZh) | `60.00% <33.33%> (-4.71%)` | `10.00 <2.00> (ø)` | | | [...c/main/java/org/apache/hudi/util/StreamerUtil.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS91dGlsL1N0cmVhbWVyVXRpbC5qYXZh) | `35.89% <47.36%> (+24.26%)` | `9.00 <8.00> (+6.00)` | | | 
[.../hudi/operator/StreamWriteOperatorCoordinator.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZU9wZXJhdG9yQ29vcmRpbmF0b3IuamF2YQ==) | `63.52% <63.52%> (ø)` | `25.00 <25.00> (?)` | | | [.../org/apache/hudi/operator/StreamWriteFunction.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZUZ1bmN0aW9uLmphdmE=) | `68.67% <68.67%> (ø)` | `14.00 <14.00> (?)` | | | [...n/java/org/apache/hudi/operator/HoodieOptions.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9Ib29kaWVPcHRpb25zLmphdmE=) | `74.35% <74.35%> (ø)` | `3.00 <3.00> (?)` | | |
[GitHub] [hudi] codecov-io edited a comment on pull request #2430: [HUDI-1522] Remove the single parallelism operator from the Flink writer
codecov-io edited a comment on pull request #2430: URL: https://github.com/apache/hudi/pull/2430#issuecomment-757736411 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=h1) Report > Merging [#2430](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=desc) (7961488) into [master](https://codecov.io/gh/apache/hudi/commit/7ce3ac778eb475bf23ffa31243dc0843ec7d089a?el=desc) (7ce3ac7) will **decrease** coverage by `6.54%`. > The diff coverage is `59.79%`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2430/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=tree) ```diff @@ Coverage Diff @@ ## master#2430 +/- ## - Coverage 50.76% 44.22% -6.55% + Complexity 3063 2265 -798 Files 419 329 -90 Lines 1877714395-4382 Branches 1918 1379 -539 - Hits 9533 6366-3167 + Misses 8468 7559 -909 + Partials776 470 -306 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudicli | `37.26% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudicommon | `52.05% <33.33%> (-0.03%)` | `0.00 <2.00> (ø)` | | | hudiflink | `53.98% <59.95%> (+43.78%)` | `0.00 <56.00> (ø)` | | | hudihadoopmr | `?` | `?` | | | hudisparkdatasource | `?` | `?` | | | hudisync | `?` | `?` | | | huditimelineservice | `?` | `?` | | | hudiutilities | `9.68% <ø> (-59.75%)` | `0.00 <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvYXZyby9Ib29kaWVBdnJvVXRpbHMuamF2YQ==) | `57.07% <ø> (-0.21%)` | `41.00 <0.00> (ø)` | | | [...main/java/org/apache/hudi/HoodieFlinkStreamer.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9Ib29kaWVGbGlua1N0cmVhbWVyLmphdmE=) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | | | [.../org/apache/hudi/operator/StreamWriteOperator.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZU9wZXJhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | | | [...ache/hudi/operator/StreamWriteOperatorFactory.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZU9wZXJhdG9yRmFjdG9yeS5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | | | [...i/common/model/OverwriteWithLatestAvroPayload.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL092ZXJ3cml0ZVdpdGhMYXRlc3RBdnJvUGF5bG9hZC5qYXZh) | `60.00% <33.33%> (-4.71%)` | `10.00 <2.00> (ø)` | | | [...c/main/java/org/apache/hudi/util/StreamerUtil.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS91dGlsL1N0cmVhbWVyVXRpbC5qYXZh) | `35.89% <47.36%> (+24.26%)` | `9.00 <8.00> (+6.00)` | | | 
[.../hudi/operator/StreamWriteOperatorCoordinator.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZU9wZXJhdG9yQ29vcmRpbmF0b3IuamF2YQ==) | `63.52% <63.52%> (ø)` | `25.00 <25.00> (?)` | | | [.../org/apache/hudi/operator/StreamWriteFunction.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9TdHJlYW1Xcml0ZUZ1bmN0aW9uLmphdmE=) | `68.67% <68.67%> (ø)` | `14.00 <14.00> (?)` | | | [...n/java/org/apache/hudi/operator/HoodieOptions.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9Ib29kaWVPcHRpb25zLmphdmE=) | `74.35% <74.35%> (ø)` | `3.00 <3.00> (?)` | | | [...he/hudi/operator/event/BatchWriteSuccessEvent.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9ldmVudC9CYXRjaFdyaXRlU3VjY2Vzc0V2ZW50LmphdmE=) | `100.00% <100.00%> (ø)` | `4.00 <4.00> (?)` | | | ... and
[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Remove the single parallelism operator from the Flink writer
danny0405 commented on a change in pull request #2430: URL: https://github.com/apache/hudi/pull/2430#discussion_r554904669 ## File path: hudi-flink/src/main/java/org/apache/hudi/HoodieFlinkStreamer.java ## @@ -160,6 +156,19 @@ public static void main(String[] args) throws Exception { + "(using the CLI parameter \"--props\") can also be passed command line using this parameter.") public List configs = new ArrayList<>(); +@Parameter(names = {"--record-key-field"}, description = "Record key field. Value to be used as the `recordKey` component of `HoodieKey`.\n" ++ "Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using " ++ "the dot notation eg: `a.b.c`. By default `uuid`.") +public String recordKeyField = "uuid"; + +@Parameter(names = {"--partition-path-field"}, description = "Partition path field. Value to be used at \n" ++ "the `partitionPath` component of `HoodieKey`. Actual value obtained by invoking .toString(). By default `partitionpath`.") +public String partitionPathField = "partitionpath"; + +@Parameter(names = {"--partition-path-field"}, description = "Key generator class, that implements will extract the key out of incoming record.\n" Review comment: Oops, thanks for the reminder. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codecov-io commented on pull request #2430: [HUDI-1522] Remove the single parallelism operator from the Flink writer
codecov-io commented on pull request #2430: URL: https://github.com/apache/hudi/pull/2430#issuecomment-757736411 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=h1) Report > Merging [#2430](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=desc) (7961488) into [master](https://codecov.io/gh/apache/hudi/commit/7ce3ac778eb475bf23ffa31243dc0843ec7d089a?el=desc) (7ce3ac7) will **decrease** coverage by `41.08%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2430/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=tree) ```diff @@ Coverage Diff @@ ## master #2430 +/- ## - Coverage 50.76% 9.68% -41.09% + Complexity 3063 48 -3015 Files 419 53 -366 Lines 187771930-16847 Branches 1918 230 -1688 - Hits 9533 187 -9346 + Misses 84681730 -6738 + Partials776 13 -763 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudicli | `?` | `?` | | | hudiclient | `?` | `?` | | | hudicommon | `?` | `?` | | | hudiflink | `?` | `?` | | | hudihadoopmr | `?` | `?` | | | hudisparkdatasource | `?` | `?` | | | hudisync | `?` | `?` | | | huditimelineservice | `?` | `?` | | | hudiutilities | `9.68% <ø> (-59.75%)` | `0.00 <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2430?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> 
(-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | | | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%>