[jira] [Updated] (HUDI-5624) Fix HoodieAvroRecordMerger to use new precombine API
[ https://issues.apache.org/jira/browse/HUDI-5624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Scheller updated HUDI-5624:
-----------------------------------
Description:
HoodieAvroRecordMerger, which exists to provide backwards compatibility for "precombine" and "combineAndGetUpdateValue", does not use the updated precombine API and currently uses the deprecated one. This causes issues for certain payloads.

https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecordMerger.java#L75

Deprecated precombine API:
https://github.com/apache/hudi/blob/45da30dc3ecfd4e7315d8f33d95504a5ac7cbd1a/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java#L44

> Key: HUDI-5624
> URL: https://issues.apache.org/jira/browse/HUDI-5624
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Brandon Scheller
> Priority: Major

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5624) Fix HoodieAvroRecordMerger to use new precombine API
Brandon Scheller created HUDI-5624:
-----------------------------------
Summary: Fix HoodieAvroRecordMerger to use new precombine API
Key: HUDI-5624
URL: https://issues.apache.org/jira/browse/HUDI-5624
Project: Apache Hudi
Issue Type: Bug
Reporter: Brandon Scheller

HoodieAvroRecordMerger, which exists to provide backwards compatibility for "precombine" and "combineAndGetUpdateValue", does not use the updated precombine API and currently uses the deprecated one. This causes issues for certain payloads.

https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecordMerger.java#L75
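The mismatch between the two preCombine overloads can be illustrated with a simplified sketch. These are hypothetical interfaces, not Hudi's actual classes: the point is that a payload overriding only the newer Properties-aware overload loses its ordering logic when a merger still calls the deprecated single-argument method, because the default bridge falls back to "this record wins".

```java
import java.util.Properties;

// Simplified sketch (hypothetical, not Hudi's real API) of a payload interface
// with a deprecated precombine method and a newer Properties-aware overload.
interface Payload {
    // Deprecated API: no access to write-time properties.
    @Deprecated
    default Payload preCombine(Payload older) {
        return this;
    }

    // Newer API: payloads may consult properties (e.g. an ordering field)
    // to decide which record wins. The default bridges to the old behavior.
    default Payload preCombine(Payload older, Properties props) {
        return preCombine(older);
    }
}

final class OrderedPayload implements Payload {
    final long orderingValue;

    OrderedPayload(long orderingValue) {
        this.orderingValue = orderingValue;
    }

    // Only the new overload is overridden; a merger that still calls the
    // deprecated one silently skips this ordering-based resolution.
    @Override
    public Payload preCombine(Payload older, Properties props) {
        OrderedPayload o = (OrderedPayload) older;
        return this.orderingValue >= o.orderingValue ? this : o;
    }
}

public class PreCombineSketch {
    public static void main(String[] args) {
        OrderedPayload newer = new OrderedPayload(1);
        OrderedPayload older = new OrderedPayload(5);
        // Deprecated path ignores the ordering value: "newer" always wins.
        System.out.println(newer.preCombine(older) == newer); // true
        // New path respects the ordering value: "older" (5) wins.
        Payload winner = newer.preCombine(older, new Properties());
        System.out.println(((OrderedPayload) winner).orderingValue); // 5
    }
}
```

The two paths return different winners for the same inputs, which is the class of bug the issue describes for payloads that depend on the newer API.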
[jira] [Updated] (HUDI-5526) Hive "Count" queries don't work with bootstrap tables w/Hive3
[ https://issues.apache.org/jira/browse/HUDI-5526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Scheller updated HUDI-5526:
-----------------------------------
Description:
Hive "count" queries fail on Hudi bootstrap tables when they are using Hive3. This has been tested on all EMR-6.x releases and fails with the same error. The same query works with Hive2.

For example, the query:
{code:java}
SELECT COUNT(*) FROM HUDI_BOOTSTRAP_TABLE;
{code}
gives the following error:
{code:java}
TaskAttempt 1 failed, info=[Error: Error while running task ( failure ) : attempt_1672881902089_0008_1_00_00_1:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.lang.RuntimeException: java.io.IOException: cannot find dir = [s3://my-bucket/test-data/hudi/parquet-source-tables/hive_style_partitioned_tb/event_type=two/part-0-98fb0380-374c-40f5-8a57-89d95270a2c3-c000.parquet] in pathToPartitionInfo: [[s3://my-bucket/hudi-table/test_bootstrap_hive_partitionedrt/event_type=one], [s3://my-bucket/hudi-table/test_bootstrap_hive_partitionedrt/event_type=two]]
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: java.io.IOException: java.lang.RuntimeException: java.io.IOException: cannot find dir = [s3://my-bucket/test-data/hudi/parquet-source-tables/hive_style_partitioned_tb/event_type=two/part-0-98fb0380-374c-40f5-8a57-89d95270a2c3-c000.parquet] in pathToPartitionInfo: [[s3://my-bucket/hudi-table/test_bootstrap_hive_partitionedrt/event_type=one], [s3://my-bucket/hudi-table/test_bootstrap_hive_partitionedrt/event_type=two]]
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:145)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:111)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:157)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:83)
    at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:703)
    at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:662)
    at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:150)
    at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:114)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:525)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:171)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:266)
    ... 14 more
Caused by: java.io.IOException: java.lang.RuntimeException: java.io.IOException: cannot find dir = [s3://my-bucket/test-data/hudi/parquet-source-tables/hive_style_partitioned_tb/event_type=two/part-0-98fb0380-374c-40f5-8a57-89d95270a2c3-c000.parquet] in pathToPartitionInfo: [[s3://my-bucket/hudi-table/test_bootstrap_hive_partitionedrt/event_type=one], [s3://my-bucket/hudi-table/test_bootstrap_hive_partitionedrt/event_type=two]]
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:421)
    at
{code}
[jira] [Created] (HUDI-5526) Hive "Count" queries don't work with bootstrap tables w/Hive3
Brandon Scheller created HUDI-5526:
-----------------------------------
Summary: Hive "Count" queries don't work with bootstrap tables w/Hive3
Key: HUDI-5526
URL: https://issues.apache.org/jira/browse/HUDI-5526
Project: Apache Hudi
Issue Type: Bug
Reporter: Brandon Scheller

Hive "count" queries fail on Hudi bootstrap tables when they are using Hive3. This has been tested on all EMR-6.x releases and fails with the same error. The same query works with Hive2.

For example, the query:
{code:java}
SELECT COUNT(*) FROM HUDI_BOOTSTRAP_TABLE;
{code}
gives the following error:
{code:java}
TaskAttempt 1 failed, info=[Error: Error while running task ( failure ) : attempt_1672881902089_0008_1_00_00_1:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.lang.RuntimeException: java.io.IOException: cannot find dir = [s3://yxchang-emr-dev/test-data/hudi/parquet-source-tables/hive_style_partitioned_tb/event_type=two/part-0-98fb0380-374c-40f5-8a57-89d95270a2c3-c000.parquet] in pathToPartitionInfo: [[s3://bschelle-emr-dev2/hudi-table/test_bootstrap_hive_partitionedrt/event_type=one], [s3://bschelle-emr-dev2/hudi-table/test_bootstrap_hive_partitionedrt/event_type=two]]
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: java.io.IOException: java.lang.RuntimeException: java.io.IOException: cannot find dir = [s3://yxchang-emr-dev/test-data/hudi/parquet-source-tables/hive_style_partitioned_tb/event_type=two/part-0-98fb0380-374c-40f5-8a57-89d95270a2c3-c000.parquet] in pathToPartitionInfo: [[s3://bschelle-emr-dev2/hudi-table/test_bootstrap_hive_partitionedrt/event_type=one], [s3://bschelle-emr-dev2/hudi-table/test_bootstrap_hive_partitionedrt/event_type=two]]
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:145)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:111)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:157)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:83)
    at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:703)
    at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:662)
    at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:150)
    at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:114)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:525)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:171)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:266)
    ... 14 more
Caused by: java.io.IOException: java.lang.RuntimeException: java.io.IOException: cannot find dir = [s3://yxchang-emr-dev/test-data/hudi/parquet-source-tables/hive_style_partitioned_tb/event_type=two/part-0-98fb0380-374c-40f5-8a57-89d95270a2c3-c000.parquet] in pathToPartitionInfo: [[s3://bschelle-emr-dev2/hudi-table/test_bootstrap_hive_partitionedrt/event_type=one], [s3://bschelle-emr-dev2/hudi-table/test_bootstrap_hive_partitionedrt/event_type=two]]
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at
{code}
[jira] [Commented] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied
[ https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175818#comment-17175818 ]

Brandon Scheller commented on HUDI-1146:
----------------------------------------
Fixed by https://github.com/apache/hudi/pull/1921

> DeltaStreamer fails to start when No updated records + schemaProvider not supplied
> Key: HUDI-1146
> URL: https://issues.apache.org/jira/browse/HUDI-1146
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration
> Reporter: Brandon Scheller
> Priority: Major

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1152) Skip hudi metadata columns during hive-sync
Brandon Scheller created HUDI-1152:
-----------------------------------
Summary: Skip hudi metadata columns during hive-sync
Key: HUDI-1152
URL: https://issues.apache.org/jira/browse/HUDI-1152
Project: Apache Hudi
Issue Type: Improvement
Reporter: Brandon Scheller

Currently, when Hudi syncs with Hive, it syncs all columns, including the Hudi metadata columns. Some users would rather skip syncing the metadata columns so their queries are cleaner. This Jira is to add an optional config to allow this.
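The effect of such a config can be sketched with a hypothetical helper (not Hudi's actual sync code; the field names listed are Hudi's standard metadata columns): given a table's columns, drop the metadata columns before registering the schema with Hive.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of the proposed optional config: filter out Hudi's
// metadata columns from the column list handed to hive-sync.
public class SkipMetaColumns {
    // Hudi's standard metadata columns, prepended to every record.
    static final Set<String> META_COLUMNS = new LinkedHashSet<>(Arrays.asList(
        "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
        "_hoodie_partition_path", "_hoodie_file_name"));

    static List<String> columnsForHiveSync(List<String> tableColumns, boolean skipMeta) {
        if (!skipMeta) {
            return tableColumns; // default behavior: sync everything
        }
        return tableColumns.stream()
            .filter(c -> !META_COLUMNS.contains(c))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList(
            "_hoodie_commit_time", "_hoodie_record_key", "id", "name");
        System.out.println(columnsForHiveSync(cols, true)); // [id, name]
    }
}
```

With the flag off the column list is untouched, so existing users see no change; with it on, queries against the synced Hive table show only the data columns.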
[jira] [Commented] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied
[ https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17171596#comment-17171596 ]

Brandon Scheller commented on HUDI-1146:
----------------------------------------
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --master yarn --deploy-mode client \
  --table-type COPY_ON_WRITE \
  --continuous \
  --enable-hive-sync \
  --min-sync-interval-seconds 60 \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
  --target-base-path s3://pathtotable/table/ \
  --target-table hudi_table \
  --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --hoodie-conf hoodie.datasource.write.recordkey.field="" \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
  --hoodie-conf hoodie.datasource.write.partitionpath.field="XXX" \
  --hoodie-conf hoodie.datasource.hive_sync.database=x \
  --hoodie-conf hoodie.datasource.hive_sync.table=xx \
  --hoodie-conf hoodie.datasource.hive_sync.partition_fields="X" \
  --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://pathtoinput/xxx/

> DeltaStreamer fails to start when No updated records + schemaProvider not supplied
> Key: HUDI-1146
> URL: https://issues.apache.org/jira/browse/HUDI-1146
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration
> Reporter: Brandon Scheller
> Priority: Major
[jira] [Updated] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied
[ https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Scheller updated HUDI-1146:
-----------------------------------
Description:
DeltaStreamer issue, happens with both COW and MOR: restarting the DeltaStreamer process crashes, that is, the 2nd run does nothing.

Steps:
Run a Hudi DeltaStreamer job in --continuous mode.
Run the same job again without deleting the output parquet files generated by the step above.

The 2nd run crashes with the below error (it does not crash if we delete the output parquet files):
{code:java}
Caused by: org.apache.hudi.exception.HoodieException: Please provide a valid schema provider class!
    at org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)
{code}

This looks to be because of this line:
https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L315

The "orElse" block here doesn't seem to make sense: if "transformed" is empty, then "dataAndCheckpoint" will likely have a null schema provider.

> Key: HUDI-1146
> URL: https://issues.apache.org/jira/browse/HUDI-1146
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration
> Reporter: Brandon Scheller
> Priority: Major
[jira] [Created] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied
Brandon Scheller created HUDI-1146:
-----------------------------------
Summary: DeltaStreamer fails to start when No updated records + schemaProvider not supplied
Key: HUDI-1146
URL: https://issues.apache.org/jira/browse/HUDI-1146
Project: Apache Hudi
Issue Type: Bug
Components: Hive Integration
Reporter: Brandon Scheller

DeltaStreamer issue, happens with both COW and MOR: restarting the DeltaStreamer process crashes, that is, the 2nd run does nothing.

Steps:
Run a Hudi DeltaStreamer job in --continuous mode.
Run the same job again without deleting the output parquet files generated by the step above.

The 2nd run crashes with the below error (it does not crash if we delete the output parquet files):
{code:java}
Caused by: org.apache.hudi.exception.HoodieException: Please provide a valid schema provider class!
    at org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)
{code}
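The reported failure mode can be sketched with plain java.util.Optional. The types below are hypothetical stand-ins, not Hudi's actual DeltaSync/InputBatch classes: the point is the suspect pattern where an empty transformed batch falls through "orElse" to a batch carrying no schema provider, and a later getSchemaProvider() call throws instead of the sync round being skipped.

```java
import java.util.Optional;

// Minimal sketch (hypothetical types) of the "orElse falls back to a batch
// with a null schema provider" pattern described in this issue.
public class OrElseSketch {
    static class SchemaProvider {}

    static class InputBatch {
        final SchemaProvider schemaProvider;

        InputBatch(SchemaProvider schemaProvider) {
            this.schemaProvider = schemaProvider;
        }

        SchemaProvider getSchemaProvider() {
            if (schemaProvider == null) {
                // Mirrors the error message from the stack trace above.
                throw new IllegalStateException("Please provide a valid schema provider class!");
            }
            return schemaProvider;
        }
    }

    static SchemaProvider readSchema(Optional<String> transformedRecords, InputBatch fallback) {
        // Suspect pattern: map over possibly-empty transformed records,
        // else fall back to a batch that may carry no schema provider.
        InputBatch batch = transformedRecords
            .map(r -> new InputBatch(new SchemaProvider()))
            .orElse(fallback);
        return batch.getSchemaProvider();
    }

    public static void main(String[] args) {
        // No new records -> empty Optional -> fallback with null provider.
        InputBatch emptyFallback = new InputBatch(null);
        try {
            readSchema(Optional.empty(), emptyFallback);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

When the Optional is empty (no updated records on restart), the fallback batch throws rather than letting the run proceed, which matches the "2nd run crashes" behavior.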
[jira] [Created] (HUDI-1040) Support Spark3 compatibility
Brandon Scheller created HUDI-1040:
-----------------------------------
Summary: Support Spark3 compatibility
Key: HUDI-1040
URL: https://issues.apache.org/jira/browse/HUDI-1040
Project: Apache Hudi
Issue Type: Improvement
Reporter: Brandon Scheller

Update Hudi's usage of Spark APIs to remove deprecated ones. Allow Hudi to be built with both Spark2 and Spark3.
[jira] [Updated] (HUDI-536) Update release notes to include KeyGenerator package changes
[ https://issues.apache.org/jira/browse/HUDI-536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Scheller updated HUDI-536:
----------------------------------
Description:
The change introduced here: https://github.com/apache/incubator-hudi/pull/1194 refactors Hudi key generators into their own package. We need to make this a backwards-compatible change or update the release notes to address it.

Specifically:
org.apache.hudi.ComplexKeyGenerator -> org.apache.hudi.keygen.ComplexKeyGenerator

> Key: HUDI-536
> URL: https://issues.apache.org/jira/browse/HUDI-536
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Reporter: Brandon Scheller
> Priority: Major
[jira] [Created] (HUDI-536) Update release notes to include KeyGenerator package changes
Brandon Scheller created HUDI-536:
----------------------------------
Summary: Update release notes to include KeyGenerator package changes
Key: HUDI-536
URL: https://issues.apache.org/jira/browse/HUDI-536
Project: Apache Hudi (incubating)
Issue Type: Bug
Reporter: Brandon Scheller

The change introduced here: https://github.com/apache/incubator-hudi/pull/1194 refactors Hudi key generators into their own package. We need to make this a backwards-compatible change or update the release notes to address it.
[jira] [Created] (HUDI-520) Decide on keyGenerator strategy for handling null/empty recordkeys
Brandon Scheller created HUDI-520:
----------------------------------
Summary: Decide on keyGenerator strategy for handling null/empty recordkeys
Key: HUDI-520
URL: https://issues.apache.org/jira/browse/HUDI-520
Project: Apache Hudi (incubating)
Issue Type: Bug
Reporter: Brandon Scheller

Currently, key-generator implementations write out "__null__" for null values and "__empty__" for empty values in order to provide a distinction between the two. This can add extra overhead to large data lakes that might not use this distinction. This Jira is to decide on a consistent strategy for handling null/empty record keys in key generators. The current strategy can be seen within ComplexKeyGenerator.
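The current behavior described above can be sketched as follows. This is a simplified stand-in for the logic in ComplexKeyGenerator, not the actual implementation: null and empty field values map to distinct sentinel strings so the two cases remain distinguishable in the record key.

```java
// Simplified sketch of the null/empty sentinel encoding described in this
// issue (not Hudi's actual ComplexKeyGenerator code).
public class RecordKeyEncoding {
    static final String NULL_PLACEHOLDER = "__null__";
    static final String EMPTY_PLACEHOLDER = "__empty__";

    static String encodeField(String value) {
        if (value == null) {
            return NULL_PLACEHOLDER;   // null stays distinguishable from empty
        }
        if (value.isEmpty()) {
            return EMPTY_PLACEHOLDER;  // empty stays distinguishable from null
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(encodeField(null));  // __null__
        System.out.println(encodeField(""));    // __empty__
        System.out.println(encodeField("abc")); // abc
    }
}
```

The overhead concern in the issue follows directly: every key field pays for these sentinel strings even in datasets that never need the null/empty distinction.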
[jira] [Commented] (HUDI-389) Updates sent to diff partition for a given key with Global Index
[ https://issues.apache.org/jira/browse/HUDI-389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992077#comment-16992077 ]

Brandon Scheller commented on HUDI-389:
---------------------------------------
[~vinoth] [~shivnarayan] I was working on a similar issue and I think this fix is a better long-term solution to what I was trying to accomplish here: https://github.com/apache/incubator-hudi/pull/1052. I'd love to see this merged, as it is needed for creating a proper delete-by-record-key-only API with a global index. I can act as a reviewer for this code.

> Updates sent to diff partition for a given key with Global Index
> Key: HUDI-389
> URL: https://issues.apache.org/jira/browse/HUDI-389
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Components: Index
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.5.1
> Original Estimate: 48h
> Time Spent: 10m
> Remaining Estimate: 47h 50m
>
> Updates sent to diff partition for a given key with Global Index should succeed by updating the record under the original partition. As of now, it throws an exception.
> https://github.com/apache/incubator-hudi/issues/1021
>
> error log:
> {code:java}
> 14738 [Executor task launch worker-0] INFO com.uber.hoodie.common.table.timeline.HoodieActiveTimeline - Loaded instants java.util.stream.ReferencePipeline$Head@d02b1c7
> 14738 [Executor task launch worker-0] INFO com.uber.hoodie.common.table.view.AbstractTableFileSystemView - Building file system view for partition (2016/04/15)
> 14738 [Executor task launch worker-0] INFO com.uber.hoodie.common.table.view.AbstractTableFileSystemView - #files found in partition (2016/04/15) =0, Time taken =0
> 14738 [Executor task launch worker-0] INFO com.uber.hoodie.common.table.view.AbstractTableFileSystemView - addFilesToView: NumFiles=0, FileGroupsCreationTime=0, StoreTimeTaken=0
> 14738 [Executor task launch worker-0] INFO com.uber.hoodie.common.table.view.HoodieTableFileSystemView - Adding file-groups for partition :2016/04/15, #FileGroups=0
> 14738 [Executor task launch worker-0] INFO com.uber.hoodie.common.table.view.AbstractTableFileSystemView - Time to load partition (2016/04/15) =0
> 14754 [Executor task launch worker-0] ERROR com.uber.hoodie.table.HoodieCopyOnWriteTable - Error upserting bucketType UPDATE for partition :0
> java.util.NoSuchElementException: No value present
> at com.uber.hoodie.common.util.Option.get(Option.java:112)
> at com.uber.hoodie.io.HoodieMergeHandle.<init>(HoodieMergeHandle.java:71)
> at com.uber.hoodie.table.HoodieCopyOnWriteTable.getUpdateHandle(HoodieCopyOnWriteTable.java:226)
> at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:180)
> at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:263)
> at com.uber.hoodie.HoodieWriteClient.lambda$upsertRecordsInternal$7ef77fd$1(HoodieWriteClient.java:442)
> at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
> at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:843)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:843)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
> at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
> at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:973)
> at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
> at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
> at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at
> {code}
[jira] [Created] (HUDI-327) Introduce "null" supporting ComplexKeyGenerator
Brandon Scheller created HUDI-327:
----------------------------------
Summary: Introduce "null" supporting ComplexKeyGenerator
Key: HUDI-327
URL: https://issues.apache.org/jira/browse/HUDI-327
Project: Apache Hudi (incubating)
Issue Type: Improvement
Reporter: Brandon Scheller

Customers have been running into issues where they would like to use a record_key from columns that can contain null values. Currently, this will cause Hudi to crash and throw a cryptic exception. (Improving error messaging is a separate but related issue.)

We would like to propose a new KeyGenerator, based on ComplexKeyGenerator, that allows for null record_keys.

At a basic level, using the key generator without any options would essentially allow a null record_key to be accepted. (It can be replaced with an empty string, null, or some predefined "null" string representation.) This comes with the negative side effect that all records with a null record_key would then be associated together.

To work around this, you would be able to specify a secondary record_key to be used in the case that the first one is null. You would specify this in the same way that you do for the ComplexKeyGenerator, as a comma-separated list of record_keys. In this case, when the first key is seen to be null, the second key will be used instead. We could support any arbitrary number of record_keys here.

While we are aware there are many alternatives to avoid using a null record_key, we believe this will act as a usability improvement so that new users are not forced to clean/update their data in order to use Hudi. We are hoping to get some feedback on the idea.
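The proposed fallback behavior can be sketched as follows. This is a hypothetical illustration of the proposal, not existing Hudi code: the configured record-key fields are tried in priority order, the first non-null value wins, and only when every candidate is null does the record fall back to a shared placeholder.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposed null-tolerant key resolution: take the
// comma-separated record_key fields in order and use the first non-null one.
public class FallbackKeySketch {
    static String resolveRecordKey(List<String> candidateValues) {
        for (String value : candidateValues) {
            if (value != null) {
                return value; // first non-null candidate wins
            }
        }
        // All candidates null: placeholder groups these records together,
        // which is the side effect the proposal calls out.
        return "__null__";
    }

    public static void main(String[] args) {
        System.out.println(resolveRecordKey(Arrays.asList(null, "backup-id-42"))); // backup-id-42
        System.out.println(resolveRecordKey(Arrays.asList(null, null)));           // __null__
    }
}
```

This keeps the common case (a usable secondary key) from collapsing into one bucket, while the all-null case degrades to the placeholder behavior described above.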
[jira] [Updated] (HUDI-327) Introduce "null" supporting KeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Scheller updated HUDI-327: -- Summary: Introduce "null" supporting KeyGenerator (was: Introduce "null" supporting ComplexKeyGenerator)
[jira] [Commented] (HUDI-326) Support deleting records with only record_key
[ https://issues.apache.org/jira/browse/HUDI-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969684#comment-16969684 ] Brandon Scheller commented on HUDI-326: --- Additionally, does anyone have some context on how difficult this implementation would be? It seems like Hudi doesn't really have any way to track its own partitions, so it looks like we'd have to scan for all the partitions within the table if we wanted to implement something like this.
[jira] [Created] (HUDI-326) Support deleting records with only record_key
Brandon Scheller created HUDI-326: - Summary: Support deleting records with only record_key Key: HUDI-326 URL: https://issues.apache.org/jira/browse/HUDI-326 Project: Apache Hudi (incubating) Issue Type: Improvement Reporter: Brandon Scheller

Currently, Hudi requires three things to issue a hard delete using EmptyHoodieRecordPayload: (record_key, partition_key, precombine_key). This means that in many real use scenarios, you are required to issue a select query to find the partition_key, and possibly the precombine_key, for a certain record before deleting it.

We would like to avoid this extra step by being allowed to issue a delete based on only the record_key of a record. This means that it would blanket delete all records with that specific record_key across all partitions.
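The proposed blanket-delete semantics can be sketched as follows. This is an illustrative model only, not Hudi's API; the class and field names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the proposed key-only delete: without a partition_path, the
 * delete has to consider every partition and remove all records whose
 * record_key matches, wherever they live.
 */
public class KeyOnlyDeleteSketch {
    static class Row {
        final String recordKey;
        final String partitionPath;
        Row(String recordKey, String partitionPath) {
            this.recordKey = recordKey;
            this.partitionPath = partitionPath;
        }
    }

    // Removes every row with the given key, regardless of partition.
    static List<Row> deleteByKey(List<Row> table, String recordKey) {
        List<Row> kept = new ArrayList<>();
        for (Row row : table) {
            if (!row.recordKey.equals(recordKey)) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Row> table = new ArrayList<>();
        table.add(new Row("k1", "type1"));
        table.add(new Row("k1", "type2")); // same key in another partition
        table.add(new Row("k2", "type1"));
        // Both "k1" rows are removed across partitions; one row remains.
        System.out.println(deleteByKey(table, "k1").size()); // 1
    }
}
```

This is exactly the "blanket delete" trade-off the issue describes: convenience at the cost of deleting every record sharing that key.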
[jira] [Commented] (HUDI-305) Presto MOR "_rt" queries only reads base parquet file
[ https://issues.apache.org/jira/browse/HUDI-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961530#comment-16961530 ] Brandon Scheller commented on HUDI-305: --- From initial investigation, it looks like this will require changes in Presto to support the combination of a custom file split and a custom record reader. A similar scenario to PrestoParquetRealtimeInputFormat, where custom splits/readers are required, is asked about here: [https://groups.google.com/forum/#!topic/presto-users/oObR8FES_zI] Looking at the code, it appears Presto cannot support the current Hudi use case without modification on the Presto side. I was wondering what approach is currently being looked at?
> Presto MOR "_rt" queries only reads base parquet file
> --
>
> Key: HUDI-305
> URL: https://issues.apache.org/jira/browse/HUDI-305
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Components: Presto Integration
> Environment: On AWS EMR
> Reporter: Brandon Scheller
> Assignee: Bhavani Sudha Saktheeswaran
> Priority: Major
> Fix For: 0.5.1
[jira] [Updated] (HUDI-305) Presto MOR "_rt" queries only reads base parquet file
[ https://issues.apache.org/jira/browse/HUDI-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Scheller updated HUDI-305: -- Environment: On AWS EMR
[jira] [Created] (HUDI-305) Presto MOR "_rt" queries only reads base parquet file
Brandon Scheller created HUDI-305: - Summary: Presto MOR "_rt" queries only reads base parquet file Key: HUDI-305 URL: https://issues.apache.org/jira/browse/HUDI-305 Project: Apache Hudi (incubating) Issue Type: Bug Reporter: Brandon Scheller

Code example to reproduce:

{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.lit
import spark.implicits._  // already in scope in spark-shell

val df = Seq(
  ("100", "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
  ("101", "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  ("104", "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  ("105", "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")

var tableName = "hudi_events_mor_1"
var tablePath = "s3://emr-users/wenningd/hudi/tables/events/" + tableName

// write hudi dataset
df.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Overwrite)
  .save(tablePath)

// update a record with event_name "event_name_123" => "event_name_changed"
val df1 = spark.read.format("org.apache.hudi").load(tablePath + "/*/*")
val df2 = df1.filter($"event_id" === "104")
val df3 = df2.withColumn("event_name", lit("event_name_changed"))

// update hudi dataset
df3.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option("hoodie.compact.inline", "false")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Append)
  .save(tablePath)
{code}

Now when querying the real-time table from Hive, we have no issue seeing the updated value:

{code:java}
hive> select event_name from hudi_events_mor_1_rt;
OK
event_name_900
event_name_changed
event_name_546
event_name_678
Time taken: 0.103 seconds, Fetched: 4 row(s)
{code}

But when querying the real-time table from Presto, we only read the base parquet file and do not see the update that should be merged in from the log file:

{code:java}
presto:default> select event_name from hudi_events_mor_1_rt;
 event_name
------------
 event_name_900
 event_name_123
 event_name_546
 event_name_678
(4 rows)
{code}

Our current understanding of this issue is that while HoodieParquetRealtimeInputFormat correctly generates the splits, the RealtimeCompactedRecordReader record reader is not used, so the query reads only the base parquet file and skips the log file.
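The stale result above follows directly from merge-on-read semantics. As a rough illustration (not Hudi's actual reader code; class and method names here are hypothetical), a realtime "_rt" view must overlay the log-file records onto the base-file records by record key, so a reader that skips the log file returns the pre-update values:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative sketch of a merge-on-read realtime view: log records
 * (keyed by record key) win over base-file records. Reading only the
 * base file, as described for Presto above, yields the stale value.
 */
public class MorReadSketch {
    static Map<String, String> realtimeView(Map<String, String> baseFile,
                                            Map<String, String> logFile) {
        Map<String, String> merged = new LinkedHashMap<>(baseFile);
        merged.putAll(logFile); // overlay: log updates replace base values
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> base = new LinkedHashMap<>();
        base.put("104", "event_name_123");          // value in the base parquet file
        Map<String, String> log = Map.of("104", "event_name_changed"); // the upsert
        System.out.println(realtimeView(base, log).get("104")); // event_name_changed
        System.out.println(base.get("104"));                    // event_name_123
    }
}
```

Hive's reader produces the merged view; the Presto path described in the issue behaves like the base-only read.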