[jira] [Updated] (HUDI-5624) Fix HoodieAvroRecordMerger to use new precombine API

2023-01-26 Thread Brandon Scheller (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Scheller updated HUDI-5624:
---
Description: 
HoodieAvroRecordMerger, which exists to provide backwards compatibility for 
"precombine" and "combineAndGetUpdateValue", does not use the updated precombine 
API and instead calls the deprecated one. This causes issues for certain 
payloads. 

[https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecordMerger.java#L75]

 

Deprecated precombine API:

https://github.com/apache/hudi/blob/45da30dc3ecfd4e7315d8f33d95504a5ac7cbd1a/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java#L44
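
For reference, a minimal sketch of the two precombine shapes involved (the 
interface name below is illustrative and the signatures are paraphrased; see the 
linked HoodieRecordPayload source for the authoritative definitions):
{code:java}
// Paraphrased sketch, not the actual Hudi source. The deprecated shape only
// sees the competing payload, while the newer shape also receives the writer
// Properties (e.g. the configured ordering/precombine field). A merger that
// still routes through the single-argument method breaks payloads that rely
// on those Properties.
public interface PrecombineShapes<T> {

  /** Deprecated shape: the payload only sees the competing value. */
  @Deprecated
  T preCombine(T oldValue);

  /** Newer shape: write configs are passed through as Properties. */
  default T preCombine(T oldValue, java.util.Properties properties) {
    return preCombine(oldValue); // default falls back to the legacy behaviour
  }
}
{code}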

  was:
HoodieAvroRecordMerger, which exists to provide backwards compatibility for 
"precombine" and "combineAndGetUpdateValue", does not use the updated precombine 
API and instead calls the deprecated one. This causes issues for certain 
payloads. 

https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecordMerger.java#L75


> Fix HoodieAvroRecordMerger to use new precombine API
> 
>
> Key: HUDI-5624
> URL: https://issues.apache.org/jira/browse/HUDI-5624
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Brandon Scheller
>Priority: Major
>
> HoodieAvroRecordMerger, which exists to provide backwards compatibility for 
> "precombine" and "combineAndGetUpdateValue", does not use the updated 
> precombine API and instead calls the deprecated one. This causes issues for 
> certain payloads. 
> [https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecordMerger.java#L75]
>  
> Deprecated precombine API:
> https://github.com/apache/hudi/blob/45da30dc3ecfd4e7315d8f33d95504a5ac7cbd1a/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java#L44



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5624) Fix HoodieAvroRecordMerger to use new precombine API

2023-01-26 Thread Brandon Scheller (Jira)
Brandon Scheller created HUDI-5624:
--

 Summary: Fix HoodieAvroRecordMerger to use new precombine API
 Key: HUDI-5624
 URL: https://issues.apache.org/jira/browse/HUDI-5624
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Brandon Scheller


HoodieAvroRecordMerger, which exists to provide backwards compatibility for 
"precombine" and "combineAndGetUpdateValue", does not use the updated precombine 
API and instead calls the deprecated one. This causes issues for certain 
payloads. 

https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecordMerger.java#L75



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5526) Hive "Count" queries don't work with bootstrap tables w/Hive3

2023-01-10 Thread Brandon Scheller (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Scheller updated HUDI-5526:
---
Description: 
Hive "count" queries fail on hudi bootstrap tables when they are using Hive3. 
This has been tested on all EMR-6.x releases and fails with the same error. The 
same query works with Hive2.

For example with the query:
{code:java}
SELECT COUNT(*) FROM HUDI_BOOTSTRAP_TABLE;{code}
Gives the following error:
{code:java}
TaskAttempt 1 failed, info=[Error: Error while running task ( failure ) : 
attempt_1672881902089_0008_1_00_00_1:java.lang.RuntimeException: 
java.lang.RuntimeException: java.io.IOException: java.lang.RuntimeException: 
java.io.IOException: cannot find dir = 
[s3://my-bucket/test-data/hudi/parquet-source-tables/hive_style_partitioned_tb/event_type=two/part-0-98fb0380-374c-40f5-8a57-89d95270a2c3-c000.parquet]
 in pathToPartitionInfo: [
[s3://my-bucket/hudi-table/test_bootstrap_hive_partitionedrt/event_type=one]
, 
[s3://my-bucket/hudi-table/test_bootstrap_hive_partitionedrt/event_type=two]
]
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: java.io.IOException: 
java.lang.RuntimeException: java.io.IOException: cannot find dir = 
[s3://my-bucket/test-data/hudi/parquet-source-tables/hive_style_partitioned_tb/event_type=two/part-0-98fb0380-374c-40f5-8a57-89d95270a2c3-c000.parquet]
 in pathToPartitionInfo: 
[[s3://my-bucket/hudi-table/test_bootstrap_hive_partitionedrt/event_type=one], 
[s3://my-bucket/hudi-table/test_bootstrap_hive_partitionedrt/event_type=two]]
at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.(TezGroupedSplitsInputFormat.java:145)
at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:111)
at 
org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:157)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:83)
at 
org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:703)
at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:662)
at 
org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:150)
at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:114)
at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:525)
at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:171)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:266)
... 14 more
Caused by: java.io.IOException: java.lang.RuntimeException: 
java.io.IOException: cannot find dir = 
[s3://my-bucket/test-data/hudi/parquet-source-tables/hive_style_partitioned_tb/event_type=two/part-0-98fb0380-374c-40f5-8a57-89d95270a2c3-c000.parquet]
 in pathToPartitionInfo: 
[[s3://my-bucket/hudi-table/test_bootstrap_hive_partitionedrt/event_type=one], 
[s3://my-bucket/hudi-table/test_bootstrap_hive_partitionedrt/event_type=two]]
at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at 
org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:421)
at 

[jira] [Created] (HUDI-5526) Hive "Count" queries don't work with bootstrap tables w/Hive3

2023-01-10 Thread Brandon Scheller (Jira)
Brandon Scheller created HUDI-5526:
--

 Summary: Hive "Count" queries don't work with bootstrap tables 
w/Hive3
 Key: HUDI-5526
 URL: https://issues.apache.org/jira/browse/HUDI-5526
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Brandon Scheller


Hive "count" queries fail on hudi bootstrap tables when they are using Hive3. 
This has been tested on all EMR-6.x releases and fails with the same error. The 
same query works with Hive2.


For example with the query:
{code:java}
SELECT COUNT(*) FROM HUDI_BOOTSTRAP_TABLE;{code}
Gives the following error:
{code:java}
TaskAttempt 1 failed, info=[Error: Error while running task ( failure ) : 
attempt_1672881902089_0008_1_00_00_1:java.lang.RuntimeException: 
java.lang.RuntimeException: java.io.IOException: java.lang.RuntimeException: 
java.io.IOException: cannot find dir = 
[s3://yxchang-emr-dev/test-data/hudi/parquet-source-tables/hive_style_partitioned_tb/event_type=two/part-0-98fb0380-374c-40f5-8a57-89d95270a2c3-c000.parquet]
 in pathToPartitionInfo: [
[s3://bschelle-emr-dev2/hudi-table/test_bootstrap_hive_partitionedrt/event_type=one]
, 
[s3://bschelle-emr-dev2/hudi-table/test_bootstrap_hive_partitionedrt/event_type=two]
]
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: java.io.IOException: 
java.lang.RuntimeException: java.io.IOException: cannot find dir = 
[s3://yxchang-emr-dev/test-data/hudi/parquet-source-tables/hive_style_partitioned_tb/event_type=two/part-0-98fb0380-374c-40f5-8a57-89d95270a2c3-c000.parquet]
 in pathToPartitionInfo: 
[[s3://bschelle-emr-dev2/hudi-table/test_bootstrap_hive_partitionedrt/event_type=one],
 
[s3://bschelle-emr-dev2/hudi-table/test_bootstrap_hive_partitionedrt/event_type=two]]
at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.(TezGroupedSplitsInputFormat.java:145)
at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:111)
at 
org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:157)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:83)
at 
org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:703)
at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:662)
at 
org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:150)
at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:114)
at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:525)
at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:171)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:266)
... 14 more
Caused by: java.io.IOException: java.lang.RuntimeException: 
java.io.IOException: cannot find dir = 
[s3://yxchang-emr-dev/test-data/hudi/parquet-source-tables/hive_style_partitioned_tb/event_type=two/part-0-98fb0380-374c-40f5-8a57-89d95270a2c3-c000.parquet]
 in pathToPartitionInfo: 
[[s3://bschelle-emr-dev2/hudi-table/test_bootstrap_hive_partitionedrt/event_type=one],
 
[s3://bschelle-emr-dev2/hudi-table/test_bootstrap_hive_partitionedrt/event_type=two]]
at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at 

[jira] [Commented] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied

2020-08-11 Thread Brandon Scheller (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175818#comment-17175818
 ] 

Brandon Scheller commented on HUDI-1146:


Fixed by https://github.com/apache/hudi/pull/1921

> DeltaStreamer fails to start when No updated records + schemaProvider not 
> supplied
> --
>
> Key: HUDI-1146
> URL: https://issues.apache.org/jira/browse/HUDI-1146
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Brandon Scheller
>Priority: Major
>
> DeltaStreamer issue (happens with both COW and MOR): restarting the 
> DeltaStreamer process crashes, that is, the 2nd run does nothing.
> Steps:
>  Run a Hudi DeltaStreamer job in --continuous mode
>  Run the same job again without deleting the output parquet files generated 
> by the step above
>  The 2nd run crashes with the error below (it does not crash if we delete 
> the output parquet files)
> {{Caused by: org.apache.hudi.exception.HoodieException: Please provide a 
> valid schema provider class!}}
> {{ at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}}
>  
> {{This looks to be because of this line:}}
> {{[https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L315]
>  }}
> The "orElse" block here doesn't seem to make sense as if "transformed" is 
> empty then it is likely "dataAndCheckpoint" will have a null schema provider



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1152) Skip hudi metadata columns during hive-sync

2020-08-05 Thread Brandon Scheller (Jira)
Brandon Scheller created HUDI-1152:
--

 Summary: Skip hudi metadata columns during hive-sync
 Key: HUDI-1152
 URL: https://issues.apache.org/jira/browse/HUDI-1152
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Brandon Scheller


Currently, when Hudi syncs with Hive, it syncs all columns, including the Hudi 
metadata columns. Some users would rather skip syncing the metadata columns so 
their queries are cleaner. This Jira is to add an optional config to allow this.
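
A minimal sketch of what such a config could do during hive-sync (the class and 
parameter names below are hypothetical, not the actual hive-sync code; only the 
metadata column names are standard Hudi fields):
{code:java}
// Hypothetical sketch: drop the Hudi metadata columns from the schema that is
// registered in Hive when the (proposed) skip option is enabled.
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetaColumnFilter {
  // The standard Hudi metadata columns added to every record.
  private static final List<String> HOODIE_META_COLUMNS = Arrays.asList(
      "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
      "_hoodie_partition_path", "_hoodie_file_name");

  /** Returns the Hive schema without the metadata columns when skipping is enabled. */
  public static Map<String, String> filter(Map<String, String> hiveSchema,
                                           boolean skipMetaColumns) {
    if (!skipMetaColumns) {
      return hiveSchema;
    }
    Map<String, String> filtered = new LinkedHashMap<>(hiveSchema);
    HOODIE_META_COLUMNS.forEach(filtered::remove);
    return filtered;
  }
}
{code}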



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied

2020-08-05 Thread Brandon Scheller (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17171596#comment-17171596
 ] 

Brandon Scheller commented on HUDI-1146:


spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--master yarn --deploy-mode client \
--table-type COPY_ON_WRITE \
--continuous \
--enable-hive-sync \
--min-sync-interval-seconds 60 \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
--target-base-path s3://pathtotable/table/ \
--target-table hudi_table \
--payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--hoodie-conf hoodie.datasource.write.recordkey.field="" \
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
--hoodie-conf hoodie.datasource.write.partitionpath.field="XXX" \
--hoodie-conf hoodie.datasource.hive_sync.database=x \
--hoodie-conf hoodie.datasource.hive_sync.table=xx \
--hoodie-conf hoodie.datasource.hive_sync.partition_fields="X" \
--hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
--hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://pathtoinput/xxx/

> DeltaStreamer fails to start when No updated records + schemaProvider not 
> supplied
> --
>
> Key: HUDI-1146
> URL: https://issues.apache.org/jira/browse/HUDI-1146
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Brandon Scheller
>Priority: Major
>
> DeltaStreamer issue (happens with both COW and MOR): restarting the 
> DeltaStreamer process crashes, that is, the 2nd run does nothing.
> Steps:
>  Run a Hudi DeltaStreamer job in --continuous mode
>  Run the same job again without deleting the output parquet files generated 
> by the step above
>  The 2nd run crashes with the error below (it does not crash if we delete 
> the output parquet files)
> {{Caused by: org.apache.hudi.exception.HoodieException: Please provide a 
> valid schema provider class!}}
> {{ at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}}
>  
> {{This looks to be because of this line:}}
> {{[https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L315]
>  }}
> The "orElse" block here doesn't seem to make sense as if "transformed" is 
> empty then it is likely "dataAndCheckpoint" will have a null schema provider



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied

2020-08-03 Thread Brandon Scheller (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Scheller updated HUDI-1146:
---
Description: 
DeltaStreamer issue (happens with both COW and MOR): restarting the 
DeltaStreamer process crashes, that is, the 2nd run does nothing.

Steps:
 Run a Hudi DeltaStreamer job in --continuous mode
 Run the same job again without deleting the output parquet files generated by 
the step above
 The 2nd run crashes with the error below (it does not crash if we delete the 
output parquet files)

{{Caused by: org.apache.hudi.exception.HoodieException: Please provide a valid 
schema provider class!}}
{{ at 
org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)}}
{{ at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)}}
{{ at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)}}
{{ at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}}

 

{{This looks to be because of this line:}}
{{[https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L315]
 }}

The "orElse" block here doesn't seem to make sense as if "transformed" is empty 
then it is likely "dataAndCheckpoint" will have a null schema provider
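
For context, a self-contained illustration of the orElse pitfall described above 
(plain java.util.Optional, not the actual DeltaSync code; names are illustrative):
{code:java}
// When the Optional is empty, orElse() returns the fallback value, and that
// fallback can itself be null. In the DeltaStreamer case the source returned
// no new records and no schemaProvider was supplied, so the resolved provider
// ends up null and the "Please provide a valid schema provider class!" error
// is thrown downstream.
import java.util.Optional;

public class OrElsePitfall {
  public static void main(String[] args) {
    Optional<String> transformed = Optional.empty(); // no new records from the source
    String fallbackSchemaProvider = null;            // the batch's provider is also null

    String schemaProvider = transformed
        .map(rows -> "RowBasedSchemaProvider")       // only taken when data exists
        .orElse(fallbackSchemaProvider);             // empty case: returns null

    System.out.println(schemaProvider);              // prints "null"
  }
}
{code}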

  was:
DeltaStreamer issue (happens with both COW and MOR): restarting the 
DeltaStreamer process crashes, that is, the 2nd run does nothing.

Steps:
 Run a Hudi DeltaStreamer job in --continuous mode
 Run the same job again without deleting the output parquet files generated by 
the step above
 The 2nd run crashes with the error below (it does not crash if we delete the 
output parquet files)

{{Caused by: org.apache.hudi.exception.HoodieException: Please provide a valid 
schema provider class!
at 
org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}}

 


> DeltaStreamer fails to start when No updated records + schemaProvider not 
> supplied
> --
>
> Key: HUDI-1146
> URL: https://issues.apache.org/jira/browse/HUDI-1146
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Brandon Scheller
>Priority: Major
>
> DeltaStreamer issue (happens with both COW and MOR): restarting the 
> DeltaStreamer process crashes, that is, the 2nd run does nothing.
> Steps:
>  Run a Hudi DeltaStreamer job in --continuous mode
>  Run the same job again without deleting the output parquet files generated 
> by the step above
>  The 2nd run crashes with the error below (it does not crash if we delete 
> the output parquet files)
> {{Caused by: org.apache.hudi.exception.HoodieException: Please provide a 
> valid schema provider class!}}
> {{ at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}}
>  
> {{This looks to be because of this line:}}
> {{[https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L315]
>  }}
> The "orElse" block here doesn't seem to make sense as if "transformed" is 
> empty then it is likely "dataAndCheckpoint" will have a null schema provider



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied

2020-08-03 Thread Brandon Scheller (Jira)
Brandon Scheller created HUDI-1146:
--

 Summary: DeltaStreamer fails to start when No updated records + 
schemaProvider not supplied
 Key: HUDI-1146
 URL: https://issues.apache.org/jira/browse/HUDI-1146
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Reporter: Brandon Scheller


DeltaStreamer issue (happens with both COW and MOR): restarting the 
DeltaStreamer process crashes, that is, the 2nd run does nothing.

Steps:
 Run a Hudi DeltaStreamer job in --continuous mode
 Run the same job again without deleting the output parquet files generated by 
the step above
 The 2nd run crashes with the error below (it does not crash if we delete the 
output parquet files)

{{Caused by: org.apache.hudi.exception.HoodieException: Please provide a valid 
schema provider class!
at 
org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}}

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1040) Support Spark3 compatibility

2020-06-23 Thread Brandon Scheller (Jira)
Brandon Scheller created HUDI-1040:
--

 Summary: Support Spark3 compatibility
 Key: HUDI-1040
 URL: https://issues.apache.org/jira/browse/HUDI-1040
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Brandon Scheller


Update Hudi's usage of Spark APIs to remove deprecated Spark APIs.

Allow Hudi to be built with both Spark2 and Spark3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-536) Update release notes to include KeyGenerator package changes

2020-01-14 Thread Brandon Scheller (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Scheller updated HUDI-536:
--
Description: 
The change introduced here:
 [https://github.com/apache/incubator-hudi/pull/1194]

Refactors hudi keygenerators into their own package.

We need to make this a backwards compatible change or update the release notes 
to address this.

Specifically:

org.apache.hudi.ComplexKeyGenerator -> 
org.apache.hudi.keygen.ComplexKeyGenerator
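
For example, writer configs that referenced the old package need to be updated. 
A small sketch (the config key is the one used in the DeltaStreamer commands in 
these threads; only the class name's package changes):
{code:java}
// Illustrative only: show the before/after value of the key generator config.
public class KeyGenConfigExample {
  public static void main(String[] args) {
    java.util.Properties props = new java.util.Properties();
    // Before the refactor:
    // props.setProperty("hoodie.datasource.write.keygenerator.class",
    //     "org.apache.hudi.ComplexKeyGenerator");
    // After the refactor:
    props.setProperty("hoodie.datasource.write.keygenerator.class",
        "org.apache.hudi.keygen.ComplexKeyGenerator");
    System.out.println(props);
  }
}
{code}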

  was:
The change introduced here:
[https://github.com/apache/incubator-hudi/pull/1194]

Refactors hudi keygenerators into their own package.

We need to make this a backwards compatible change or update the release notes 
to address this.


> Update release notes to include KeyGenerator package changes
> 
>
> Key: HUDI-536
> URL: https://issues.apache.org/jira/browse/HUDI-536
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Brandon Scheller
>Priority: Major
>
> The change introduced here:
>  [https://github.com/apache/incubator-hudi/pull/1194]
> Refactors hudi keygenerators into their own package.
> We need to make this a backwards compatible change or update the release 
> notes to address this.
> Specifically:
> org.apache.hudi.ComplexKeyGenerator -> 
> org.apache.hudi.keygen.ComplexKeyGenerator



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-536) Update release notes to include KeyGenerator package changes

2020-01-14 Thread Brandon Scheller (Jira)
Brandon Scheller created HUDI-536:
-

 Summary: Update release notes to include KeyGenerator package 
changes
 Key: HUDI-536
 URL: https://issues.apache.org/jira/browse/HUDI-536
 Project: Apache Hudi (incubating)
  Issue Type: Bug
Reporter: Brandon Scheller


The change introduced here:
[https://github.com/apache/incubator-hudi/pull/1194]

Refactors hudi keygenerators into their own package.

We need to make this a backwards compatible change or update the release notes 
to address this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-520) Decide on keyGenerator strategy for handling null/empty recordkeys

2020-01-10 Thread Brandon Scheller (Jira)
Brandon Scheller created HUDI-520:
-

 Summary: Decide on keyGenerator strategy for handling null/empty 
recordkeys 
 Key: HUDI-520
 URL: https://issues.apache.org/jira/browse/HUDI-520
 Project: Apache Hudi (incubating)
  Issue Type: Bug
Reporter: Brandon Scheller


Currently, key-generator implementations write out "__null__" for null values 
and "__empty__" for empty values in order to distinguish between the two. This 
can add extra overhead for large data lakes that might not need this 
distinction.

This Jira is to decide on a consistent strategy for handling null/empty record 
keys in key generators.

The current strategy can be seen within ComplexKeyGenerator.
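
For illustration, a minimal sketch of the kind of encoding described above (the 
helper below is illustrative, not the actual ComplexKeyGenerator code):
{code:java}
// Illustrative sketch: null and empty field values are replaced with sentinel
// strings so the two cases stay distinguishable in the record key, at the cost
// of extra bytes per key on large data lakes.
public class RecordKeyFieldEncoding {
  private static final String NULL_RECORDKEY_PLACEHOLDER = "__null__";
  private static final String EMPTY_RECORDKEY_PLACEHOLDER = "__empty__";

  static String encodeField(String fieldName, String value) {
    if (value == null) {
      return fieldName + ":" + NULL_RECORDKEY_PLACEHOLDER;
    }
    if (value.isEmpty()) {
      return fieldName + ":" + EMPTY_RECORDKEY_PLACEHOLDER;
    }
    return fieldName + ":" + value;
  }

  public static void main(String[] args) {
    System.out.println(encodeField("event_id", null));  // event_id:__null__
    System.out.println(encodeField("event_id", ""));    // event_id:__empty__
    System.out.println(encodeField("event_id", "104")); // event_id:104
  }
}
{code}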



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-389) Updates sent to diff partition for a given key with Global Index

2019-12-09 Thread Brandon Scheller (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992077#comment-16992077
 ] 

Brandon Scheller commented on HUDI-389:
---

[~vinoth] [~shivnarayan] 

I was working on a similar issue and I think this fix is a better long-term 
solution to what I was trying to accomplish here: 
[https://github.com/apache/incubator-hudi/pull/1052]. I'd love to see this 
merged, as it is needed for creating a proper delete-by-record-key-only API 
with a global index. I can act as a reviewer for this code.

> Updates sent to diff partition for a given key with Global Index 
> -
>
> Key: HUDI-389
> URL: https://issues.apache.org/jira/browse/HUDI-389
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Index
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>   Original Estimate: 48h
>  Time Spent: 10m
>  Remaining Estimate: 47h 50m
>
> Updates sent to diff partition for a given key with Global Index should 
> succeed by updating the record under original partition. As of now, it throws 
> exception. 
> [https://github.com/apache/incubator-hudi/issues/1021] 
>  
>  
> error log:
> {code:java}
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.timeline.HoodieActiveTimeline - Loaded instants 
> java.util.stream.ReferencePipeline$Head@d02b1c7
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.AbstractTableFileSystemView - Building file 
> system view for partition (2016/04/15)
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.AbstractTableFileSystemView - #files found 
> in partition (2016/04/15) =0, Time taken =0
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.AbstractTableFileSystemView - 
> addFilesToView: NumFiles=0, FileGroupsCreationTime=0, StoreTimeTaken=0
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.HoodieTableFileSystemView - Adding 
> file-groups for partition :2016/04/15, #FileGroups=0
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.AbstractTableFileSystemView - Time to load 
> partition (2016/04/15) =0
>  14754 [Executor task launch worker-0] ERROR 
> com.uber.hoodie.table.HoodieCopyOnWriteTable - Error upserting bucketType 
> UPDATE for partition :0
>  java.util.NoSuchElementException: No value present
>  at com.uber.hoodie.common.util.Option.get(Option.java:112)
>  at com.uber.hoodie.io.HoodieMergeHandle.(HoodieMergeHandle.java:71)
>  at 
> com.uber.hoodie.table.HoodieCopyOnWriteTable.getUpdateHandle(HoodieCopyOnWriteTable.java:226)
>  at 
> com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:180)
>  at 
> com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:263)
>  at 
> com.uber.hoodie.HoodieWriteClient.lambda$upsertRecordsInternal$7ef77fd$1(HoodieWriteClient.java:442)
>  at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>  at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:843)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:843)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
>  at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:973)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at 

[jira] [Created] (HUDI-327) Introduce "null" supporting ComplexKeyGenerator

2019-11-07 Thread Brandon Scheller (Jira)
Brandon Scheller created HUDI-327:
-

 Summary: Introduce "null" supporting ComplexKeyGenerator
 Key: HUDI-327
 URL: https://issues.apache.org/jira/browse/HUDI-327
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: Brandon Scheller


Customers have been running into issues where they would like to use a 
record_key from columns that can contain null values. Currently, this causes 
Hudi to crash and throw a cryptic exception. (Improving the error messaging is 
a separate but related issue.)

We would like to propose a new KeyGenerator, based on ComplexKeyGenerator, that 
allows for null record_keys.

At a basic level, using the key generator without any options would essentially 
allow a null record_key to be accepted. (It can be replaced with an empty 
string, null, or some predefined "null" string representation.)

This comes with the negative side effect that all records with a null 
record_key would then be grouped together. To work around this, you would be 
able to specify a secondary record_key to be used when the first one is null. 
You would specify this in the same way as for the ComplexKeyGenerator, as a 
comma-separated list of record_keys. In this case, when the first key is null, 
the second key is used instead. We could support any arbitrary number of 
record_keys here.
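
A rough sketch of the proposed fallback behaviour (illustrative only, not an 
existing Hudi KeyGenerator):
{code:java}
// Walk the configured record-key fields in order and use the first non-null,
// non-empty value; if every configured field is null/empty, fall back to a
// placeholder value.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NullTolerantKeyExample {
  static String resolveRecordKey(Map<String, String> row, List<String> keyFields) {
    for (String field : keyFields) {
      String value = row.get(field);
      if (value != null && !value.isEmpty()) {
        return field + ":" + value;
      }
    }
    return "__null__"; // every configured key field was null/empty
  }

  public static void main(String[] args) {
    Map<String, String> row = new HashMap<>();
    row.put("primary_id", null);
    row.put("secondary_id", "abc-123");
    // primary_id is null, so the secondary key is used instead.
    System.out.println(resolveRecordKey(row, Arrays.asList("primary_id", "secondary_id")));
  }
}
{code}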

While we are aware there are many alternatives to avoid using a null 
record_key, we believe this will act as a usability improvement so that new 
users are not forced to clean/update their data in order to use Hudi.

We are hoping to get some feedback on the idea.

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-327) Introduce "null" supporting KeyGenerator

2019-11-07 Thread Brandon Scheller (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Scheller updated HUDI-327:
--
Summary: Introduce "null" supporting KeyGenerator  (was: Introduce "null" 
supporting ComplexKeyGenerator)

> Introduce "null" supporting KeyGenerator
> 
>
> Key: HUDI-327
> URL: https://issues.apache.org/jira/browse/HUDI-327
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Brandon Scheller
>Priority: Major
>
> Customers have been running into issues where they would like to use a 
> record_key from columns that can contain null values. Currently, this causes 
> Hudi to crash and throw a cryptic exception. (Improving the error messaging 
> is a separate but related issue.)
> We would like to propose a new KeyGenerator, based on ComplexKeyGenerator, 
> that allows for null record_keys.
> At a basic level, using the key generator without any options would 
> essentially allow a null record_key to be accepted. (It can be replaced with 
> an empty string, null, or some predefined "null" string representation.)
> This comes with the negative side effect that all records with a null 
> record_key would then be grouped together. To work around this, you would be 
> able to specify a secondary record_key to be used when the first one is 
> null. You would specify this in the same way as for the ComplexKeyGenerator, 
> as a comma-separated list of record_keys. In this case, when the first key 
> is null, the second key is used instead. We could support any arbitrary 
> number of record_keys here.
> While we are aware there are many alternatives to avoid using a null 
> record_key, we believe this will act as a usability improvement so that new 
> users are not forced to clean/update their data in order to use Hudi.
> We are hoping to get some feedback on the idea.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-326) Support deleting records with only record_key

2019-11-07 Thread Brandon Scheller (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969684#comment-16969684
 ] 

Brandon Scheller commented on HUDI-326:
---

Additionally, does anyone have some context on how difficult this 
implementation would be? It seems like Hudi doesn't really have any way to 
track its own partitions, so it looks like we'd have to scan for all the 
partitions within the table if we wanted to implement something like this.

> Support deleting records with only record_key
> -
>
> Key: HUDI-326
> URL: https://issues.apache.org/jira/browse/HUDI-326
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Brandon Scheller
>Priority: Major
>
> Currently Hudi requires 3 things to issue a hard delete using 
> EmptyHoodieRecordPayload. It requires (record_key, partition_key, 
> precombine_key).
> This means that in many real use scenarios, you are required to issue a 
> select query to find the partition_key and possibly precombine_key for a 
> certain record before deleting it.
> We would like to avoid this extra step by being allowed to issue a delete 
> based on only the record_key of a record.
> This means that it would blanket delete all records with that specific 
> record_key across all partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-326) Support deleting records with only record_key

2019-11-07 Thread Brandon Scheller (Jira)
Brandon Scheller created HUDI-326:
-

 Summary: Support deleting records with only record_key
 Key: HUDI-326
 URL: https://issues.apache.org/jira/browse/HUDI-326
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: Brandon Scheller


Currently Hudi requires 3 things to issue a hard delete using 
EmptyHoodieRecordPayload. It requires (record_key, partition_key, 
precombine_key).

This means that in many real use scenarios, you are required to issue a select 
query to find the partition_key and possibly precombine_key for a certain 
record before deleting it.

We would like to avoid this extra step by being allowed to issue a delete based 
on only the record_key of a record.
This means that it would blanket delete all records with that specific 
record_key across all partitions.
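
For context, a sketch of what a delete currently requires versus what this issue 
asks for (HoodieKey is a real Hudi class; the rest is illustrative, not an 
existing API):
{code:java}
import org.apache.hudi.common.model.HoodieKey;

public class DeleteByRecordKeySketch {
  public static void main(String[] args) {
    // Today: the caller must already know the partition path (plus a precombine
    // value for EmptyHoodieRecordPayload), which usually means a lookup query
    // before the delete can be issued.
    HoodieKey today = new HoodieKey("event_id:104", "event_type=type1");

    // Proposed: supply only the record key and let a global index resolve the
    // partition(s), deleting every record with that key across all partitions.
    String proposed = "event_id:104";

    System.out.println(today + " vs delete-by-key(" + proposed + ")");
  }
}
{code}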



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-305) Presto MOR "_rt" queries only reads base parquet file

2019-10-28 Thread Brandon Scheller (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961530#comment-16961530
 ] 

Brandon Scheller commented on HUDI-305:
---

From initial investigation, it looks like this will require changes in Presto 
to support the combination of a custom file split + record reader.


A similar scenario to PrestoParquetRealtimeInputFormat, where a custom 
split/reader combination is required, is asked about here:

[https://groups.google.com/forum/#!topic/presto-users/oObR8FES_zI]

Looking at the code, it appears that Presto cannot support the current Hudi use 
case without modification on the Presto side.

 

I was wondering what approach is currently being looked at?

> Presto MOR "_rt" queries only reads base parquet file 
> --
>
> Key: HUDI-305
> URL: https://issues.apache.org/jira/browse/HUDI-305
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Presto Integration
> Environment: On AWS EMR
>Reporter: Brandon Scheller
>Assignee: Bhavani Sudha Saktheeswaran
>Priority: Major
> Fix For: 0.5.1
>
>
> Code example to reproduce.
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.spark.sql.SaveMode
> val df = Seq(
>   ("100", "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
>   ("101", "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
>   ("104", "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
>   ("105", "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
>   ).toDF("event_id", "event_name", "event_ts", "event_type")
> var tableName = "hudi_events_mor_1"
> var tablePath = "s3://emr-users/wenningd/hudi/tables/events/" + tableName
> // write hudi dataset
> df.write.format("org.apache.hudi")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Overwrite)
>   .save(tablePath)
> // update a record with event_name "event_name_123" => "event_name_changed"
> val df1 = spark.read.format("org.apache.hudi").load(tablePath + "/*/*")
> val df2 = df1.filter($"event_id" === "104")
> val df3 = df2.withColumn("event_name", lit("event_name_changed"))
> // update hudi dataset
> df3.write.format("org.apache.hudi")
>.option(HoodieWriteConfig.TABLE_NAME, tableName)
>.option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>.option("hoodie.compact.inline", "false")
>.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>.option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>.mode(SaveMode.Append)
>.save(tablePath)
> {code}
> Now when querying the real-time table from Hive, we have no issue seeing the 
> updated value:
> {code:java}
> hive> select event_name from hudi_events_mor_1_rt;
> OK
> event_name_900
> event_name_changed
> event_name_546
> event_name_678
> Time taken: 0.103 seconds, Fetched: 4 row(s)
> {code}
> But when querying the real-time table from Presto, we only read the base 
> parquet file and do not see the update that should be merged in from the log 
> file.
> {code:java}
> presto:default> select event_name from hudi_events_mor_1_rt;
>event_name
> 
>  event_name_900
>  event_name_123
>  event_name_546
>  event_name_678
> (4 rows)
> {code}
> 

[jira] [Updated] (HUDI-305) Presto MOR "_rt" queries only reads base parquet file

2019-10-16 Thread Brandon Scheller (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Scheller updated HUDI-305:
--
Environment: On AWS EMR

> Presto MOR "_rt" queries only reads base parquet file 
> --
>
> Key: HUDI-305
> URL: https://issues.apache.org/jira/browse/HUDI-305
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
> Environment: On AWS EMR
>Reporter: Brandon Scheller
>Priority: Major
>
> Code example to reproduce.
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.spark.sql.SaveMode
> val df = Seq(
>   ("100", "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
>   ("101", "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
>   ("104", "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
>   ("105", "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
>   ).toDF("event_id", "event_name", "event_ts", "event_type")
> var tableName = "hudi_events_mor_1"
> var tablePath = "s3://emr-users/wenningd/hudi/tables/events/" + tableName
> // write hudi dataset
> df.write.format("org.apache.hudi")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Overwrite)
>   .save(tablePath)
> // update a record with event_name "event_name_123" => "event_name_changed"
> val df1 = spark.read.format("org.apache.hudi").load(tablePath + "/*/*")
> val df2 = df1.filter($"event_id" === "104")
> val df3 = df2.withColumn("event_name", lit("event_name_changed"))
> // update hudi dataset
> df3.write.format("org.apache.hudi")
>.option(HoodieWriteConfig.TABLE_NAME, tableName)
>.option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>.option("hoodie.compact.inline", "false")
>.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>.option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>.mode(SaveMode.Append)
>.save(tablePath)
> {code}
> Now when querying the real-time table from Hive, we have no issue seeing the 
> updated value:
> {code:java}
> hive> select event_name from hudi_events_mor_1_rt;
> OK
> event_name_900
> event_name_changed
> event_name_546
> event_name_678
> Time taken: 0.103 seconds, Fetched: 4 row(s)
> {code}
> But when querying the real-time table from Presto, we only read the base 
> parquet file and do not see the update that should be merged in from the log 
> file.
> {code:java}
> presto:default> select event_name from hudi_events_mor_1_rt;
>event_name
> 
>  event_name_900
>  event_name_123
>  event_name_546
>  event_name_678
> (4 rows)
> {code}
> Our current understanding of this issue is that, while 
> HoodieParquetRealtimeInputFormat correctly generates the splits, the 
> RealtimeCompactedRecordReader record reader is not used, so the log file is 
> not read and only the base parquet file is read.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-305) Presto MOR "_rt" queries only reads base parquet file

2019-10-16 Thread Brandon Scheller (Jira)
Brandon Scheller created HUDI-305:
-

 Summary: Presto MOR "_rt" queries only reads base parquet file 
 Key: HUDI-305
 URL: https://issues.apache.org/jira/browse/HUDI-305
 Project: Apache Hudi (incubating)
  Issue Type: Bug
Reporter: Brandon Scheller


Code example to reproduce.
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

val df = Seq(
  ("100", "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
  ("101", "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  ("104", "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  ("105", "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
  ).toDF("event_id", "event_name", "event_ts", "event_type")

var tableName = "hudi_events_mor_1"
var tablePath = "s3://emr-users/wenningd/hudi/tables/events/" + tableName

// write hudi dataset
df.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
"org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Overwrite)
  .save(tablePath)

// update a record with event_name "event_name_123" => "event_name_changed"
val df1 = spark.read.format("org.apache.hudi").load(tablePath + "/*/*")
val df2 = df1.filter($"event_id" === "104")
val df3 = df2.withColumn("event_name", lit("event_name_changed"))

// update hudi dataset
df3.write.format("org.apache.hudi")
   .option(HoodieWriteConfig.TABLE_NAME, tableName)
   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
   .option("hoodie.compact.inline", "false")
   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
   .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
"org.apache.hudi.hive.MultiPartKeysValueExtractor")
   .mode(SaveMode.Append)
   .save(tablePath)
{code}
Now when querying the real-time table from Hive, we have no issue seeing the 
updated value:
{code:java}
hive> select event_name from hudi_events_mor_1_rt;
OK
event_name_900
event_name_changed
event_name_546
event_name_678
Time taken: 0.103 seconds, Fetched: 4 row(s)
{code}
But when querying the real-time table from Presto, we only read the base 
parquet file and do not see the update that should be merged in from the log 
file.
{code:java}
presto:default> select event_name from hudi_events_mor_1_rt;
   event_name

 event_name_900
 event_name_123
 event_name_546
 event_name_678
(4 rows)
{code}
Our current understanding of this issue is that, while 
HoodieParquetRealtimeInputFormat correctly generates the splits, the 
RealtimeCompactedRecordReader record reader is not used, so the log file is not 
read and only the base parquet file is read.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)