[GitHub] [hudi] hughfdjackson commented on issue #2265: Arrays with nulls in them result in broken parquet files
hughfdjackson commented on issue #2265: URL: https://github.com/apache/hudi/issues/2265#issuecomment-754502750 @umehrot2 Thanks for looking into this - I'm taking a bit of hope from the error message of the code you linked ;) ![image](https://user-images.githubusercontent.com/545689/103626620-5c474380-4f34-11eb-8ad9-a309b10b6d36.png) Have you had any further thoughts on this issue? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-913) Update docs about KeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang updated HUDI-913: -- Status: Open (was: New) > Update docs about KeyGenerator > -- > > Key: HUDI-913 > URL: https://issues.apache.org/jira/browse/HUDI-913 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: wangxianghu >Assignee: wangxianghu >Priority: Major > Labels: pull-request-available > > update default values about `KeyGenerator` > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-913) Update docs about KeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang updated HUDI-913: -- Fix Version/s: 0.7.0 > Update docs about KeyGenerator > -- > > Key: HUDI-913 > URL: https://issues.apache.org/jira/browse/HUDI-913 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: wangxianghu >Assignee: wangxianghu >Priority: Major > Labels: pull-request-available > Fix For: 0.7.0 > > > update default values about `KeyGenerator`
[jira] [Closed] (HUDI-913) Update docs about KeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang closed HUDI-913. - Resolution: Done > Update docs about KeyGenerator > -- > > Key: HUDI-913 > URL: https://issues.apache.org/jira/browse/HUDI-913 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: wangxianghu >Assignee: wangxianghu >Priority: Major > Labels: pull-request-available > Fix For: 0.7.0 > > > update default values about `KeyGenerator`
[GitHub] [hudi] codecov-io edited a comment on pull request #2374: [HUDI-845] Added locking capability to allow multiple writers
codecov-io edited a comment on pull request #2374: URL: https://github.com/apache/hudi/pull/2374#issuecomment-750782300

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=h1) Report

> Merging [#2374](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=desc) (21792c6) into [master](https://codecov.io/gh/apache/hudi/commit/698694a1571cdcc9848fc79aa34c8cbbf9662bc4?el=desc) (698694a) will **decrease** coverage by `40.19%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2374/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=tree)

```diff
@@             Coverage Diff              @@
##           master    #2374       +/-   ##
============================================
- Coverage    50.23%   10.04%   -40.20%
+ Complexity    2985       48     -2937
============================================
  Files          410       52      -358
  Lines        18398     1852    -16546
  Branches      1884      223     -1661
============================================
- Hits          9242      186     -9056
+ Misses        8398     1653     -6745
+ Partials       758       13      -745
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `10.04% <ø> (-59.62%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) |
[GitHub] [hudi] wangxianghu opened a new pull request #2405: [HUDI-1506] Fix wrong exception thrown in HoodieAvroUtils
wangxianghu opened a new pull request #2405: URL: https://github.com/apache/hudi/pull/2405

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contributing.html before opening a pull request.*

## What is the purpose of the pull request

*(For example: This pull request adds quick-start document.)*

## Brief change log

*(for example:)*
- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request

*(Please pick either of the following options)*

This pull request is a trivial rework / code cleanup without any test coverage.

*(or)*

This pull request is already covered by existing tests, such as *(please describe tests)*.

(or)

This change added tests and can be verified as follows:

*(example:)*
- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[jira] [Updated] (HUDI-1506) Fix wrong exception thrown in HoodieAvroUtils
[ https://issues.apache.org/jira/browse/HUDI-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangxianghu updated HUDI-1506: -- Description: {code:java} // Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 4.0 failed 4 times, most recent failure: Lost task 4.3 in stage 4.0 (TID 24, al-prd-dtp-data-lake-10-0-88-26, executor 4): org.apache.hudi.exception.HoodieException: etlDatetime(Part -etlDatetime) field not found in record. Acceptable fields were :[vin, uuid, commercialType, businessType, vehicleNo, plateColor, vehicleColor, engineId, nextFixDate, feePrintId, transArea, createTime, updateTime, registerDate, curVehicleNo, reportVehicleNo, model, checkDate, certifyDateA, certifyDateB, certificate, transAgency, transAgencyNet, transDateStart, transDateStop, insurCom, insurNum, insurType, insurCount, insurEff, insurExp, insurCreateTime, insurUpdateTime, curVehicleCertno, reportVehicleCertno, seats, brand, vehicleType, fuelType, engineDisplace, photo, enginePower, gpsBrand, gpsModel, gpsImei, gpsInstallDate, curDriverUuid, reportDrivers, curTimeOn, curTimeOff, timeFrom, timeTo, ownerName, fixState, checkState, photoId, photoIdUrl, fareType, wheelBase, vehicleTec, vehicleSafe, lesseeName, lesseeCode, sdcOperationType, hivePartition, etlDatetime] {code} we can see that `etlDatetime` does exist; the error is actually caused by a null value was:Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 4.0 failed 4 times, most recent failure: Lost task 4.3 in stage 4.0 (TID 24, al-prd-dtp-data-lake-10-0-88-26, executor 4): org.apache.hudi.exception.HoodieException: etlDatetime(Part -etlDatetime) field not found in record. 
Acceptable fields were :[vin, uuid, commercialType, businessType, vehicleNo, plateColor, vehicleColor, engineId, nextFixDate, feePrintId, transArea, createTime, updateTime, registerDate, curVehicleNo, reportVehicleNo, model, checkDate, certifyDateA, certifyDateB, certificate, transAgency, transAgencyNet, transDateStart, transDateStop, insurCom, insurNum, insurType, insurCount, insurEff, insurExp, insurCreateTime, insurUpdateTime, curVehicleCertno, reportVehicleCertno, seats, brand, vehicleType, fuelType, engineDisplace, photo, enginePower, gpsBrand, gpsModel, gpsImei, gpsInstallDate, curDriverUuid, reportDrivers, curTimeOn, curTimeOff, timeFrom, timeTo, ownerName, fixState, checkState, photoId, photoIdUrl, fareType, wheelBase, vehicleTec, vehicleSafe, lesseeName, lesseeCode, sdcOperationType, hivePartition, etlDatetime] > Fix wrong exception thrown in HoodieAvroUtils > - > > Key: HUDI-1506 > URL: https://issues.apache.org/jira/browse/HUDI-1506 > Project: Apache Hudi > Issue Type: Bug >Reporter: wangxianghu >Assignee: wangxianghu >Priority: Major > > {code:java} > // > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 4 in stage 4.0 failed 4 times, most recent failure: Lost task 4.3 in > stage 4.0 (TID 24, al-prd-dtp-data-lake-10-0-88-26, executor 4): > org.apache.hudi.exception.HoodieException: etlDatetime(Part -etlDatetime) > field not found in record. 
Acceptable fields were :[vin, uuid, > commercialType, businessType, vehicleNo, plateColor, vehicleColor, engineId, > nextFixDate, feePrintId, transArea, createTime, updateTime, registerDate, > curVehicleNo, reportVehicleNo, model, checkDate, certifyDateA, certifyDateB, > certificate, transAgency, transAgencyNet, transDateStart, transDateStop, > insurCom, insurNum, insurType, insurCount, insurEff, insurExp, > insurCreateTime, insurUpdateTime, curVehicleCertno, reportVehicleCertno, > seats, brand, vehicleType, fuelType, engineDisplace, photo, enginePower, > gpsBrand, gpsModel, gpsImei, gpsInstallDate, curDriverUuid, reportDrivers, > curTimeOn, curTimeOff, timeFrom, timeTo, ownerName, fixState, checkState, > photoId, photoIdUrl, fareType, wheelBase, vehicleTec, vehicleSafe, > lesseeName, lesseeCode, sdcOperationType, hivePartition, etlDatetime] > {code} > we can see that `etlDatetime` does exist; the error is actually caused by a null value
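The bug described in HUDI-1506 — a lookup that reports "field not found" when the field exists but holds a null value — can be illustrated with a minimal, self-contained sketch. Note this is an assumption-laden illustration, not Hudi's actual `HoodieAvroUtils` code: plain Java maps stand in for Avro records, and the class and method names (`FieldLookup`, `getFieldBuggy`, `getFieldFixed`) are hypothetical.

```java
import java.util.*;

public class FieldLookup {

    // Buggy pattern: conflates "key absent" with "key present but null",
    // so a null field value produces the misleading "field not found" message.
    static String getFieldBuggy(Map<String, Object> record, String field) {
        Object v = record.get(field);
        if (v == null) {
            throw new NoSuchElementException(field + " field not found in record. "
                + "Acceptable fields were :" + record.keySet());
        }
        return v.toString();
    }

    // Fixed pattern: check key presence first, then check the value,
    // and throw a distinct, accurate exception for each case.
    static String getFieldFixed(Map<String, Object> record, String field) {
        if (!record.containsKey(field)) {
            throw new NoSuchElementException(field + " field not found in record. "
                + "Acceptable fields were :" + record.keySet());
        }
        Object v = record.get(field);
        if (v == null) {
            throw new IllegalStateException(field + " field is present but its value is null.");
        }
        return v.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> record = new HashMap<>();
        record.put("etlDatetime", null); // the field exists; its value is null

        try {
            getFieldBuggy(record, "etlDatetime");
        } catch (NoSuchElementException e) {
            System.out.println("buggy:  " + e.getMessage()); // misleading diagnosis
        }
        try {
            getFieldFixed(record, "etlDatetime");
        } catch (IllegalStateException e) {
            System.out.println("fixed:  " + e.getMessage()); // accurate diagnosis
        }
    }
}
```

This matches the report: `etlDatetime` appears in the "Acceptable fields" list precisely because it does exist in the record — only its value is null.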
[GitHub] [hudi] wangxianghu commented on pull request #2405: [HUDI-1506] Fix wrong exception thrown in HoodieAvroUtils
wangxianghu commented on pull request #2405: URL: https://github.com/apache/hudi/pull/2405#issuecomment-754644106 @yanghua please take a look when free
[GitHub] [hudi] wangxianghu commented on pull request #2404: [MINOR] Add Jira URL and Mailing List
wangxianghu commented on pull request #2404: URL: https://github.com/apache/hudi/pull/2404#issuecomment-754643863 @yanghua please take a look when free
[GitHub] [hudi] SureshK-T2S opened a new issue #2406: [SUPPORT] Deltastreamer - Property hoodie.datasource.write.partitionpath.field not found
SureshK-T2S opened a new issue #2406: URL: https://github.com/apache/hudi/issues/2406 I am attempting to create a hudi table using a parquet file on S3. The motivation for this approach is based on this Hudi blog: https://cwiki.apache.org/confluence/display/HUDI/2020/01/20/Change+Capture+Using+AWS+Database+Migration+Service+and+Hudi To first attempt usage of deltastreamer to ingest a full initial batch load, I attempted to use parquet files used in an aws blog at s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1/ https://aws.amazon.com/blogs/aws/new-insert-update-delete-data-on-s3-with-amazon-emr-and-apache-hudi/ At first I used the spark shell on EMR to load the data into a dataframe and view it, this happens with no issues: ![image](https://user-images.githubusercontent.com/17935082/103654795-59783d00-4f8c-11eb-86bd-a1f0cd7db3f3.png) I then attempted to use Hudi Deltastreamer as per my understanding of the documentation, however I ran into a couple of issues. Steps to reproduce the behavior: 1. 
Ran the following:
```
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--packages org.apache.hudi:hudi-spark-bundle_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.4 \
--master yarn --deploy-mode client \
/usr/lib/hudi/hudi-utilities-bundle.jar --table-type MERGE_ON_READ \
--source-ordering-field request_timestamp \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-base-path s3://mysqlcdc-stream-prod/hudi_tryout/hudi_aws_test --target-table hudi_aws_test \
--hoodie-conf hoodie.datasource.write.recordkey.field=request_timestamp,hoodie.deltastreamer.source.dfs.root=s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1,hoodie.datasource.write.partitionpath.field=request_timestamp:TIMESTAMP
```
Stacktrace:
```
Exception in thread "main" java.io.IOException: Could not load key generator class org.apache.hudi.keygen.SimpleKeyGenerator
	at org.apache.hudi.DataSourceUtils.createKeyGenerator(DataSourceUtils.java:94)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.<init>(DeltaSync.java:190)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:552)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:129)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:99)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:464)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class
	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:89)
	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:98)
	at org.apache.hudi.DataSourceUtils.createKeyGenerator(DataSourceUtils.java:92)
	... 17 more
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:87)
	... 19 more
Caused by: java.lang.IllegalArgumentException: Property hoodie.datasource.write.partitionpath.field not found
	at org.apache.hudi.common.config.TypedProperties.checkKey(TypedProperties.java:42)
	at org.apache.hudi.common.config.TypedProperties.getString(TypedProperties.java:47)
	at org.apache.hudi.keygen.SimpleKeyGenerator.<init>(SimpleKeyGenerator.java:36)
	... 24 more
```
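A likely cause of the `Property hoodie.datasource.write.partitionpath.field not found` error above (an assumption based on the stack trace, not confirmed in the thread) is that all three properties were packed into one `--hoodie-conf` flag, so the whole comma-joined string is parsed as a single property and the partition-path property is never set. `--hoodie-conf` can be repeated, one `key=value` pair per occurrence. A sketch of a corrected invocation, reusing the paths and table names from the report above:

```
# Sketch only: one property per --hoodie-conf flag instead of a single
# comma-joined value.
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.4 \
  --master yarn --deploy-mode client \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --table-type MERGE_ON_READ \
  --source-ordering-field request_timestamp \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3://mysqlcdc-stream-prod/hudi_tryout/hudi_aws_test \
  --target-table hudi_aws_test \
  --hoodie-conf hoodie.datasource.write.recordkey.field=request_timestamp \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1 \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=request_timestamp
```

Note that the original command's `request_timestamp:TIMESTAMP` suffix is the syntax used when a key generator that understands typed partition fields is configured; with the default `SimpleKeyGenerator` the plain field name is the safer choice, so the suffix is dropped here.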
[GitHub] [hudi] codecov-io commented on pull request #2404: [MINOR] Add Jira URL and Mailing List
codecov-io commented on pull request #2404: URL: https://github.com/apache/hudi/pull/2404#issuecomment-754638146

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2404?src=pr=h1) Report

> Merging [#2404](https://codecov.io/gh/apache/hudi/pull/2404?src=pr=desc) (fdeb851) into [master](https://codecov.io/gh/apache/hudi/commit/698694a1571cdcc9848fc79aa34c8cbbf9662bc4?el=desc) (698694a) will **decrease** coverage by `40.19%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2404/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2404?src=pr=tree)

```diff
@@             Coverage Diff              @@
##           master    #2404       +/-   ##
============================================
- Coverage    50.23%   10.04%   -40.20%
+ Complexity    2985       48     -2937
============================================
  Files          410       52      -358
  Lines        18398     1852    -16546
  Branches      1884      223     -1661
============================================
- Hits          9242      186     -9056
+ Misses        8398     1653     -6745
+ Partials       758       13      -745
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `10.04% <ø> (-59.62%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2404?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00%
[GitHub] [hudi] codecov-io edited a comment on pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering
codecov-io edited a comment on pull request #2379: URL: https://github.com/apache/hudi/pull/2379#issuecomment-751244130

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=h1) Report

> Merging [#2379](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=desc) (70ffbba) into [master](https://codecov.io/gh/apache/hudi/commit/6cdf59d92b1c260abae82bba7d30d8ac280bddbf?el=desc) (6cdf59d) will **decrease** coverage by `42.58%`.
> The diff coverage is `0.00%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2379/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=tree)

```diff
@@             Coverage Diff              @@
##           master    #2379       +/-   ##
============================================
- Coverage    52.23%    9.65%   -42.59%
+ Complexity    2662       48     -2614
============================================
  Files          335       53      -282
  Lines        14981     1927    -13054
  Branches      1506      231     -1275
============================================
- Hits          7825      186     -7639
+ Misses        6533     1728     -4805
+ Partials       623       13      -610
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.65% <0.00%> (-60.01%)` | `0.00 <0.00> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...org/apache/hudi/utilities/HoodieClusteringJob.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZUNsdXN0ZXJpbmdKb2IuamF2YQ==) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
| [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `0.00% <0.00%> (-70.76%)` | `0.00 <0.00> (-49.00)` | |
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |

... and [312 more](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree-more)
[jira] [Created] (HUDI-1506) Fix wrong exception thrown in HoodieAvroUtils
wangxianghu created HUDI-1506: - Summary: Fix wrong exception thrown in HoodieAvroUtils Key: HUDI-1506 URL: https://issues.apache.org/jira/browse/HUDI-1506 Project: Apache Hudi Issue Type: Bug Reporter: wangxianghu Assignee: wangxianghu Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 4.0 failed 4 times, most recent failure: Lost task 4.3 in stage 4.0 (TID 24, al-prd-dtp-data-lake-10-0-88-26, executor 4): org.apache.hudi.exception.HoodieException: etlDatetime(Part -etlDatetime) field not found in record. Acceptable fields were :[vin, uuid, commercialType, businessType, vehicleNo, plateColor, vehicleColor, engineId, nextFixDate, feePrintId, transArea, createTime, updateTime, registerDate, curVehicleNo, reportVehicleNo, model, checkDate, certifyDateA, certifyDateB, certificate, transAgency, transAgencyNet, transDateStart, transDateStop, insurCom, insurNum, insurType, insurCount, insurEff, insurExp, insurCreateTime, insurUpdateTime, curVehicleCertno, reportVehicleCertno, seats, brand, vehicleType, fuelType, engineDisplace, photo, enginePower, gpsBrand, gpsModel, gpsImei, gpsInstallDate, curDriverUuid, reportDrivers, curTimeOn, curTimeOff, timeFrom, timeTo, ownerName, fixState, checkState, photoId, photoIdUrl, fareType, wheelBase, vehicleTec, vehicleSafe, lesseeName, lesseeCode, sdcOperationType, hivePartition, etlDatetime]
[jira] [Updated] (HUDI-1506) Fix wrong exception thrown in HoodieAvroUtils
[ https://issues.apache.org/jira/browse/HUDI-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1506: - Labels: pull-request-available (was: ) > Fix wrong exception thrown in HoodieAvroUtils > - > > Key: HUDI-1506 > URL: https://issues.apache.org/jira/browse/HUDI-1506 > Project: Apache Hudi > Issue Type: Bug >Reporter: wangxianghu >Assignee: wangxianghu >Priority: Major > Labels: pull-request-available > > {code:java} > // > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 4 in stage 4.0 failed 4 times, most recent failure: Lost task 4.3 in > stage 4.0 (TID 24, al-prd-dtp-data-lake-10-0-88-26, executor 4): > org.apache.hudi.exception.HoodieException: etlDatetime(Part -etlDatetime) > field not found in record. Acceptable fields were :[vin, uuid, > commercialType, businessType, vehicleNo, plateColor, vehicleColor, engineId, > nextFixDate, feePrintId, transArea, createTime, updateTime, registerDate, > curVehicleNo, reportVehicleNo, model, checkDate, certifyDateA, certifyDateB, > certificate, transAgency, transAgencyNet, transDateStart, transDateStop, > insurCom, insurNum, insurType, insurCount, insurEff, insurExp, > insurCreateTime, insurUpdateTime, curVehicleCertno, reportVehicleCertno, > seats, brand, vehicleType, fuelType, engineDisplace, photo, enginePower, > gpsBrand, gpsModel, gpsImei, gpsInstallDate, curDriverUuid, reportDrivers, > curTimeOn, curTimeOff, timeFrom, timeTo, ownerName, fixState, checkState, > photoId, photoIdUrl, fareType, wheelBase, vehicleTec, vehicleSafe, > lesseeName, lesseeCode, sdcOperationType, hivePartition, etlDatetime] > {code} > we can see that `etlDatetime` does exist; the error is actually caused by a null value
[GitHub] [hudi] wangxianghu opened a new pull request #2404: [MINOR] Add Jira URL and Mailing List
wangxianghu opened a new pull request #2404: URL: https://github.com/apache/hudi/pull/2404

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contributing.html before opening a pull request.*

## What is the purpose of the pull request

*(For example: This pull request adds quick-start document.)*

## Brief change log

*(for example:)*
- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request

*(Please pick either of the following options)*

This pull request is a trivial rework / code cleanup without any test coverage.

*(or)*

This pull request is already covered by existing tests, such as *(please describe tests)*.

(or)

This change added tests and can be verified as follows:

*(example:)*
- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] yanghua merged pull request #2403: [HUDI-913] Update docs about KeyGenerator
yanghua merged pull request #2403: URL: https://github.com/apache/hudi/pull/2403
[hudi] branch asf-site updated: [HUDI-913] Update docs about KeyGenerator (#2403)
This is an automated email from the ASF dual-hosted git repository. vinoyang pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new ee00bd6 [HUDI-913] Update docs about KeyGenerator (#2403) ee00bd6 is described below commit ee00bd6689f151b7a6a1a197aced8699d373dd79 Author: wangxianghu AuthorDate: Tue Jan 5 18:21:08 2021 +0800 [HUDI-913] Update docs about KeyGenerator (#2403) --- docs/_docs/2_4_configurations.cn.md | 2 +- docs/_docs/2_4_configurations.md| 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/_docs/2_4_configurations.cn.md b/docs/_docs/2_4_configurations.cn.md index 9436971..07e8c02 100644 --- a/docs/_docs/2_4_configurations.cn.md +++ b/docs/_docs/2_4_configurations.cn.md @@ -83,7 +83,7 @@ inputDF.write() 如果设置为true,则生成基于Hive格式的partition目录:= KEYGENERATOR_CLASS_OPT_KEY {#KEYGENERATOR_CLASS_OPT_KEY} - 属性:`hoodie.datasource.write.keygenerator.class`, 默认值:`org.apache.hudi.SimpleKeyGenerator` + 属性:`hoodie.datasource.write.keygenerator.class`, 默认值:`org.apache.hudi.keygen.SimpleKeyGenerator` 键生成器类,实现从输入的`Row`对象中提取键 COMMIT_METADATA_KEYPREFIX_OPT_KEY {#COMMIT_METADATA_KEYPREFIX_OPT_KEY} diff --git a/docs/_docs/2_4_configurations.md b/docs/_docs/2_4_configurations.md index aa472dd..244688b 100644 --- a/docs/_docs/2_4_configurations.md +++ b/docs/_docs/2_4_configurations.md @@ -81,7 +81,7 @@ Actual value ontained by invoking .toString() When set to true, partition folder names follow the format of Hive partitions: = KEYGENERATOR_CLASS_OPT_KEY {#KEYGENERATOR_CLASS_OPT_KEY} - Property: `hoodie.datasource.write.keygenerator.class`, Default: `org.apache.hudi.SimpleKeyGenerator` + Property: `hoodie.datasource.write.keygenerator.class`, Default: `org.apache.hudi.keygen.SimpleKeyGenerator` Key generator class, that implements will extract the key out of incoming `Row` object COMMIT_METADATA_KEYPREFIX_OPT_KEY 
{#COMMIT_METADATA_KEYPREFIX_OPT_KEY}
[GitHub] [hudi] liujinhui1994 closed pull request #2386: [HUDI-1160] Support update partial fields for CoW table
liujinhui1994 closed pull request #2386: URL: https://github.com/apache/hudi/pull/2386
[hudi] branch asf-site updated: Travis CI build asf-site
This is an automated email from the ASF dual-hosted git repository. vinoth pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new f7ca68a Travis CI build asf-site f7ca68a is described below commit f7ca68af86ff254adf064bd4becb4aca02a001cf Author: CI AuthorDate: Tue Jan 5 12:20:10 2021 + Travis CI build asf-site --- content/cn/docs/configurations.html | 2 +- content/docs/configurations.html| 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/content/cn/docs/configurations.html b/content/cn/docs/configurations.html index 1456a71..57aa993 100644 --- a/content/cn/docs/configurations.html +++ b/content/cn/docs/configurations.html @@ -450,7 +450,7 @@ 如果设置为true,则生成基于Hive格式的partition目录:=/span KEYGENERATOR_CLASS_OPT_KEY -属性:hoodie.datasource.write.keygenerator.class, 默认值:org.apache.hudi.SimpleKeyGenerator +属性:hoodie.datasource.write.keygenerator.class, 默认值:org.apache.hudi.keygen.SimpleKeyGenerator 键生成器类,实现从输入的Row对象中提取键 COMMIT_METADATA_KEYPREFIX_OPT_KEY diff --git a/content/docs/configurations.html b/content/docs/configurations.html index c274063..ef5f099 100644 --- a/content/docs/configurations.html +++ b/content/docs/configurations.html @@ -459,7 +459,7 @@ Actual value ontained by invoking .toString() When set to true, partition folder names follow the format of Hive partitions: =/span KEYGENERATOR_CLASS_OPT_KEY -Property: hoodie.datasource.write.keygenerator.class, Default: org.apache.hudi.SimpleKeyGenerator +Property: hoodie.datasource.write.keygenerator.class, Default: org.apache.hudi.keygen.SimpleKeyGenerator Key generator class, that implements will extract the key out of incoming Row object COMMIT_METADATA_KEYPREFIX_OPT_KEY
[GitHub] [hudi] codecov-io commented on pull request #2405: [HUDI-1506] Fix wrong exception thrown in HoodieAvroUtils
codecov-io commented on pull request #2405: URL: https://github.com/apache/hudi/pull/2405#issuecomment-754665459 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2405?src=pr=h1) Report > Merging [#2405](https://codecov.io/gh/apache/hudi/pull/2405?src=pr=desc) (b51e61e) into [master](https://codecov.io/gh/apache/hudi/commit/698694a1571cdcc9848fc79aa34c8cbbf9662bc4?el=desc) (698694a) will **decrease** coverage by `40.19%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2405/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2405?src=pr=tree) ```diff @@ Coverage Diff @@ ## master#2405 +/- ## = - Coverage 50.23% 10.04% -40.20% + Complexity 2985 48 -2937 = Files 410 52 -358 Lines 18398 1852-16546 Branches 1884 223 -1661 = - Hits 9242 186 -9056 + Misses 8398 1653 -6745 + Partials758 13 -745 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudicli | `?` | `?` | | | hudiclient | `?` | `?` | | | hudicommon | `?` | `?` | | | hudiflink | `?` | `?` | | | hudihadoopmr | `?` | `?` | | | hudisparkdatasource | `?` | `?` | | | hudisync | `?` | `?` | | | huditimelineservice | `?` | `?` | | | hudiutilities | `10.04% <ø> (-59.62%)` | `0.00 <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. 
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2405?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | | | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00%
[GitHub] [hudi] vinothchandar commented on a change in pull request #2359: [HUDI-1486] Remove inflight rollback in hoodie writer
vinothchandar commented on a change in pull request #2359: URL: https://github.com/apache/hudi/pull/2359#discussion_r551618729 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java ## @@ -232,17 +250,18 @@ void emitCommitMetrics(String instantTime, HoodieCommitMetadata metadata, String * Main API to run bootstrap to hudi. */ public void bootstrap(Option> extraMetadata) { -if (rollbackPending) { - rollBackInflightBootstrap(); -} +// TODO : MULTIWRITER -> check if failed bootstrap files can be cleaned later HoodieTable table = getTableAndInitCtx(WriteOperationType.UPSERT, HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS); +if (isEager(config.getFailedWritesCleanPolicy())) { Review comment: `config` is a member variable right? why do we pass it in to the checks? Can we just do `eagerCleanFailedWrites()`, which does the if block and the call to `rollbackFailedBootstrap()`? ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieClient.java ## @@ -70,6 +72,8 @@ protected AbstractHoodieClient(HoodieEngineContext context, HoodieWriteConfig cl this.config = clientConfig; this.timelineServer = timelineServer; shouldStopTimelineServer = !timelineServer.isPresent(); +this.heartbeatClient = new HoodieHeartbeatClient(this.fs, this.basePath, +clientConfig.getHoodieClientHeartbeatIntervalInMs(), clientConfig.getHoodieClientHeartbeatTolerableMisses()); Review comment: rename to be shorter? drop the `HoodieClient` or `Hoodie` part in the names? ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java ## @@ -707,24 +739,51 @@ public void rollbackInflightCompaction(HoodieInstant inflightInstant, HoodieTabl } /** - * Cleanup all pending commits. + * Rollback all failed commits. 
*/ - private void rollbackPendingCommits() { + public void rollbackFailedCommits() { HoodieTable table = createTable(config, hadoopConf); -HoodieTimeline inflightTimeline = table.getMetaClient().getCommitsTimeline().filterPendingExcludingCompaction(); -List commits = inflightTimeline.getReverseOrderedInstants().map(HoodieInstant::getTimestamp) -.collect(Collectors.toList()); -for (String commit : commits) { - if (HoodieTimeline.compareTimestamps(commit, HoodieTimeline.LESSER_THAN_OR_EQUALS, +List instantsToRollback = getInstantsToRollback(table); +for (String instant : instantsToRollback) { + if (HoodieTimeline.compareTimestamps(instant, HoodieTimeline.LESSER_THAN_OR_EQUALS, HoodieTimeline.FULL_BOOTSTRAP_INSTANT_TS)) { -rollBackInflightBootstrap(); +rollbackFailedBootstrap(); break; } else { -rollback(commit); +rollback(instant); + } + // Delete the heartbeats from DFS + if (!isEager(config.getFailedWritesCleanPolicy())) { +try { + this.heartbeatClient.delete(instant); +} catch (IOException io) { + LOG.error(io); +} } } } + private List getInstantsToRollback(HoodieTable table) { +if (isEager(config.getFailedWritesCleanPolicy())) { + HoodieTimeline inflightTimeline = table.getMetaClient().getCommitsTimeline().filterPendingExcludingCompaction(); + return inflightTimeline.getReverseOrderedInstants().map(HoodieInstant::getTimestamp) + .collect(Collectors.toList()); +} else if (!isEager(config.getFailedWritesCleanPolicy())) { + return table.getMetaClient().getActiveTimeline() + .getCommitsTimeline().filterInflights().getReverseOrderedInstants().filter(instant -> { +try { + return heartbeatClient.isHeartbeatExpired(instant.getTimestamp(), System.currentTimeMillis()); Review comment: drop the second argument to `isHeartbeatExpired()` and have it get the current time epoch inside? 
## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java ## @@ -707,24 +739,51 @@ public void rollbackInflightCompaction(HoodieInstant inflightInstant, HoodieTabl } /** - * Cleanup all pending commits. + * Rollback all failed commits. */ - private void rollbackPendingCommits() { + public void rollbackFailedCommits() { HoodieTable table = createTable(config, hadoopConf); -HoodieTimeline inflightTimeline = table.getMetaClient().getCommitsTimeline().filterPendingExcludingCompaction(); -List commits = inflightTimeline.getReverseOrderedInstants().map(HoodieInstant::getTimestamp) -.collect(Collectors.toList()); -for (String commit : commits) { - if (HoodieTimeline.compareTimestamps(commit, HoodieTimeline.LESSER_THAN_OR_EQUALS, +List instantsToRollback = getInstantsToRollback(table); +
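One of the review suggestions above is to drop the second argument to `isHeartbeatExpired()` and read the clock inside the method. A minimal self-contained sketch of that API shape follows; this is not Hudi's actual `HoodieHeartbeatClient` — the class, fields, and tolerance value are illustrative stand-ins.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the single-argument expiry check: the current time is taken
// inside the method, so callers cannot pass a stale or inconsistent timestamp.
public class HeartbeatSketch {
    private final long toleranceMs;
    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();

    HeartbeatSketch(long toleranceMs) {
        this.toleranceMs = toleranceMs;
    }

    // Record a heartbeat for the given instant (commit timestamp).
    void beat(String instant) {
        lastHeartbeatMs.put(instant, System.currentTimeMillis());
    }

    // Expired if we have never seen a beat, or the last beat is too old.
    boolean isHeartbeatExpired(String instant) {
        Long last = lastHeartbeatMs.get(instant);
        return last == null || System.currentTimeMillis() - last > toleranceMs;
    }

    public static void main(String[] args) {
        HeartbeatSketch hb = new HeartbeatSketch(60_000);
        hb.beat("20210105182108");
        System.out.println(hb.isHeartbeatExpired("20210105182108")); // fresh beat
        System.out.println(hb.isHeartbeatExpired("20210101000000")); // never beat
    }
}
```

The trade-off of hiding the clock read is slightly harder unit testing (no injectable timestamp), which may be why the original signature took the time as a parameter.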
[jira] [Updated] (HUDI-1479) Replace FSUtils.getAllPartitionPaths() with HoodieTableMetadata#getAllPartitionPaths()
[ https://issues.apache.org/jira/browse/HUDI-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-1479: - Description: *Change #1* {code:java} public static List<String> getAllPartitionPaths(FileSystem fs, String basePathStr, boolean useFileListingFromMetadata, boolean verifyListings, boolean assumeDatePartitioning) throws IOException { if (assumeDatePartitioning) { return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr); } else { HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/", useFileListingFromMetadata, verifyListings, false, false); return tableMetadata.getAllPartitionPaths(); } } {code} is the current implementation, where `HoodieTableMetadata.create()` always creates `HoodieBackedTableMetadata`. Instead we should create `FileSystemBackedTableMetadata` if useFileListingFromMetadata==false anyway *Change #2* On master, we have the `HoodieEngineContext` abstraction, which allows for parallel execution. We should consider moving it to `hudi-common` (it's doable) and then have `FileSystemBackedTableMetadata` redone such that it can do parallelized listings using the passed-in engine, either HoodieSparkEngineContext or HoodieJavaEngineContext. HoodieBackedTableMetadata#getPartitionsToFilesMapping has some parallelized code. We should take one pass and see if that can be redone a bit as well. *Change #3* There are places where we call fs.listStatus() directly. We should make them go through the HoodieTable.getMetadata()... route as well. Essentially, all listing should be concentrated to `FileSystemBackedTableMetadata` !image-2021-01-05-10-00-35-187.png! was: - Can be done once rfc-15 is merged into master. - We will also use HoodieEngineContext and perform parallelized listing as needed. 
- HoodieBackedTableMetadata#getPartitionsToFilesMapping has some parallelized code - Chase down direct usages of FileSystemBackedTableMetadata and parallelize them as well > Replace FSUtils.getAllPartitionPaths() with > HoodieTableMetadata#getAllPartitionPaths() > -- > > Key: HUDI-1479 > URL: https://issues.apache.org/jira/browse/HUDI-1479 > Project: Apache Hudi > Issue Type: Sub-task > Components: Code Cleanup >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Blocker > Fix For: 0.7.0 > > Attachments: image-2021-01-05-10-00-35-187.png > > > *Change #1* > {code:java} > public static List getAllPartitionPaths(FileSystem fs, String > basePathStr, boolean useFileListingFromMetadata, boolean verifyListings, > boolean > assumeDatePartitioning) throws IOException { > if (assumeDatePartitioning) { > return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr); > } else { > HoodieTableMetadata tableMetadata = > HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/", > useFileListingFromMetadata, > verifyListings, false, false); > return tableMetadata.getAllPartitionPaths(); > } > } > {code} > is the current implementation, where `HoodieTableMetadata.create()` always > creates `HoodieBackedTableMetadata`. Instead we should create > `FileSystemBackedTableMetadata` if useFileListingFromMetadata==false anyways > *Change #2* > On master, we have the `HoodieEngineContext` abstraction, which allows for > parallel execution. We should consider moving it to `hudi-common` (its > doable) and then have `FileSystemBackedTableMetadata` redone such that it can > do parallelized listings using the passed in engine. either > HoodieSparkEngineContext or HoodieJavaEngineContext. > HoodieBackedTableMetadata#getPartitionsToFilesMapping has some parallelized > code. We should take one pass and see if that can be redone a bit as well. > > *Change #3* > There are places, where we call fs.listStatus() directly. We should make them > go through the HoodieTable.getMetadata()... 
route as well. Essentially, all > listing should be concentrated to `FileSystemBackedTableMetadata` > !image-2021-01-05-10-00-35-187.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
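"Change #1" above boils down to a factory that dispatches on the `useFileListingFromMetadata` flag instead of always constructing the metadata-table-backed reader. The sketch below uses tiny stand-in types for Hudi's `HoodieTableMetadata` hierarchy; the names mirror the ticket but none of this is the actual Hudi implementation.

```java
import java.util.Arrays;
import java.util.List;

public class MetadataFactorySketch {
    interface TableMetadata {
        List<String> getAllPartitionPaths();
    }

    // Stand-in for FileSystemBackedTableMetadata: lists partitions via the filesystem.
    static class FileSystemBacked implements TableMetadata {
        public List<String> getAllPartitionPaths() {
            return Arrays.asList("2021/01/05");
        }
    }

    // Stand-in for HoodieBackedTableMetadata: reads the internal metadata table.
    static class MetadataTableBacked implements TableMetadata {
        public List<String> getAllPartitionPaths() {
            return Arrays.asList("2021/01/05");
        }
    }

    // The proposed dispatch: only build the metadata-table reader when the
    // feature flag is on; otherwise fall back to plain file listing.
    static TableMetadata create(boolean useFileListingFromMetadata) {
        return useFileListingFromMetadata ? new MetadataTableBacked() : new FileSystemBacked();
    }

    public static void main(String[] args) {
        System.out.println(create(false).getClass().getSimpleName()); // FileSystemBacked
        System.out.println(create(true).getClass().getSimpleName());  // MetadataTableBacked
    }
}
```

Choosing the implementation up front also avoids the cost of instantiating `HoodieBackedTableMetadata` (and its `/tmp/` spillable path) when the caller never intended to use it.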
[jira] [Updated] (HUDI-1479) Replace FSUtils.getAllPartitionPaths() with HoodieTableMetadata#getAllPartitionPaths()
[ https://issues.apache.org/jira/browse/HUDI-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-1479: - Attachment: image-2021-01-05-10-00-35-187.png > Replace FSUtils.getAllPartitionPaths() with > HoodieTableMetadata#getAllPartitionPaths() > -- > > Key: HUDI-1479 > URL: https://issues.apache.org/jira/browse/HUDI-1479 > Project: Apache Hudi > Issue Type: Sub-task > Components: Code Cleanup >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Blocker > Fix For: 0.7.0 > > Attachments: image-2021-01-05-10-00-35-187.png > > > - Can be done once rfc-15 is merged into master. > - We will also use HoodieEngineContext and perform parallelized listing as > needed. > - HoodieBackedTableMetadata#getPartitionsToFilesMapping has some > parallelized code > - Chase down direct usages of FileSystemBackedTableMetadata and parallelize > them as well -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] nsivabalan commented on a change in pull request #2400: [WIP] Some fixes to test suite framework. Adding clustering node
nsivabalan commented on a change in pull request #2400: URL: https://github.com/apache/hudi/pull/2400#discussion_r552072132 ## File path: docker/demo/config/test-suite/complex-dag-cow.yaml ## @@ -14,41 +14,47 @@ # See the License for the specific language governing permissions and # limitations under the License. dag_name: cow-long-running-example.yaml -dag_rounds: 2 +dag_rounds: 1 Review comment: ignore reviewing these two files for now: complex-dag-cow.yaml and complex-dag-mor.yaml I am yet to fix these. ## File path: hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/dag/nodes/ValidateDatasetNode.java ## @@ -111,17 +111,19 @@ public void execute(ExecutionContext context) throws Exception { throw new AssertionError("Hudi contents does not match contents input data. "); } -String database = context.getWriterContext().getProps().getString(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY()); -String tableName = context.getWriterContext().getProps().getString(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY()); -log.warn("Validating hive table with db : " + database + " and table : " + tableName); -Dataset cowDf = session.sql("SELECT * FROM " + database + "." + tableName); -Dataset trimmedCowDf = cowDf.drop(HoodieRecord.COMMIT_TIME_METADATA_FIELD).drop(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD).drop(HoodieRecord.RECORD_KEY_METADATA_FIELD) - .drop(HoodieRecord.PARTITION_PATH_METADATA_FIELD).drop(HoodieRecord.FILENAME_METADATA_FIELD); -intersectionDf = inputSnapshotDf.intersect(trimmedDf); -// the intersected df should be same as inputDf. if not, there is some mismatch. -if (inputSnapshotDf.except(intersectionDf).count() != 0) { - log.error("Data set validation failed for COW hive table. Total count in hudi " + trimmedCowDf.count() + ", input df count " + inputSnapshotDf.count()); - throw new AssertionError("Hudi hive table contents does not match contents input data. 
"); +if (config.isValidateHive()) { Review comment: no changes except adding this if condition ## File path: hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/FlexibleSchemaRecordGenerationIterator.java ## @@ -41,21 +46,17 @@ private GenericRecord lastRecord; // Partition path field name private Set partitionPathFieldNames; - private String firstPartitionPathField; public FlexibleSchemaRecordGenerationIterator(long maxEntriesToProduce, String schema) { this(maxEntriesToProduce, GenericRecordFullPayloadGenerator.DEFAULT_PAYLOAD_SIZE, schema, null, 0); } public FlexibleSchemaRecordGenerationIterator(long maxEntriesToProduce, int minPayloadSize, String schemaStr, - List partitionPathFieldNames, int numPartitions) { + List partitionPathFieldNames, int partitionIndex) { Review comment: all changes in this file are part of reverting e33a8f733c4a9a94479c166ad13ae9d53142cd3f ## File path: hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/DeltaGenerator.java ## @@ -127,9 +127,13 @@ public DeltaGenerator(DFSDeltaConfig deltaOutputConfig, JavaSparkContext jsc, Sp return ws; } + public int getBatchId() { +return this.batchId; + } + public JavaRDD generateInserts(Config operation) { int numPartitions = operation.getNumInsertPartitions(); -long recordsPerPartition = operation.getNumRecordsInsert(); +long recordsPerPartition = operation.getNumRecordsInsert() / numPartitions; Review comment: these are part of reverting e33a8f733c4a9a94479c166ad13ae9d53142cd3f ## File path: hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/DeltaGenerator.java ## @@ -140,7 +144,7 @@ public DeltaGenerator(DFSDeltaConfig deltaOutputConfig, JavaSparkContext jsc, Sp JavaRDD inputBatch = jsc.parallelize(partitionIndexes, numPartitions) .mapPartitionsWithIndex((index, p) -> { return new LazyRecordGeneratorIterator(new FlexibleSchemaRecordGenerationIterator(recordsPerPartition, - minPayloadSize, schemaStr, partitionPathFieldNames, 
numPartitions)); +minPayloadSize, schemaStr, partitionPathFieldNames, (Integer)index)); Review comment: these are part of reverting e33a8f733c4a9a94479c166ad13ae9d53142cd3f ## File path: hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/GenericRecordFullPayloadGenerator.java ## @@ -46,9 +46,9 @@ */ public class GenericRecordFullPayloadGenerator implements Serializable { - private static final Logger LOG = LoggerFactory.getLogger(GenericRecordFullPayloadGenerator.class); + private static Logger LOG = LoggerFactory.getLogger(GenericRecordFullPayloadGenerator.class); Review comment: all changes in this file are part of reverting e33a8f733c4a9a94479c166ad13ae9d53142cd3f ## File path:
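The `recordsPerPartition` change in the diff above (dividing the total insert count by the number of partitions, rather than handing each partition the full total) is simple arithmetic, but integer division silently drops a remainder. A minimal sketch of the computation, with hypothetical names:

```java
// Each partition generator should produce total/numPartitions records, not the
// full total. Plain integer division drops any remainder, so a real generator
// would need to assign the leftover records somewhere; this only shows the math.
public class RecordsPerPartitionSketch {
    static long recordsPerPartition(long totalRecords, int numPartitions) {
        return totalRecords / numPartitions; // remainder silently dropped
    }

    public static void main(String[] args) {
        long perPartition = recordsPerPartition(1000, 4);
        System.out.println(perPartition);     // 250
        System.out.println(perPartition * 4); // 1000 only when it divides evenly
    }
}
```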
[GitHub] [hudi] nsivabalan commented on pull request #2400: Some fixes and enhancements to test suite framework
nsivabalan commented on pull request #2400: URL: https://github.com/apache/hudi/pull/2400#issuecomment-754812328 @n3nash : Patch is ready for review. @satishkotha : I have added clustering node. Do check it out.
[jira] [Commented] (HUDI-1459) Support for handling of REPLACE instants
[ https://issues.apache.org/jira/browse/HUDI-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259074#comment-17259074 ] Vinoth Chandar commented on HUDI-1459: -- [~pwason] [~satishkotha] several users are reporting this when trying to use SaveMode.Overwrite on the spark data source with the metadata table turned on {code} org.apache.hudi.exception.HoodieException: Unknown type of action replacecommit at org.apache.hudi.metadata.HoodieTableMetadataUtil.convertInstantToMetaRecords(HoodieTableMetadataUtil.java:96) at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.syncFromInstants(HoodieBackedTableMetadataWriter.java:372) at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.(HoodieBackedTableMetadataWriter.java:120) at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.(SparkHoodieBackedTableMetadataWriter.java:62) at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.create(SparkHoodieBackedTableMetadataWriter.java:58) at org.apache.hudi.client.SparkRDDWriteClient.syncTableMetadata(SparkRDDWriteClient.java:406) at org.apache.hudi.client.AbstractHoodieWriteClient.postCommit(AbstractHoodieWriteClient.java:417) at org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:189) at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:108) at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:439) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:224) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:129) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:173) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:169) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:197) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:194) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:169) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:114) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:112) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696) at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$executeQuery$1(SQLExecution.scala:83) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1$$anonfun$apply$1.apply(SQLExecution.scala:94) at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141) at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$withMetrics(SQLExecution.scala:178) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:93) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:200) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:92) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291) at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:249) ... 57 elided {code} > Support for handling of REPLACE instants > > > Key: HUDI-1459 > URL: https://issues.apache.org/jira/browse/HUDI-1459 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Reporter: Vinoth Chandar >Assignee: Prashant Wason >Priority: Blocker > Fix For: 0.7.0 > > > Once we rebase to master, we need to handle replace instants as well, as they > show up on the timeline. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
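The "Unknown type of action replacecommit" failure above is the classic shape of a dispatch over timeline action types that predates the new action: without a `replacecommit` branch, the code falls through to the exception. The sketch below illustrates that shape only; the action strings follow the trace, but the branch bodies and class name are hypothetical, not Hudi's `HoodieTableMetadataUtil`.

```java
// A switch over timeline action names. The "replacecommit" case is the branch
// this ticket asks for; removing it reproduces the exception seen in the trace.
public class ActionDispatchSketch {
    static String convertInstant(String action) {
        switch (action) {
            case "commit":
            case "deltacommit":
                return "converted-write";
            case "clean":
                return "converted-clean";
            case "replacecommit":           // the missing branch: without it,
                return "converted-replace"; // we fall through to the throw below
            default:
                throw new IllegalArgumentException("Unknown type of action " + action);
        }
    }

    public static void main(String[] args) {
        System.out.println(convertInstant("replacecommit"));
    }
}
```

This also explains why SaveMode.Overwrite triggers it: Overwrite is implemented via insert-overwrite, which writes a `replacecommit` instant, and the metadata-table sync then encounters an action it has no branch for.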
[jira] [Updated] (HUDI-1479) Replace FSUtils.getAllPartitionPaths() with HoodieTableMetadata#getAllPartitionPaths()
[ https://issues.apache.org/jira/browse/HUDI-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-1479: - Description: *Change #1* {code:java} public static List<String> getAllPartitionPaths(FileSystem fs, String basePathStr, boolean useFileListingFromMetadata, boolean verifyListings, boolean assumeDatePartitioning) throws IOException { if (assumeDatePartitioning) { return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr); } else { HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/", useFileListingFromMetadata, verifyListings, false, false); return tableMetadata.getAllPartitionPaths(); } } {code} is the current implementation, where `HoodieTableMetadata.create()` always creates `HoodieBackedTableMetadata`. Instead we should create `FileSystemBackedTableMetadata` if useFileListingFromMetadata==false anyway. This helps address https://github.com/apache/hudi/pull/2398/files#r550709687 *Change #2* On master, we have the `HoodieEngineContext` abstraction, which allows for parallel execution. We should consider moving it to `hudi-common` (it's doable) and then have `FileSystemBackedTableMetadata` redone such that it can do parallelized listings using the passed-in engine, either HoodieSparkEngineContext or HoodieJavaEngineContext. HoodieBackedTableMetadata#getPartitionsToFilesMapping has some parallelized code. We should take one pass and see if that can be redone a bit as well. Food for thought: https://github.com/apache/hudi/pull/2398#discussion_r550711216 *Change #3* There are places where we call fs.listStatus() directly. We should make them go through the HoodieTable.getMetadata()... route as well. Essentially, all listing should be concentrated to `FileSystemBackedTableMetadata` !image-2021-01-05-10-00-35-187.png! 
was: *Change #1* {code:java} public static List getAllPartitionPaths(FileSystem fs, String basePathStr, boolean useFileListingFromMetadata, boolean verifyListings, boolean assumeDatePartitioning) throws IOException { if (assumeDatePartitioning) { return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr); } else { HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/", useFileListingFromMetadata, verifyListings, false, false); return tableMetadata.getAllPartitionPaths(); } } {code} is the current implementation, where `HoodieTableMetadata.create()` always creates `HoodieBackedTableMetadata`. Instead we should create `FileSystemBackedTableMetadata` if useFileListingFromMetadata==false anyways *Change #2* On master, we have the `HoodieEngineContext` abstraction, which allows for parallel execution. We should consider moving it to `hudi-common` (its doable) and then have `FileSystemBackedTableMetadata` redone such that it can do parallelized listings using the passed in engine. either HoodieSparkEngineContext or HoodieJavaEngineContext. HoodieBackedTableMetadata#getPartitionsToFilesMapping has some parallelized code. We should take one pass and see if that can be redone a bit as well. *Change #3* There are places, where we call fs.listStatus() directly. We should make them go through the HoodieTable.getMetadata()... route as well. Essentially, all listing should be concentrated to `FileSystemBackedTableMetadata` !image-2021-01-05-10-00-35-187.png! 
> Replace FSUtils.getAllPartitionPaths() with > HoodieTableMetadata#getAllPartitionPaths() > -- > > Key: HUDI-1479 > URL: https://issues.apache.org/jira/browse/HUDI-1479 > Project: Apache Hudi > Issue Type: Sub-task > Components: Code Cleanup >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Blocker > Fix For: 0.7.0 > > Attachments: image-2021-01-05-10-00-35-187.png > > > *Change #1* > {code:java} > public static List getAllPartitionPaths(FileSystem fs, String > basePathStr, boolean useFileListingFromMetadata, boolean verifyListings, > boolean > assumeDatePartitioning) throws IOException { > if (assumeDatePartitioning) { > return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr); > } else { > HoodieTableMetadata tableMetadata = > HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/", > useFileListingFromMetadata, > verifyListings, false, false); > return tableMetadata.getAllPartitionPaths(); > } > } > {code} > is the current implementation, where `HoodieTableMetadata.create()` always > creates `HoodieBackedTableMetadata`.
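The dispatch proposed in Change #1 above can be sketched as follows. This is an illustrative sketch only: the stand-in types (`TableMetadataSketch`, `FileSystemBackedSketch`, `HoodieBackedSketch`, `MetadataFactorySketch`) are hypothetical simplifications of the real Hudi classes, showing only the branch on `useFileListingFromMetadata`.

```java
// Illustrative stand-ins for HoodieTableMetadata and its two implementations.
interface TableMetadataSketch {
}

class FileSystemBackedSketch implements TableMetadataSketch {
}

class HoodieBackedSketch implements TableMetadataSketch {
}

class MetadataFactorySketch {
  // Mirrors the suggestion: only build the metadata-table-backed reader when
  // the metadata table is enabled; otherwise fall back to filesystem listing.
  static TableMetadataSketch create(boolean useFileListingFromMetadata) {
    return useFileListingFromMetadata
        ? new HoodieBackedSketch()
        : new FileSystemBackedSketch();
  }
}
```

The point of the change is that callers passing `useFileListingFromMetadata == false` never pay the cost of constructing the metadata-table reader at all.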
[jira] [Commented] (HUDI-1308) Issues found during testing RFC-15
[ https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259072#comment-17259072 ] Vinoth Chandar commented on HUDI-1308: -- More testing on S3 from [~vbalaji]
{code}
Caused by: org.apache.hudi.exception.HoodieIOException: getFileStatus on s3://robinhood-encrypted-hudi-data-cove/dummy/balaji/sickle/public/client_ledger_clientledgerbalance/test5_v4/.hoodie/.aux/.bootstrap/.partitions/-----0_1-0-1_01.hfile: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
{code}
> Issues found during testing RFC-15 > -- > > Key: HUDI-1308 > URL: https://issues.apache.org/jira/browse/HUDI-1308 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Blocker > Fix For: 0.7.0 > > > This is an umbrella ticket containing all the issues found during testing > RFC-15 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1479) Replace FSUtils.getAllPartitionPaths() with HoodieTableMetadata#getAllPartitionPaths()
[ https://issues.apache.org/jira/browse/HUDI-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259090#comment-17259090 ] Vinoth Chandar commented on HUDI-1479: -- [~uditme] I have updated the description with detailed steps to do. I think this will be a super impactful change. let me know if you are able to pick this up. > Replace FSUtils.getAllPartitionPaths() with > HoodieTableMetadata#getAllPartitionPaths() > -- > > Key: HUDI-1479 > URL: https://issues.apache.org/jira/browse/HUDI-1479 > Project: Apache Hudi > Issue Type: Sub-task > Components: Code Cleanup >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Blocker > Fix For: 0.7.0 > > Attachments: image-2021-01-05-10-00-35-187.png > > > *Change #1* > {code:java} > public static List getAllPartitionPaths(FileSystem fs, String > basePathStr, boolean useFileListingFromMetadata, boolean verifyListings, > boolean > assumeDatePartitioning) throws IOException { > if (assumeDatePartitioning) { > return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr); > } else { > HoodieTableMetadata tableMetadata = > HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/", > useFileListingFromMetadata, > verifyListings, false, false); > return tableMetadata.getAllPartitionPaths(); > } > } > {code} > is the current implementation, where `HoodieTableMetadata.create()` always > creates `HoodieBackedTableMetadata`. Instead we should create > `FileSystemBackedTableMetadata` if useFileListingFromMetadata==false anyways. > This helps address https://github.com/apache/hudi/pull/2398/files#r550709687 > *Change #2* > On master, we have the `HoodieEngineContext` abstraction, which allows for > parallel execution. We should consider moving it to `hudi-common` (its > doable) and then have `FileSystemBackedTableMetadata` redone such that it can > do parallelized listings using the passed in engine. either > HoodieSparkEngineContext or HoodieJavaEngineContext. 
> HoodieBackedTableMetadata#getPartitionsToFilesMapping has some parallelized > code. We should take one pass and see if that can be redone a bit as well. > Food for thought: > https://github.com/apache/hudi/pull/2398#discussion_r550711216 > > *Change #3* > There are places, where we call fs.listStatus() directly. We should make them > go through the HoodieTable.getMetadata()... route as well. Essentially, all > listing should be concentrated to `FileSystemBackedTableMetadata` > !image-2021-01-05-10-00-35-187.png!
[GitHub] [hudi] codecov-io edited a comment on pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering
codecov-io edited a comment on pull request #2379: URL: https://github.com/apache/hudi/pull/2379#issuecomment-751244130 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yanghua commented on a change in pull request #2405: [HUDI-1506] Fix wrong exception thrown in HoodieAvroUtils
yanghua commented on a change in pull request #2405: URL: https://github.com/apache/hudi/pull/2405#discussion_r551988234
## File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
## @@ -428,10 +429,14 @@ public static Object getNestedFieldVal(GenericRecord record, String fieldName, b
 if (returnNullIfNotFound) {
   return null;
-} else {
+}
+
+if (recordSchema.getField(parts[i]) == null) {
Review comment: Can it be `else if`?
## File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
## @@ -428,10 +429,14 @@ public static Object getNestedFieldVal(GenericRecord record, String fieldName, b
 if (returnNullIfNotFound) {
   return null;
-} else {
+}
+
+if (recordSchema.getField(parts[i]) == null) {
   throw new HoodieException(
       fieldName + "(Part -" + parts[i] + ") field not found in record. Acceptable fields were :"
           + valueNode.getSchema().getFields().stream().map(Field::name).collect(Collectors.toList()));
Review comment: Can we reuse `recordSchema` here?
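The reviewer's `else if` suggestion is about making the two checks mutually exclusive, so the missing-field test never runs once the return-null branch has fired. A minimal sketch of the shape being asked for (the method name, flags, and exception below are illustrative stand-ins, not the actual HoodieAvroUtils code):

```java
// Illustrative sketch of the suggested control flow: chaining with `else if`
// guarantees at most one of the two branches executes.
class NestedFieldSketch {
  static Object resolve(boolean returnNullIfNotFound, boolean fieldMissing, Object value) {
    if (returnNullIfNotFound) {
      return null;
    } else if (fieldMissing) {
      // Stand-in for the HoodieException thrown in the real code.
      throw new IllegalStateException("field not found in record");
    }
    return value;
  }
}
```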
[GitHub] [hudi] codecov-io edited a comment on pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering
codecov-io edited a comment on pull request #2379: URL: https://github.com/apache/hudi/pull/2379#issuecomment-751244130 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=h1) Report > Merging [#2379](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=desc) (4cc4b35) into [master](https://codecov.io/gh/apache/hudi/commit/6cdf59d92b1c260abae82bba7d30d8ac280bddbf?el=desc) (6cdf59d) will **decrease** coverage by `42.58%`. > The diff coverage is `0.00%`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2379/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=tree)
```diff
@@             Coverage Diff              @@
##             master   #2379       +/-   ##
============================================
- Coverage     52.23%   9.65%    -42.59%
+ Complexity     2662      48      -2614
============================================
  Files           335      53       -282
  Lines         14981    1927     -13054
  Branches       1506     231      -1275
============================================
- Hits           7825     186      -7639
+ Misses         6533    1728      -4805
+ Partials        623      13       -610
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.65% <0.00%> (-60.01%)` | `0.00 <0.00> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...org/apache/hudi/utilities/HoodieClusteringJob.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZUNsdXN0ZXJpbmdKb2IuamF2YQ==) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | | | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `0.00% <0.00%> (-70.76%)` | `0.00 <0.00> (-49.00)` | | | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | ... and [312 more](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree-more)
[GitHub] [hudi] afilipchik commented on a change in pull request #2380: [Hudi 73] Adding support for vanilla AvroKafkaSource
afilipchik commented on a change in pull request #2380: URL: https://github.com/apache/hudi/pull/2380#discussion_r552037619 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/serde/AbstractHoodieKafkaAvroDeserializer.java ## @@ -0,0 +1,130 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.utilities.serde; + +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.utilities.UtilHelpers; +import org.apache.hudi.utilities.schema.SchemaProvider; +import org.apache.hudi.utilities.serde.config.HoodieKafkaAvroDeserializationConfig; + +import kafka.utils.VerifiableProperties; +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericDatumReader; +import org.apache.avro.io.DatumReader; +import org.apache.avro.io.DecoderFactory; +import org.apache.avro.specific.SpecificDatumReader; +import org.apache.kafka.common.errors.SerializationException; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.Map; +import java.util.Objects; +import java.util.Properties; + +import static org.apache.hudi.utilities.serde.config.HoodieKafkaAvroDeserializationConfig.SCHEMA_PROVIDER_CLASS_PROP; + +public class AbstractHoodieKafkaAvroDeserializer { + + private final DecoderFactory decoderFactory = DecoderFactory.get(); + private boolean useSpecificAvroReader = false; + private Schema sourceSchema; + private Schema targetSchema; + + public AbstractHoodieKafkaAvroDeserializer(VerifiableProperties properties) { +// this.sourceSchema = new Schema.Parser().parse(properties.props().getProperty(FilebasedSchemaProvider.Config.SOURCE_SCHEMA_PROP)); +TypedProperties typedProperties = new TypedProperties(); +copyProperties(typedProperties, properties.props()); +try { + SchemaProvider schemaProvider = UtilHelpers.createSchemaProvider( + typedProperties.getString(SCHEMA_PROVIDER_CLASS_PROP), typedProperties, null); + this.sourceSchema = Objects.requireNonNull(schemaProvider).getSourceSchema(); + this.targetSchema = Objects.requireNonNull(schemaProvider).getTargetSchema(); +} catch (IOException e) { + throw new RuntimeException(e); +} + } + + private void copyProperties(TypedProperties typedProperties, Properties properties) { +for (Map.Entry entry : properties.entrySet()) { + 
typedProperties.put(entry.getKey(), entry.getValue()); +} + } + + protected void configure(HoodieKafkaAvroDeserializationConfig config) { +useSpecificAvroReader = config + .getBoolean(HoodieKafkaAvroDeserializationConfig.SPECIFIC_AVRO_READER_CONFIG); + } + + protected Object deserialize(byte[] payload) throws SerializationException { +return deserialize(null, null, payload, targetSchema); + } + + /** + * Just like single-parameter version but accepts an Avro schema to use for reading. + * + * @param payload serialized data + * @param readerSchema schema to use for Avro read (optional, enables Avro projection) + * @return the deserialized object + */ + protected Object deserialize(byte[] payload, Schema readerSchema) throws SerializationException { +return deserialize(null, null, payload, readerSchema); + } + + protected Object deserialize(String topic, Boolean isKey, byte[] payload, Schema readerSchema) { Review comment: It should be coming from the configured schema provider.
[jira] [Created] (HUDI-1507) Hive sync having issues w/ Clustering
sivabalan narayanan created HUDI-1507: - Summary: Hive sync having issues w/ Clustering Key: HUDI-1507 URL: https://issues.apache.org/jira/browse/HUDI-1507 Project: Apache Hudi Issue Type: Bug Components: Storage Management Affects Versions: 0.7.0 Reporter: sivabalan narayanan
I was trying out clustering w/ test suite job and ran into hive sync issues.
{code}
21/01/05 16:45:05 WARN DagNode: Executing ClusteringNode node 5522853c-653b-4d92-acf4-d299c263a77f
21/01/05 16:45:05 WARN AbstractHoodieWriteClient: Scheduling clustering at instant time :20210105164505 clustering strategy org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy, clustering sort cols : _row_key, target partitions for clustering :: 0, inline cluster max commit : 1
21/01/05 16:45:05 WARN HoodieTestSuiteWriter: Clustering instant :: 20210105164505
21/01/05 16:45:22 WARN DagScheduler: Executing node "second_hive_sync" :: \{"queue_name":"adhoc","engine":"mr","name":"80325009-bb92-4df5-8c34-71bd75d001b8","config":"second_hive_sync"}
21/01/05 16:45:22 ERROR HiveSyncTool: Got runtime exception when hive syncing
org.apache.hudi.exception.HoodieIOException: unknown action in timeline replacecommit
  at org.apache.hudi.common.table.timeline.TimelineUtils.lambda$getAffectedPartitions$1(TimelineUtils.java:99)
  at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)
  at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
  at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
  at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
  at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
  at org.apache.hudi.common.table.timeline.TimelineUtils.getAffectedPartitions(TimelineUtils.java:102)
  at org.apache.hudi.common.table.timeline.TimelineUtils.getPartitionsWritten(TimelineUtils.java:50)
  at org.apache.hudi.sync.common.AbstractSyncHoodieClient.getPartitionsWrittenToSince(AbstractSyncHoodieClient.java:136)
  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145)
  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
  at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncHive(DeltaSync.java:589)
  at org.apache.hudi.integ.testsuite.helpers.HiveServiceProvider.syncToLocalHiveIfNeeded(HiveServiceProvider.java:53)
  at org.apache.hudi.integ.testsuite.dag.nodes.HiveSyncNode.execute(HiveSyncNode.java:41)
  at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139)
  at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
{code}
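The top stack frame points into a dispatch over timeline action names inside TimelineUtils.getAffectedPartitions that apparently has no branch for the `replacecommit` action clustering produces. A hedged sketch of that failure mode and the likely fix direction follows; the action strings match Hudi's timeline action names, but the method and everything else here is an illustrative stand-in, not the real TimelineUtils code:

```java
import java.util.List;

// Illustrative sketch: without a "replacecommit" branch, clustering commits
// fall through to the "unknown action in timeline" exception seen above.
class AffectedPartitionsSketch {
  static List<String> affectedPartitions(String action, List<String> partitions) {
    switch (action) {
      case "commit":
      case "deltacommit":
      case "replacecommit": // the branch the stack trace suggests is missing
        return partitions;
      default:
        throw new IllegalStateException("unknown action in timeline " + action);
    }
  }
}
```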
[GitHub] [hudi] afilipchik commented on a change in pull request #2380: [Hudi 73] Adding support for vanilla AvroKafkaSource
afilipchik commented on a change in pull request #2380: URL: https://github.com/apache/hudi/pull/2380#discussion_r552040233 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/serde/AbstractHoodieKafkaAvroDeserializer.java ## @@ -0,0 +1,130 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.utilities.serde; + +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.utilities.UtilHelpers; +import org.apache.hudi.utilities.schema.SchemaProvider; +import org.apache.hudi.utilities.serde.config.HoodieKafkaAvroDeserializationConfig; + +import kafka.utils.VerifiableProperties; +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericDatumReader; +import org.apache.avro.io.DatumReader; +import org.apache.avro.io.DecoderFactory; +import org.apache.avro.specific.SpecificDatumReader; +import org.apache.kafka.common.errors.SerializationException; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.Map; +import java.util.Objects; +import java.util.Properties; + +import static org.apache.hudi.utilities.serde.config.HoodieKafkaAvroDeserializationConfig.SCHEMA_PROVIDER_CLASS_PROP; + +public class AbstractHoodieKafkaAvroDeserializer { + + private final DecoderFactory decoderFactory = DecoderFactory.get(); + private boolean useSpecificAvroReader = false; + private Schema sourceSchema; + private Schema targetSchema; + + public AbstractHoodieKafkaAvroDeserializer(VerifiableProperties properties) { +// this.sourceSchema = new Schema.Parser().parse(properties.props().getProperty(FilebasedSchemaProvider.Config.SOURCE_SCHEMA_PROP)); +TypedProperties typedProperties = new TypedProperties(); +copyProperties(typedProperties, properties.props()); +try { + SchemaProvider schemaProvider = UtilHelpers.createSchemaProvider( + typedProperties.getString(SCHEMA_PROVIDER_CLASS_PROP), typedProperties, null); + this.sourceSchema = Objects.requireNonNull(schemaProvider).getSourceSchema(); + this.targetSchema = Objects.requireNonNull(schemaProvider).getTargetSchema(); +} catch (IOException e) { + throw new RuntimeException(e); +} + } + + private void copyProperties(TypedProperties typedProperties, Properties properties) { +for (Map.Entry entry : properties.entrySet()) { + 
typedProperties.put(entry.getKey(), entry.getValue()); +} + } + + protected void configure(HoodieKafkaAvroDeserializationConfig config) { +useSpecificAvroReader = config + .getBoolean(HoodieKafkaAvroDeserializationConfig.SPECIFIC_AVRO_READER_CONFIG); + } + + protected Object deserialize(byte[] payload) throws SerializationException { +return deserialize(null, null, payload, targetSchema); + } + + /** + * Just like single-parameter version but accepts an Avro schema to use for reading. + * + * @param payload serialized data + * @param readerSchema schema to use for Avro read (optional, enables Avro projection) + * @return the deserialized object + */ + protected Object deserialize(byte[] payload, Schema readerSchema) throws SerializationException { +return deserialize(null, null, payload, readerSchema); + } + + protected Object deserialize(String topic, Boolean isKey, byte[] payload, Schema readerSchema) { Review comment: this assumes the message starts with schema version (code looks like from Confluent deserializer). It doesn't belong to AbstractHoodieKafkaAvroDeserializer.java
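The reviewer's objection is that `buffer.getInt()` silently presumes the Confluent Schema Registry wire format, in which each Kafka message begins with a magic byte and a 4-byte schema id before the Avro payload; a "vanilla" Avro message has no such prefix. A minimal sketch of that framing (the class and method here are illustrative, not Hudi's actual deserializer):

```java
import java.nio.ByteBuffer;

// Illustrative sketch of Confluent-style framing: byte 0 is a magic marker,
// bytes 1-4 carry the big-endian schema registry id, and the Avro-encoded
// payload follows. Plain Avro messages carry none of this.
class WireFormatSketch {
  static int readSchemaId(byte[] payload) {
    ByteBuffer buffer = ByteBuffer.wrap(payload);
    byte magic = buffer.get();
    if (magic != 0) {
      throw new IllegalArgumentException("Unknown magic byte: " + magic);
    }
    return buffer.getInt(); // the 4-byte schema id the comment is about
  }
}
```

This is why the comment says the `getInt()` call does not belong in a deserializer meant for vanilla Avro: on an unframed message it would read four bytes of Avro data as a bogus schema id.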
[GitHub] [hudi] nsivabalan commented on a change in pull request #2402: [HUDI-1383] Fixing sorting of partition vals for hive sync computation
nsivabalan commented on a change in pull request #2402: URL: https://github.com/apache/hudi/pull/2402#discussion_r552046190 ## File path: hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java ## @@ -21,10 +21,10 @@ import org.apache.hudi.common.model.HoodieRecord; import org.apache.hudi.common.testutils.SchemaTestUtil; import org.apache.hudi.common.util.Option; -import org.apache.hudi.sync.common.AbstractSyncHoodieClient.PartitionEvent; Review comment: I did the usual reformatting from within intellij. I didn't explicitly fix any of these.
[jira] [Commented] (HUDI-1507) Hive sync having issues w/ Clustering
[ https://issues.apache.org/jira/browse/HUDI-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259040#comment-17259040 ] sivabalan narayanan commented on HUDI-1507: --- CC : [~satish] > Hive sync having issues w/ Clustering > - > > Key: HUDI-1507 > URL: https://issues.apache.org/jira/browse/HUDI-1507 > Project: Apache Hudi > Issue Type: Bug > Components: Storage Management >Affects Versions: 0.7.0 >Reporter: sivabalan narayanan >Priority: Major > > I was trying out clustering w/ test suite job and ran into hive sync issues. > > 21/01/05 16:45:05 WARN DagNode: Executing ClusteringNode node > 5522853c-653b-4d92-acf4-d299c263a77f > 21/01/05 16:45:05 WARN AbstractHoodieWriteClient: Scheduling clustering at > instant time :20210105164505 clustering strategy > org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy, > clustering sort cols : _row_key, target partitions for clustering :: 0, > inline cluster max commit : 1 > 21/01/05 16:45:05 WARN HoodieTestSuiteWriter: Clustering instant :: > 20210105164505 > 21/01/05 16:45:22 WARN DagScheduler: Executing node "second_hive_sync" :: > \{"queue_name":"adhoc","engine":"mr","name":"80325009-bb92-4df5-8c34-71bd75d001b8","config":"second_hive_sync"} > 21/01/05 16:45:22 ERROR HiveSyncTool: Got runtime exception when hive syncing > org.apache.hudi.exception.HoodieIOException: unknown action in timeline > replacecommit > at > org.apache.hudi.common.table.timeline.TimelineUtils.lambda$getAffectedPartitions$1(TimelineUtils.java:99) > at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267) > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at 
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > at > org.apache.hudi.common.table.timeline.TimelineUtils.getAffectedPartitions(TimelineUtils.java:102) > at > org.apache.hudi.common.table.timeline.TimelineUtils.getPartitionsWritten(TimelineUtils.java:50) > at > org.apache.hudi.sync.common.AbstractSyncHoodieClient.getPartitionsWrittenToSince(AbstractSyncHoodieClient.java:136) > at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145) > at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncHive(DeltaSync.java:589) > at > org.apache.hudi.integ.testsuite.helpers.HiveServiceProvider.syncToLocalHiveIfNeeded(HiveServiceProvider.java:53) > at > org.apache.hudi.integ.testsuite.dag.nodes.HiveSyncNode.execute(HiveSyncNode.java:41) > at > org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139) > at > org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)
[GitHub] [hudi] lw309637554 commented on a change in pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering
lw309637554 commented on a change in pull request #2379: URL: https://github.com/apache/hudi/pull/2379#discussion_r551984532 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java ## @@ -0,0 +1,159 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
[GitHub] [hudi] lw309637554 commented on a change in pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering
lw309637554 commented on a change in pull request #2379: URL: https://github.com/apache/hudi/pull/2379#discussion_r551984722 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java ## @@ -0,0 +1,159 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.utilities; + +import com.beust.jcommander.JCommander; +import com.beust.jcommander.Parameter; +import org.apache.avro.Schema; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hudi.client.SparkRDDWriteClient; +import org.apache.hudi.client.WriteStatus; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.TableSchemaResolver; +import org.apache.hudi.common.util.Option; +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; + +import java.io.Serializable; +import java.util.ArrayList; +import java.util.List; + +public class HoodieClusteringJob { + + private static final Logger LOG = LogManager.getLogger(HoodieClusteringJob.class); + private final Config cfg; + private transient FileSystem fs; + private TypedProperties props; + private final JavaSparkContext jsc; + + public HoodieClusteringJob(JavaSparkContext jsc, Config cfg) { +this.cfg = cfg; +this.jsc = jsc; +this.props = cfg.propsFilePath == null +? 
UtilHelpers.buildProperties(cfg.configs) +: readConfigFromFileSystem(jsc, cfg); + } + + private TypedProperties readConfigFromFileSystem(JavaSparkContext jsc, Config cfg) { +final FileSystem fs = FSUtils.getFs(cfg.basePath, jsc.hadoopConfiguration()); + +return UtilHelpers +.readConfig(fs, new Path(cfg.propsFilePath), cfg.configs) +.getConfig(); + } + + public static class Config implements Serializable { +@Parameter(names = {"--base-path", "-sp"}, description = "Base path for the table", required = true) +public String basePath = null; +@Parameter(names = {"--table-name", "-tn"}, description = "Table name", required = true) +public String tableName = null; +@Parameter(names = {"--instant-time", "-it"}, description = "Clustering Instant time", required = true) Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] lw309637554 commented on a change in pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering
lw309637554 commented on a change in pull request #2379: URL: https://github.com/apache/hudi/pull/2379#discussion_r551984380 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java ## @@ -0,0 +1,159 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.utilities; + +import com.beust.jcommander.JCommander; +import com.beust.jcommander.Parameter; +import org.apache.avro.Schema; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hudi.client.SparkRDDWriteClient; +import org.apache.hudi.client.WriteStatus; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.TableSchemaResolver; +import org.apache.hudi.common.util.Option; +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; + +import java.io.Serializable; +import java.util.ArrayList; +import java.util.List; + +public class HoodieClusteringJob { + + private static final Logger LOG = LogManager.getLogger(HoodieClusteringJob.class); + private final Config cfg; + private transient FileSystem fs; + private TypedProperties props; + private final JavaSparkContext jsc; + + public HoodieClusteringJob(JavaSparkContext jsc, Config cfg) { +this.cfg = cfg; +this.jsc = jsc; +this.props = cfg.propsFilePath == null +? 
UtilHelpers.buildProperties(cfg.configs) +: readConfigFromFileSystem(jsc, cfg); + } + + private TypedProperties readConfigFromFileSystem(JavaSparkContext jsc, Config cfg) { +final FileSystem fs = FSUtils.getFs(cfg.basePath, jsc.hadoopConfiguration()); + +return UtilHelpers +.readConfig(fs, new Path(cfg.propsFilePath), cfg.configs) +.getConfig(); + } + + public static class Config implements Serializable { +@Parameter(names = {"--base-path", "-sp"}, description = "Base path for the table", required = true) +public String basePath = null; +@Parameter(names = {"--table-name", "-tn"}, description = "Table name", required = true) +public String tableName = null; +@Parameter(names = {"--instant-time", "-it"}, description = "Clustering Instant time", required = true) +public String clusteringInstantTime = null; +@Parameter(names = {"--parallelism", "-pl"}, description = "Parallelism for hoodie insert", required = false) +public int parallelism = 1; +@Parameter(names = {"--spark-master", "-ms"}, description = "Spark master", required = false) +public String sparkMaster = null; +@Parameter(names = {"--spark-memory", "-sm"}, description = "spark memory to use", required = true) +public String sparkMemory = null; +@Parameter(names = {"--retry", "-rt"}, description = "number of retries", required = false) +public int retry = 0; + +@Parameter(names = {"--schedule", "-sc"}, description = "Schedule clustering") +public Boolean runSchedule = false; +@Parameter(names = {"--help", "-h"}, help = true) +public Boolean help = false; + +@Parameter(names = {"--props"}, description = "path to properties file on localfs or dfs, with configurations for " ++ "hoodie client for clustering") +public String propsFilePath = null; + +@Parameter(names = {"--hoodie-conf"}, description = "Any configuration that can be set in the properties file " ++ "(using the CLI parameter \"--props\") can also be passed command line using this parameter. 
This can be repeated", +splitter = IdentitySplitter.class) +public List configs = new ArrayList<>(); + } + + public static void main(String[] args) { +final Config cfg = new Config(); +JCommander cmd = new JCommander(cfg, null, args); +if (cfg.help || args.length == 0) { + cmd.usage(); + System.exit(1); +} +final JavaSparkContext jsc = UtilHelpers.buildSparkContext("clustering-" + cfg.tableName, cfg.sparkMaster, cfg.sparkMemory); +HoodieClusteringJob clusteringJob = new HoodieClusteringJob(jsc, cfg);
[GitHub] [hudi] yanghua commented on pull request #2405: [HUDI-1506] Fix wrong exception thrown in HoodieAvroUtils
yanghua commented on pull request #2405: URL: https://github.com/apache/hudi/pull/2405#issuecomment-754694548 @wangxianghu Also, Travis failed; please check what's wrong.
[GitHub] [hudi] nsivabalan commented on a change in pull request #2402: [HUDI-1383] Fixing sorting of partition vals for hive sync computation
nsivabalan commented on a change in pull request #2402: URL: https://github.com/apache/hudi/pull/2402#discussion_r55204 ## File path: hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java ## @@ -56,7 +56,7 @@ } private static Iterable useJdbcAndSchemaFromCommitMetadata() { -return Arrays.asList(new Object[][] { { true, true }, { true, false }, { false, true }, { false, false } }); +return Arrays.asList(new Object[][] {{true, true}, {true, false}, {false, true}, {false, false}}); Review comment: @n3nash: same comment as above. Can you go to this file in one of your branches, apply reformatting, and let me know if you see similar changes? If not, I will look into whether my reformatting is different.
[GitHub] [hudi] lw309637554 commented on a change in pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering
lw309637554 commented on a change in pull request #2379: URL: https://github.com/apache/hudi/pull/2379#discussion_r551986565 ## File path: hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java ## @@ -682,6 +693,58 @@ public void testInlineClustering() throws Exception { }); } + private HoodieClusteringJob.Config buildHoodieClusteringUtilConfig(String basePath, + String clusteringInstantTime, boolean runSchedule) { +HoodieClusteringJob.Config config = new HoodieClusteringJob.Config(); +config.basePath = basePath; +config.clusteringInstantTime = clusteringInstantTime; +config.runSchedule = runSchedule; +config.propsFilePath = dfsBasePath + "/clusteringjob.properties"; +return config; + } + + @Test + public void testHoodieAsyncClusteringJob() throws Exception { +String tableBasePath = dfsBasePath + "/asyncClustering"; +// Keep it higher than batch-size to test continuous mode +int totalRecords = 3000; + +// Initial bulk insert +HoodieDeltaStreamer.Config cfg = TestHelpers.makeConfig(tableBasePath, WriteOperationType.UPSERT); +cfg.continuousMode = true; +cfg.tableType = HoodieTableType.COPY_ON_WRITE.name(); +cfg.configs.add(String.format("%s=%d", SourceConfigs.MAX_UNIQUE_RECORDS_PROP, totalRecords)); +cfg.configs.add(String.format("%s=false", HoodieCompactionConfig.AUTO_CLEAN_PROP)); +cfg.configs.add(String.format("%s=true", HoodieClusteringConfig.ASYNC_CLUSTERING_ENABLE_OPT_KEY)); +HoodieDeltaStreamer ds = new HoodieDeltaStreamer(cfg, jsc); +deltaStreamerTestRunner(ds, cfg, (r) -> { + TestHelpers.assertAtLeastNCommits(2, tableBasePath, dfs); + // to avoid conflicting with the delta streamer commit, just add 3600s + String clusterInstantTime = HoodieActiveTimeline.COMMIT_FORMATTER + .format(new Date(System.currentTimeMillis() + 3600 * 1000)); + LOG.info("Cluster instant time " + clusterInstantTime); + HoodieClusteringJob.Config scheduleClusteringConfig = buildHoodieClusteringUtilConfig(tableBasePath, + clusterInstantTime, true); + 
HoodieClusteringJob scheduleClusteringJob = new HoodieClusteringJob(jsc, scheduleClusteringConfig); + int scheduleClusteringResult = scheduleClusteringJob.cluster(scheduleClusteringConfig.retry); + if (scheduleClusteringResult == 0) { +LOG.info("Schedule clustering success, now cluster"); +HoodieClusteringJob.Config clusterClusteringConfig = buildHoodieClusteringUtilConfig(tableBasePath, +clusterInstantTime, false); +HoodieClusteringJob clusterClusteringJob = new HoodieClusteringJob(jsc, clusterClusteringConfig); +clusterClusteringJob.cluster(clusterClusteringConfig.retry); +LOG.info("Cluster success"); + } else { +LOG.warn("Schedule clustering failed"); + } + HoodieTableMetaClient metaClient = new HoodieTableMetaClient(this.dfs.getConf(), tableBasePath, true); + int pendingReplaceSize = metaClient.getActiveTimeline().filterPendingReplaceTimeline().getInstants().toArray().length; + int completeReplaceSize = metaClient.getActiveTimeline().getCompletedReplaceTimeline().getInstants().toArray().length; + System.out.println("PendingReplaceSize=" + pendingReplaceSize + ",completeReplaceSize = " + completeReplaceSize); + return completeReplaceSize > 0; Review comment: Because if completeReplaceSize is always <= 0, the runner will throw a timeout exception. I have now added an assert for completeReplaceSize == 1. Since the unit test mainly tests async clustering scheduling and execution, asserting on completeReplaceSize should be sufficient; record-level checks are covered by the clustering unit tests.
[GitHub] [hudi] yanghua commented on pull request #2404: [MINOR] Add Jira URL and Mailing List
yanghua commented on pull request #2404: URL: https://github.com/apache/hudi/pull/2404#issuecomment-754682021 @vinothchandar Do you agree with this change?
[GitHub] [hudi] afilipchik commented on pull request #2380: [Hudi 73] Adding support for vanilla AvroKafkaSource
afilipchik commented on pull request #2380: URL: https://github.com/apache/hudi/pull/2380#issuecomment-754744538 On making AbstractHoodieKafkaAvroDeserializer abstract - it looks like a modified Confluent deserializer, so I believe it should be called like that. If we want to support the Confluent schema registry we need to use the schema id to acquire the writer's schema, otherwise schema evolution will be a pain. I.e., to deserialize an Avro message we need two things: the schema it was written with (it can come from properties, but with Confluent the schema id at the beginning of the message tells you the exact version and can be used to fetch it from the schema registry) and the reader schema (the schema on the reader side, which comes from the schema provider).
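The Confluent wire format described in this comment puts a magic byte and a 4-byte schema id in front of the Avro payload. A minimal standalone sketch of splitting a framed message (hypothetical class name, not the Hudi or Confluent code):

```java
import java.nio.ByteBuffer;

// Sketch: split a Confluent-framed Kafka message into its schema id and Avro payload.
// Wire format: 1 magic byte (0x0), 4-byte big-endian schema id, then Avro binary data.
public class ConfluentFraming {

    public static int schemaId(byte[] message) {
        if (message.length < 5 || message[0] != 0x0) {
            throw new IllegalArgumentException("Not a Confluent-framed message");
        }
        // Bytes 1..4 hold the schema registry id, big-endian.
        return ByteBuffer.wrap(message, 1, 4).getInt();
    }

    public static byte[] avroPayload(byte[] message) {
        byte[] payload = new byte[message.length - 5];
        System.arraycopy(message, 5, payload, 0, payload.length);
        return payload;
    }
}
```

A deserializer would use the extracted id to fetch the exact writer schema from the registry, then pair it with the reader schema coming from the schema provider, which is the two-schema setup the comment describes.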
[jira] [Created] (HUDI-1508) Partition update with global index in MOR tables resulting in duplicate values during read optimized queries
Ryan Pifer created HUDI-1508: Summary: Partition update with global index in MOR tables resulting in duplicate values during read optimized queries Key: HUDI-1508 URL: https://issues.apache.org/jira/browse/HUDI-1508 Project: Apache Hudi Issue Type: Bug Reporter: Ryan Pifer Hudi handles updating a record's partition path by locating the existing record, performing a delete in the previous partition, and performing an insert in the new partition. In the case of Merge-on-Read tables the delete operation, like any update operation, is added as a log file. However, since an insert occurs in the new partition, the record is added in a parquet file. Querying using `QUERY_TYPE_READ_OPTIMIZED_OPT_VAL` fetches only parquet files, and now we have the case where two records for a given primary key are present. -- This message was sent by Atlassian Jira (v8.3.4#803005)
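The duplicate described in this issue can be modeled in a few lines: the delete lands in a log file that read-optimized queries skip, while the insert lands in a new base file that they do read. A toy standalone model (illustrative only, not Hudi code):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a MOR partition-path update under a global index.
// Records are {partition, key} pairs; deletes go to "log files",
// inserts go to "base files" (parquet).
public class ReadOptimizedDuplicate {
    static List<String[]> baseFiles = new ArrayList<>();
    static List<String[]> logDeletes = new ArrayList<>();

    static void updatePartition(String key, String oldPart, String newPart) {
        logDeletes.add(new String[] {oldPart, key}); // delete written to a log file
        baseFiles.add(new String[] {newPart, key});  // insert written to a parquet file
    }

    // Read-optimized query: reads base files only, ignores log files.
    static long readOptimizedCount(String key) {
        return baseFiles.stream().filter(r -> r[1].equals(key)).count();
    }

    // Snapshot query: merges log deletes on top of base files.
    static long snapshotCount(String key) {
        return baseFiles.stream()
            .filter(r -> r[1].equals(key))
            .filter(r -> logDeletes.stream()
                .noneMatch(d -> d[0].equals(r[0]) && d[1].equals(r[1])))
            .count();
    }
}
```

With one record moved from an old partition to a new one, the snapshot view sees a single record while the read-optimized view sees two, which is exactly the reported behavior.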
[jira] [Assigned] (HUDI-1507) Hive sync having issues w/ Clustering
[ https://issues.apache.org/jira/browse/HUDI-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] satish reassigned HUDI-1507: Assignee: satish > Hive sync having issues w/ Clustering > - > > Key: HUDI-1507 > URL: https://issues.apache.org/jira/browse/HUDI-1507 > Project: Apache Hudi > Issue Type: Bug > Components: Storage Management >Affects Versions: 0.7.0 >Reporter: sivabalan narayanan >Assignee: satish >Priority: Major > > I was trying out clustering w/ test suite job and ran into hive sync issues. > > 21/01/05 16:45:05 WARN DagNode: Executing ClusteringNode node > 5522853c-653b-4d92-acf4-d299c263a77f > 21/01/05 16:45:05 WARN AbstractHoodieWriteClient: Scheduling clustering at > instant time :20210105164505 clustering strategy > org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy, > clustering sort cols : _row_key, target partitions for clustering :: 0, > inline cluster max commit : 1 > 21/01/05 16:45:05 WARN HoodieTestSuiteWriter: Clustering instant :: > 20210105164505 > 21/01/05 16:45:22 WARN DagScheduler: Executing node "second_hive_sync" :: > \{"queue_name":"adhoc","engine":"mr","name":"80325009-bb92-4df5-8c34-71bd75d001b8","config":"second_hive_sync"} > 21/01/05 16:45:22 ERROR HiveSyncTool: Got runtime exception when hive syncing > org.apache.hudi.exception.HoodieIOException: unknown action in timeline > replacecommit > at > org.apache.hudi.common.table.timeline.TimelineUtils.lambda$getAffectedPartitions$1(TimelineUtils.java:99) > at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267) > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at 
java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > at > org.apache.hudi.common.table.timeline.TimelineUtils.getAffectedPartitions(TimelineUtils.java:102) > at > org.apache.hudi.common.table.timeline.TimelineUtils.getPartitionsWritten(TimelineUtils.java:50) > at > org.apache.hudi.sync.common.AbstractSyncHoodieClient.getPartitionsWrittenToSince(AbstractSyncHoodieClient.java:136) > at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145) > at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncHive(DeltaSync.java:589) > at > org.apache.hudi.integ.testsuite.helpers.HiveServiceProvider.syncToLocalHiveIfNeeded(HiveServiceProvider.java:53) > at > org.apache.hudi.integ.testsuite.dag.nodes.HiveSyncNode.execute(HiveSyncNode.java:41) > at > org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139) > at > org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)
[jira] [Updated] (HUDI-1399) support a independent clustering spark job to asynchronously clustering
[ https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-1399: - Status: Patch Available (was: In Progress) > support a independent clustering spark job to asynchronously clustering > > > Key: HUDI-1399 > URL: https://issues.apache.org/jira/browse/HUDI-1399 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: liwei >Assignee: liwei >Priority: Blocker > Labels: pull-request-available > Fix For: 0.7.0 >
[GitHub] [hudi] codecov-io edited a comment on pull request #2400: Some fixes and enhancements to test suite framework
codecov-io edited a comment on pull request #2400: URL: https://github.com/apache/hudi/pull/2400#issuecomment-753557036 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2400?src=pr=h1) Report > Merging [#2400](https://codecov.io/gh/apache/hudi/pull/2400?src=pr=desc) (ab40bd6) into [master](https://codecov.io/gh/apache/hudi/commit/698694a1571cdcc9848fc79aa34c8cbbf9662bc4?el=desc) (698694a) will **decrease** coverage by `40.20%`. > The diff coverage is `0.00%`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2400/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2400?src=pr=tree)

```diff
@@              Coverage Diff              @@
##             master    #2400       +/-  ##
============================================
- Coverage     50.23%   10.03%   -40.21%
+ Complexity     2985       48     -2937
============================================
  Files           410       52      -358
  Lines         18398     1854    -16544
  Branches       1884      224     -1660
============================================
- Hits           9242      186     -9056
+ Misses         8398     1655     -6743
+ Partials        758       13      -745
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `10.03% <0.00%> (-59.63%)` | `0.00 <0.00> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2400?src=pr=tree) | Coverage Δ | Complexity Δ |
|---|---|---|
| `...apache/hudi/utilities/deltastreamer/DeltaSync.java` | `0.00% <0.00%> (-70.76%)` | `0.00 <0.00> (-49.00)` |
| `...va/org/apache/hudi/utilities/IdentitySplitter.java` | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` |
| `...va/org/apache/hudi/utilities/schema/SchemaSet.java` | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` |
| `...a/org/apache/hudi/utilities/sources/RowSource.java` | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` |
| `.../org/apache/hudi/utilities/sources/AvroSource.java` | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` |
| `.../org/apache/hudi/utilities/sources/JsonSource.java` | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` |
| `...rg/apache/hudi/utilities/sources/CsvDFSSource.java` | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` |
| `...g/apache/hudi/utilities/sources/JsonDFSSource.java` | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` |
| `...apache/hudi/utilities/sources/JsonKafkaSource.java` | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` |
| `...pache/hudi/utilities/sources/ParquetDFSSource.java` | `0.00% <0.00%>
[jira] [Updated] (HUDI-1507) Hive sync having issues w/ Clustering
[ https://issues.apache.org/jira/browse/HUDI-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1507: - Labels: pull-request-available (was: ) > Hive sync having issues w/ Clustering > - > > Key: HUDI-1507 > URL: https://issues.apache.org/jira/browse/HUDI-1507 > Project: Apache Hudi > Issue Type: Bug > Components: Storage Management >Affects Versions: 0.7.0 >Reporter: sivabalan narayanan >Assignee: satish >Priority: Major > Labels: pull-request-available
[GitHub] [hudi] satishkotha opened a new pull request #2407: [HUDI-1507] Change timeline utils to support reading replacecommit
satishkotha opened a new pull request #2407: URL: https://github.com/apache/hudi/pull/2407 ## What is the purpose of the pull request Change timeline utils to support reading replacecommit metadata ## Brief change log HiveSync uses TimelineUtils to get modified partitions. Add support for replacecommit in TimelineUtils. ## Verify this pull request This change added tests ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
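The HUDI-1507 stack trace shows TimelineUtils throwing `unknown action in timeline replacecommit` while collecting affected partitions. A minimal standalone sketch of the kind of change this PR describes, with hypothetical names (the real method and action constants live in Hudi's TimelineUtils and HoodieTimeline):

```java
import java.util.Arrays;
import java.util.List;

// Sketch: an action check that recognizes replacecommit alongside the other
// write actions, instead of failing on it during hive sync.
public class TimelineActions {
    private static final List<String> PARTITION_BEARING_ACTIONS =
        Arrays.asList("commit", "deltacommit", "replacecommit");

    // Returns true when partition metadata can be read from this action's
    // commit metadata; throws for actions the sync path does not understand.
    public static boolean canExtractPartitions(String action) {
        if (PARTITION_BEARING_ACTIONS.contains(action)) {
            return true;
        }
        throw new IllegalStateException("unknown action in timeline " + action);
    }
}
```

The core of the fix is simply including the `replacecommit` action in the set of actions whose metadata is parsed, so clustering's replace commits no longer abort hive sync.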
[GitHub] [hudi] WTa-hash commented on issue #2229: [SUPPORT] UpsertPartitioner performance
WTa-hash commented on issue #2229: URL: https://github.com/apache/hudi/issues/2229#issuecomment-754894794

@bvaradar - I would like to understand a little more about what is going on in the Spark stage "Getting small files from partitions" shown in the screenshot. ![img](https://user-images.githubusercontent.com/64644025/103697708-e956c080-4f65-11eb-9fdf-1e36d2165d5e.PNG)

In the executor logs, I see the following:

```
2021-01-05 20:26:52,176 INFO [dispatcher-event-loop-1] org.apache.spark.executor.CoarseGrainedExecutorBackend:Got assigned task 4686
2021-01-05 20:26:52,176 INFO [Executor task launch worker for task 4686] org.apache.spark.executor.Executor:Running task 0.0 in stage 701.0 (TID 4686)
2021-01-05 20:26:52,176 INFO [Executor task launch worker for task 4686] org.apache.spark.MapOutputTrackerWorker:Updating epoch to 202 and clearing cache
2021-01-05 20:26:52,177 INFO [Executor task launch worker for task 4686] org.apache.spark.broadcast.TorrentBroadcast:Started reading broadcast variable 502
2021-01-05 20:26:52,178 INFO [Executor task launch worker for task 4686] org.apache.spark.storage.memory.MemoryStore:Block broadcast_502_piece0 stored as bytes in memory (estimated size 196.8 KB, free 4.3 GB)
2021-01-05 20:26:52,178 INFO [Executor task launch worker for task 4686] org.apache.spark.broadcast.TorrentBroadcast:Reading broadcast variable 502 took 1 ms
2021-01-05 20:26:52,180 INFO [Executor task launch worker for task 4686] org.apache.spark.storage.memory.MemoryStore:Block broadcast_502 stored as values in memory (estimated size 637.1 KB, free 4.3 GB)
2021-01-05 20:26:52,198 INFO [Executor task launch worker for task 4686] org.apache.spark.MapOutputTrackerWorker:Don't have map outputs for shuffle 201, fetching them
2021-01-05 20:26:52,198 INFO [Executor task launch worker for task 4686] org.apache.spark.MapOutputTrackerWorker:Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@ip-xxx-xx-xxx-xxx:35039)
2021-01-05 20:26:52,199 INFO [Executor task launch worker for task 4686] org.apache.spark.MapOutputTrackerWorker:Got the output locations
2021-01-05 20:26:52,199 INFO [Executor task launch worker for task 4686] org.apache.spark.storage.ShuffleBlockFetcherIterator:Getting 18 non-empty blocks including 6 local blocks and 12 remote blocks
2021-01-05 20:26:52,199 INFO [Executor task launch worker for task 4686] org.apache.spark.storage.ShuffleBlockFetcherIterator:Started 2 remote fetches in 0 ms
2021-01-05 20:26:53,287 INFO [Executor task launch worker for task 4686] com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream:close closed:false s3://...table/.hoodie/.temp/20210105202638/1900-01-01/6ac65ea6-5378-4022-9a54-dfda75d6b53d-0_0-701-4686_20210105202638.parquet.marker.MERGE
2021-01-05 20:26:53,384 INFO [pool-22-thread-1] com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem:Opening 's3://...table/1900-01-01/6ac65ea6-5378-4022-9a54-dfda75d6b53d-0_0-659-4407_20210105202508.parquet' for reading
2021-01-05 20:27:40,841 INFO [Executor task launch worker for task 4686] com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream:close closed:false s3://...table/1900-01-01/6ac65ea6-5378-4022-9a54-dfda75d6b53d-0_0-701-4686_20210105202638.parquet
2021-01-05 20:27:41,708 INFO [Executor task launch worker for task 4686] org.apache.spark.storage.memory.MemoryStore:Block rdd_2026_0 stored as values in memory (estimated size 390.0 B, free 4.3 GB)
2021-01-05 20:27:41,713 INFO [Executor task launch worker for task 4686] org.apache.spark.executor.Executor:Finished task 0.0 in stage 701.0 (TID 4686). 3169 bytes result sent to driver
```

Does this mean the ~50 seconds for this task are spent creating a SINGLE new parquet file from the new data, using the "small parquet file" as its base? If that reading is correct, is there a configuration to split the records evenly across executors so that multiple writes can occur in parallel?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
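On the parallelism question above: as a hedged pointer rather than a definitive answer (the exact keys and behavior should be checked against the Hudi version in use), write-side file sizing and upsert parallelism are typically tuned with options like the following. Note that updates routed into one existing small file group are merged by a single task, so raising shuffle parallelism alone may not split that one file's write:

```properties
# Illustrative Hudi write configs (example values, not necessarily defaults;
# verify against your Hudi version's configuration docs before relying on them).
hoodie.upsert.shuffle.parallelism=200       # parallelism of the upsert shuffle
hoodie.parquet.small.file.limit=104857600   # files under ~100 MB count as "small" and receive new inserts
hoodie.parquet.max.file.size=125829120      # target max size (~120 MB) for a base parquet file
hoodie.copyonwrite.insert.split.size=500000 # inserts packed into a single file group
```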
[GitHub] [hudi] satishkotha commented on a change in pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering
satishkotha commented on a change in pull request #2379: URL: https://github.com/apache/hudi/pull/2379#discussion_r552133598

File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java

```diff
@@ -109,6 +111,9 @@ public static void main(String[] args) {
     if (result == -1) {
       LOG.error(resultMsg + " failed");
     } else {
+      if (cfg.runSchedule) {
+        System.out.println("The schedule instant time is " + cfg.clusteringInstantTime);
```

Review comment: Can you change this to a LOG message?

File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java

```diff
@@ -153,7 +161,12 @@ private int doSchedule(JavaSparkContext jsc) throws Exception {
     String schemaStr = getSchemaFromLatestInstant();
     SparkRDDWriteClient client = UtilHelpers.createHoodieClient(jsc, cfg.basePath, schemaStr, cfg.parallelism, Option.empty(), props);
-    return client.scheduleClusteringAtInstant(cfg.clusteringInstantTime, Option.empty()) ? 0 : -1;
+    Option<String> instantTime = client.scheduleClustering(Option.empty());
+    if (instantTime.isPresent()) {
+      cfg.clusteringInstantTime = instantTime.get();
```

Review comment: Changing the config at this stage seems a little awkward. Do you think it's better to return the `Option<String> instantTime` from this method? Or maybe just add the log line here with the instantTime?
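The reviewer's second suggestion — returning the scheduled instant time instead of mutating `cfg` — might look roughly like the following. This is a simplified, self-contained sketch, not Hudi's actual API: `java.util.Optional` stands in for Hudi's `Option`, the class name is invented, and a stub replaces `client.scheduleClustering(Option.empty())`:

```java
import java.util.Optional;
import java.util.logging.Logger;

public class ClusteringScheduleSketch {
    private static final Logger LOG = Logger.getLogger(ClusteringScheduleSketch.class.getName());

    // Stub standing in for client.scheduleClustering(Option.empty()) from the PR.
    static Optional<String> scheduleClustering() {
        return Optional.of("20210105202638");
    }

    // Return the scheduled instant time to the caller instead of writing it
    // back into the job config; log it at the point where it is produced.
    static Optional<String> doSchedule() {
        Optional<String> instantTime = scheduleClustering();
        instantTime.ifPresent(t -> LOG.info("Scheduled clustering at instant " + t));
        return instantTime;
    }

    public static void main(String[] args) {
        System.out.println(doSchedule().orElse("no instant scheduled"));
    }
}
```

The caller then decides what to do with the instant time, and the config object stays immutable after parsing.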
[jira] [Updated] (HUDI-1509) Major performance degradation due to rewriting records with default values
[ https://issues.apache.org/jira/browse/HUDI-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-1509:
-
Description:
During in-house testing of the 0.5.x to 0.6.x release upgrade, I detected a performance degradation for writes into HUDI. I traced the issue to the changes in the following commit: [[HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253]

I wrote a unit test to reduce the scope of testing as follows:
# Take an existing parquet file from a production dataset (size=690MB, #records=960K)
# Read all the records from this parquet file into a JavaRDD
# Time the call to HoodieWriteClient.bulkInsertPrepped() (bulkInsertParallelism=1)

This scenario is taken directly from our production pipelines, where each executor ingests about a million records, creating a single parquet file in a COW dataset. This is a bulk-insert-only dataset. The time to complete the bulk insert prepped *decreased from 680 seconds to 380 seconds* when I reverted the above commit.

Schema details: This HUDI dataset uses a large schema with 51 fields in the record.

was: the same description, without the "Schema details" line.

> Major performance degradation due to rewriting records with default values
> --------------------------------------------------------------------------
>
>                 Key: HUDI-1509
>                 URL: https://issues.apache.org/jira/browse/HUDI-1509
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Prashant Wason
>            Priority: Blocker
>
> During the in-house testing for 0.5x to 0.6x release upgrade, I have detected a performance degradation for writes into HUDI. I have traced the issue due to the changes in the following commit [[HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253]
> I wrote a unit test to reduce the scope of testing as follows:
> # Take an existing parquet file from production dataset (size=690MB, #records=960K)
> # Read all the records from this parquet into a JavaRDD
> # Time the call HoodieWriteClient.bulkInsertPrepped(). (bulkInsertParallelism=1)
> The above scenario is directly taken from our production pipelines where each executor will ingest about a million record creating a single parquet file in a COW dataset. This is bulk insert only dataset.
> The time to complete the bulk insert prepped *decreased from 680seconds to 380seconds* when I reverted the above commit.
> Schema details: This HUDI dataset uses a large schema with 51 fields in the record.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1509) Major performance degradation due to rewriting records with default values
[ https://issues.apache.org/jira/browse/HUDI-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259323#comment-17259323 ] Prashant Wason commented on HUDI-1509:
--
I timed the various code fragments involved in the above commit. The timings are as follows:

```java
private static LinkedHashSet<Schema.Field> getCombinedFieldsToWrite(Schema oldSchema, Schema newSchema) {
  // ~75 usec average for the line below
  LinkedHashSet<Schema.Field> allFields = new LinkedHashSet<>(oldSchema.getFields());
  // ~200 usec average for the loop below
  for (Schema.Field f : newSchema.getFields()) {
    if (!allFields.contains(f) && !isMetadataField(f.name())) {
      allFields.add(f);
    }
  }
  return allFields;
}

private static GenericRecord rewrite(GenericRecord record, LinkedHashSet<Schema.Field> fieldsToWrite, Schema newSchema) {
  GenericRecord newRecord = new GenericData.Record(newSchema);
  for (Schema.Field f : fieldsToWrite) {
    if (record.get(f.name()) == null) {
      if (f.defaultVal() instanceof JsonProperties.Null) {
        newRecord.put(f.name(), null);
      } else {
        newRecord.put(f.name(), f.defaultVal());
      }
    } else {
      newRecord.put(f.name(), record.get(f.name()));
    }
  }
  // ~3 usec for the copy loop above; ~75 usec for the validation below
  if (!GenericData.get().validate(newSchema, newRecord)) {
    throw new SchemaCompatibilityException(
        "Unable to validate the rewritten record " + record + " against schema " + newSchema);
  }
  return newRecord;
}
```
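Taken at face value, those per-fragment timings roughly account for the observed regression: treating them as per-record costs over the 960K-record test file projects on the order of 340 seconds of overhead, close to the ~300 s difference (680 s vs 380 s) reported in the description. A small sketch of that arithmetic (treating the field-union cost as paid once per record is an assumption; in practice it may be amortized differently):

```java
// Back-of-envelope estimate: apply the per-fragment timings from the comment
// above to the 960K-record test file from the issue description.
// This is an illustration of the arithmetic, not a measurement.
public class RewriteOverheadEstimate {

    // Total measured overhead per record, in microseconds.
    static double perRecordMicros() {
        double fieldUnion = 75 + 200; // getCombinedFieldsToWrite: set copy + field loop
        double rewrite = 3 + 75;      // rewrite: field copy + GenericData.validate
        return fieldUnion + rewrite;  // 353 us
    }

    // Projected overhead across the whole file, in seconds.
    static double totalSeconds() {
        long records = 960_000;       // records in the 690 MB test parquet file
        return perRecordMicros() * records / 1_000_000.0;
    }

    public static void main(String[] args) {
        System.out.printf("~%.0f s projected overhead for 960K records%n", totalSeconds());
    }
}
```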
[jira] [Created] (HUDI-1509) Major performance degradation due to rewriting records with default values
Prashant Wason created HUDI-1509:
Summary: Major performance degradation due to rewriting records with default values
Key: HUDI-1509
URL: https://issues.apache.org/jira/browse/HUDI-1509
Project: Apache Hudi
Issue Type: Bug
Reporter: Prashant Wason

During in-house testing of the 0.5.x to 0.6.x release upgrade, I detected a performance degradation for writes into HUDI. I traced the issue to the changes in the following commit: [[HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253]

I wrote a unit test to reduce the scope of testing as follows:
# Take an existing parquet file from a production dataset (size=690MB, #records=960K)
# Read all the records from this parquet file into a JavaRDD
# Time the call to HoodieWriteClient.bulkInsertPrepped() (bulkInsertParallelism=1)

This scenario is taken directly from our production pipelines, where each executor ingests about a million records, creating a single parquet file in a COW dataset. This is a bulk-insert-only dataset. The time to complete the bulk insert prepped *decreased from 680 seconds to 380 seconds* when I reverted the above commit.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HUDI-1509) Major performance degradation due to rewriting records with default values
[ https://issues.apache.org/jira/browse/HUDI-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259323#comment-17259323 ] Prashant Wason edited comment on HUDI-1509 at 1/6/21, 12:52 AM: I timed the various code fragments involved in the above commit. The timings are as follows: private static LinkedHashSet getCombinedFieldsToWrite(Schema oldSchema, Schema newSchema) { LinkedHashSet allFields = new LinkedHashSet<>(oldSchema.getFields()); ** // 75usec average for this line** *// 200usec average for the lines below ** for (Schema.Field f : newSchema.getFields()) Unknown macro: \{ if (!allFields.contains(f) && !isMetadataField(f.name())) { allFields.add(f); } } return allFields; } private static GenericRecord rewrite(GenericRecord record, LinkedHashSet fieldsToWrite, Schema newSchema) { GenericRecord newRecord = new GenericData.Record(newSchema); for (Schema.Field f : fieldsToWrite) { if (record.get(f.name()) == null) { if (f.defaultVal() instanceof JsonProperties.Null) { newRecord.put(f.name(), null); } else { newRecord.put(f.name(), f.defaultVal()); } } else { newRecord.put(f.name(), record.get(f.name())); } } *// 3usec for the code above* *// 75usec for the code below* if (!GenericData.get().validate(newSchema, newRecord)) { throw new SchemaCompatibilityException( "Unable to validate the rewritten record " + record + " against schema " + newSchema); } return newRecord; } was (Author: pwason): I timed the various code fragments involved in the above commit. 
The timings are as follows: private static LinkedHashSet getCombinedFieldsToWrite(Schema oldSchema, Schema newSchema) { LinkedHashSet allFields = new LinkedHashSet<>(oldSchema.getFields()); ** // 75usec average for this line** * *// 200usec average for the lines below ** for (Schema.Field f : newSchema.getFields()) Unknown macro: \{ if (!allFields.contains(f) && !isMetadataField(f.name())) { allFields.add(f); } } return allFields; } private static GenericRecord rewrite(GenericRecord record, LinkedHashSet fieldsToWrite, Schema newSchema) { GenericRecord newRecord = new GenericData.Record(newSchema); for (Schema.Field f : fieldsToWrite) { if (record.get(f.name()) == null) { if (f.defaultVal() instanceof JsonProperties.Null) { newRecord.put(f.name(), null); } else { newRecord.put(f.name(), f.defaultVal()); } } else { newRecord.put(f.name(), record.get(f.name())); } } *// 3usec for the code above* *// 75usec for the code below* if (!GenericData.get().validate(newSchema, newRecord)) { throw new SchemaCompatibilityException( "Unable to validate the rewritten record " + record + " against schema " + newSchema); } return newRecord; } > Major performance degradation due to rewriting records with default values > -- > > Key: HUDI-1509 > URL: https://issues.apache.org/jira/browse/HUDI-1509 > Project: Apache Hudi > Issue Type: Bug >Reporter: Prashant Wason >Priority: Blocker > > During the in-house testing for 0.5x to 0.6x release upgrade, I have detected > a performance degradation for writes into HUDI. 
I have traced the issue due > to the changes in the following commit > [[HUDI-727]: Copy default values of fields if not present when rewriting > incoming record with new > schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253] > I wrote a unit test to reduce the scope of testing as follows: > # Take an existing parquet file from production dataset (size=690MB, > #records=960K) > # Read all the records from this parquet into a JavaRDD > # Time the call HoodieWriteClient.bulkInsertPrepped(). > (bulkInsertParallelism=1) > The above scenario is directly taken from our production pipelines where each > executor will ingest about a million record creating a single parquet file in > a COW dataset. This is bulk insert only dataset. > The time to complete the bulk insert prepped *decreased from 680seconds to > 380seconds* when I reverted the above commit. > Schema details: This HUDI dataset uses a large schema with 51 fields in the > record. -- This message was sent by Atlassian Jira (v8.3.4#803005)
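The regression described above comes from doing schema-combination work for every record, even though the combined field set depends only on the (oldSchema, newSchema) pair. A hedged sketch of the amortization idea in Python, using tuples of field names as simplified stand-ins for Avro schemas (the actual fix would live in Hudi's Java code; all names here are illustrative):

```python
from functools import lru_cache

# Hypothetical stand-ins: each "schema" is a tuple of field names.
METADATA_FIELDS = {"_hoodie_commit_time", "_hoodie_record_key"}

@lru_cache(maxsize=None)
def combined_fields(old_schema: tuple, new_schema: tuple) -> tuple:
    """Union of old-schema fields plus non-metadata new-schema fields,
    computed once per (old, new) schema pair instead of once per record."""
    fields = list(old_schema)
    for f in new_schema:
        if f not in fields and f not in METADATA_FIELDS:
            fields.append(f)
    return tuple(fields)

def rewrite(record: dict, old_schema: tuple, new_schema: tuple) -> dict:
    # The per-record schema work now amortizes to a cache lookup;
    # missing fields fall back to None (default handling simplified).
    return {f: record.get(f) for f in combined_fields(old_schema, new_schema)}
```

Since every record in a bulk insert shares the same schema pair, the ~275usec of per-record schema work collapses to a single computation per file.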
[GitHub] [hudi] codecov-io commented on pull request #2407: [HUDI-1507] Change timeline utils to support reading replacecommit
codecov-io commented on pull request #2407: URL: https://github.com/apache/hudi/pull/2407#issuecomment-754918776

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2407?src=pr=h1) Report

> Merging [#2407](https://codecov.io/gh/apache/hudi/pull/2407?src=pr=desc) (88ff431) into [master](https://codecov.io/gh/apache/hudi/commit/698694a1571cdcc9848fc79aa34c8cbbf9662bc4?el=desc) (698694a) will **decrease** coverage by `40.19%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2407/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2407?src=pr=tree)

```diff
@@             Coverage Diff              @@
##             master   #2407       +/-   ##
============================================
- Coverage     50.23%  10.04%    -40.20%
+ Complexity     2985      48      -2937
============================================
  Files           410      52       -358
  Lines         18398    1852     -16546
  Branches       1884     223      -1661
============================================
- Hits           9242     186      -9056
+ Misses         8398    1653      -6745
+ Partials        758      13       -745
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `10.04% <ø> (-59.62%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2407?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | | |
[jira] [Updated] (HUDI-1507) Hive sync having issues w/ Clustering
[ https://issues.apache.org/jira/browse/HUDI-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1507: - Labels: pull-request-available release-blocker (was: release-blocker) > Hive sync having issues w/ Clustering > - > > Key: HUDI-1507 > URL: https://issues.apache.org/jira/browse/HUDI-1507 > Project: Apache Hudi > Issue Type: Bug > Components: Storage Management >Affects Versions: 0.7.0 >Reporter: sivabalan narayanan >Assignee: satish >Priority: Major > Labels: pull-request-available, release-blocker > > I was trying out clustering w/ test suite job and ran into hive sync issues. > > 21/01/05 16:45:05 WARN DagNode: Executing ClusteringNode node > 5522853c-653b-4d92-acf4-d299c263a77f > 21/01/05 16:45:05 WARN AbstractHoodieWriteClient: Scheduling clustering at > instant time :20210105164505 clustering strategy > org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy, > clustering sort cols : _row_key, target partitions for clustering :: 0, > inline cluster max commit : 1 > 21/01/05 16:45:05 WARN HoodieTestSuiteWriter: Clustering instant :: > 20210105164505 > 21/01/05 16:45:22 WARN DagScheduler: Executing node "second_hive_sync" :: > \{"queue_name":"adhoc","engine":"mr","name":"80325009-bb92-4df5-8c34-71bd75d001b8","config":"second_hive_sync"} > 21/01/05 16:45:22 ERROR HiveSyncTool: Got runtime exception when hive syncing > org.apache.hudi.exception.HoodieIOException: unknown action in timeline > replacecommit > at > org.apache.hudi.common.table.timeline.TimelineUtils.lambda$getAffectedPartitions$1(TimelineUtils.java:99) > at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267) > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > at 
java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > at > org.apache.hudi.common.table.timeline.TimelineUtils.getAffectedPartitions(TimelineUtils.java:102) > at > org.apache.hudi.common.table.timeline.TimelineUtils.getPartitionsWritten(TimelineUtils.java:50) > at > org.apache.hudi.sync.common.AbstractSyncHoodieClient.getPartitionsWrittenToSince(AbstractSyncHoodieClient.java:136) > at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145) > at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncHive(DeltaSync.java:589) > at > org.apache.hudi.integ.testsuite.helpers.HiveServiceProvider.syncToLocalHiveIfNeeded(HiveServiceProvider.java:53) > at > org.apache.hudi.integ.testsuite.dag.nodes.HiveSyncNode.execute(HiveSyncNode.java:41) > at > org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139) > at > org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) -- This message was sent by Atlassian Jira (v8.3.4#803005)
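The stack trace above shows `TimelineUtils.getAffectedPartitions` raising on the `replacecommit` action that clustering writes to the timeline: the dispatch code predates clustering and rejects any action it does not recognize. A minimal Python sketch of that dispatch pattern and the fix direction. The action names are real Hudi timeline actions, but the instant representation and the set of "ignorable" actions here are simplified assumptions for illustration:

```python
class HoodieIOException(Exception):
    pass

# "replacecommit" is the timeline action written by clustering; supporting it
# in the dispatch is the missing case behind the error above.
PARTITION_WRITING_ACTIONS = {"commit", "deltacommit", "replacecommit"}
IGNORABLE_ACTIONS = {"clean", "rollback", "savepoint"}  # assumption: carry no new partitions

def get_affected_partitions(instants):
    """Collect partitions touched by completed instants; unknown actions
    still fail loudly rather than being silently dropped."""
    partitions = set()
    for action, partition_paths in instants:
        if action in PARTITION_WRITING_ACTIONS:
            partitions.update(partition_paths)
        elif action in IGNORABLE_ACTIONS:
            continue
        else:
            raise HoodieIOException("unknown action in timeline " + action)
    return partitions
```

With `replacecommit` in the recognized set, hive sync after a clustering instant can enumerate written partitions instead of aborting.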
[jira] [Updated] (HUDI-1507) Hive sync having issues w/ Clustering
[ https://issues.apache.org/jira/browse/HUDI-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-1507: -- Labels: release-blocker (was: pull-request-available) > Hive sync having issues w/ Clustering > - > > Key: HUDI-1507 > URL: https://issues.apache.org/jira/browse/HUDI-1507 > Project: Apache Hudi > Issue Type: Bug > Components: Storage Management >Affects Versions: 0.7.0 >Reporter: sivabalan narayanan >Assignee: satish >Priority: Major > Labels: release-blocker > > I was trying out clustering w/ test suite job and ran into hive sync issues. > > 21/01/05 16:45:05 WARN DagNode: Executing ClusteringNode node > 5522853c-653b-4d92-acf4-d299c263a77f > 21/01/05 16:45:05 WARN AbstractHoodieWriteClient: Scheduling clustering at > instant time :20210105164505 clustering strategy > org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy, > clustering sort cols : _row_key, target partitions for clustering :: 0, > inline cluster max commit : 1 > 21/01/05 16:45:05 WARN HoodieTestSuiteWriter: Clustering instant :: > 20210105164505 > 21/01/05 16:45:22 WARN DagScheduler: Executing node "second_hive_sync" :: > \{"queue_name":"adhoc","engine":"mr","name":"80325009-bb92-4df5-8c34-71bd75d001b8","config":"second_hive_sync"} > 21/01/05 16:45:22 ERROR HiveSyncTool: Got runtime exception when hive syncing > org.apache.hudi.exception.HoodieIOException: unknown action in timeline > replacecommit > at > org.apache.hudi.common.table.timeline.TimelineUtils.lambda$getAffectedPartitions$1(TimelineUtils.java:99) > at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267) > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > 
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > at > org.apache.hudi.common.table.timeline.TimelineUtils.getAffectedPartitions(TimelineUtils.java:102) > at > org.apache.hudi.common.table.timeline.TimelineUtils.getPartitionsWritten(TimelineUtils.java:50) > at > org.apache.hudi.sync.common.AbstractSyncHoodieClient.getPartitionsWrittenToSince(AbstractSyncHoodieClient.java:136) > at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145) > at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncHive(DeltaSync.java:589) > at > org.apache.hudi.integ.testsuite.helpers.HiveServiceProvider.syncToLocalHiveIfNeeded(HiveServiceProvider.java:53) > at > org.apache.hudi.integ.testsuite.dag.nodes.HiveSyncNode.execute(HiveSyncNode.java:41) > at > org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139) > at > org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] jtmzheng opened a new issue #2408: [SUPPORT] OutOfMemory on upserting into MOR dataset
jtmzheng opened a new issue #2408: URL: https://github.com/apache/hudi/issues/2408 **Describe the problem you faced** We have a Spark Streaming application running on EMR 5.31.0 that reads from a Kinesis stream (batch interval of 30 minutes) and upserts to a MOR dataset partitioned by date. The dataset is ~2.2 TB in size and holds ~6 billion records. Our problem is that the application is now persistently crashing with an OutOfMemory error regardless of the batch input size (the stacktrace attached below is for an input of ~1 million records, ~250 MB). For debugging we tried replacing the Hudi upsert with a simple count, after which there was minimal memory usage by the application based on the Spark UI.

```
df.write.format("org.apache.hudi").options(**hudi_options).mode("append").save(OUTPUT_PATH)
```

This seems similar to https://github.com/apache/hudi/issues/1491#issuecomment-610626104 since it appears to be running out of memory while reading old records, based on the stacktrace. There are a lot of large log files in the dataset:

```
...
2021-01-01 00:28:08 910487106 hudi/production/transactions_v9/2020/12/30/.0ef05182-e0ad-44b5-b52d-0412a6b44a01-1_20210101033018.log.1_3774-34-62086 2021-01-01 21:03:26 910490317 hudi/production/transactions_v9/2020/12/30/.0ef05182-e0ad-44b5-b52d-0412a6b44a01-1_20210101033018.log.11_3774-34-62109 2021-01-01 16:52:39 910495970 hudi/production/transactions_v9/2020/12/30/.0ef05182-e0ad-44b5-b52d-0412a6b44a01-1_20210101033018.log.9_3774-34-62083 ``` Our Hudi configs are: ``` hudi_options = { "hoodie.table.name": "transactions", "hoodie.datasource.write.recordkey.field": "id.value", "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator", "hoodie.datasource.write.partitionpath.field": "year,month,day", "hoodie.datasource.write.table.name": "transactions", "hoodie.datasource.write.table.type": "MERGE_ON_READ", "hoodie.datasource.write.operation": "upsert", "hoodie.consistency.check.enabled": "true", "hoodie.datasource.write.precombine.field": "publishedAtUnixNano", "hoodie.compact.inline": True, "hoodie.compact.inline.max.delta.commits": 10, "hoodie.cleaner.commits.retained": 1, } ``` Our Spark configs are (CORES_PER_EXECUTOR = 5, NUM_EXECUTORS = 30) for a cluster running on 10 r5.4xlarge instances: ``` "spark.serializer": "org.apache.spark.serializer.KryoSerializer", "spark.sql.hive.convertMetastoreParquet": "false", # Recommended from https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/ "spark.executor.cores": CORES_PER_EXECUTOR, "spark.executor.memory": "36g", "spark.yarn.executor.memoryOverhead": "4g", "spark.driver.cores": CORES_PER_EXECUTOR, "spark.driver.memory": "36g", "spark.executor.instances": NUM_EXECUTORS, "spark.default.parallelism": NUM_EXECUTORS * CORES_PER_EXECUTOR * 2, "spark.dynamicAllocation.enabled": "false", "spark.streaming.dynamicAllocation.enabled": "false", "spark.streaming.backpressure.enabled": "true", # Set max rate limit per 
receiver to limit memory usage "spark.streaming.receiver.maxRate": "10", # Shutdown gracefully on JVM shutdown "spark.streaming.stopGracefullyOnShutdown": "true", # GC options (https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html) "spark.executor.defaultJavaOptions": "-XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12", "spark.driver.defaultJavaOptions": "-XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12", "spark.streaming.kinesis.retry.waitTime": "1000ms", # default is 100 ms "spark.streaming.kinesis.retry.maxAttempts": "10", # default is 10 # Set max retries on S3 rate limits "spark.hadoop.fs.s3.maxRetries": "20", ``` **Environment Description** * Hudi version : 0.6.0 * Spark version : 2.4.6 (EMR 5.31.0) * Hive version : Hive 2.3.7 * Hadoop version : Amazon 2.10.0 * Storage (HDFS/S3/GCS..) : S3 * Running
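A quick sanity check of the reported cluster sizing. This assumes r5.4xlarge instances provide 16 vCPU / 128 GiB each (the instance specs are an assumption on my part, not stated in the report):

```python
# Cross-check the executor memory claim against per-node capacity.
instances = 10
num_executors = 30
executor_mem_gib = 36   # spark.executor.memory
overhead_gib = 4        # spark.yarn.executor.memoryOverhead

executors_per_node = num_executors // instances
claimed_per_node_gib = executors_per_node * (executor_mem_gib + overhead_gib)
print(executors_per_node, claimed_per_node_gib)  # 3 executors claiming 120 of ~128 GiB
```

If those specs hold, the configuration leaves little headroom per node for the OS and YARN services, so additional heap pressure from merging the ~900 MB log files listed above could plausibly tip executors into OOM.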
[jira] [Updated] (HUDI-1509) Major performance degradation due to rewriting records with default values
[ https://issues.apache.org/jira/browse/HUDI-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-1509: - Fix Version/s: 0.7.0 > Major performance degradation due to rewriting records with default values > -- > > Key: HUDI-1509 > URL: https://issues.apache.org/jira/browse/HUDI-1509 > Project: Apache Hudi > Issue Type: Bug >Reporter: Prashant Wason >Priority: Blocker > Fix For: 0.7.0 > > > During the in-house testing for 0.5x to 0.6x release upgrade, I have detected > a performance degradation for writes into HUDI. I have traced the issue due > to the changes in the following commit > [[HUDI-727]: Copy default values of fields if not present when rewriting > incoming record with new > schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253] > I wrote a unit test to reduce the scope of testing as follows: > # Take an existing parquet file from production dataset (size=690MB, > #records=960K) > # Read all the records from this parquet into a JavaRDD > # Time the call HoodieWriteClient.bulkInsertPrepped(). > (bulkInsertParallelism=1) > The above scenario is directly taken from our production pipelines where each > executor will ingest about a million record creating a single parquet file in > a COW dataset. This is bulk insert only dataset. > The time to complete the bulk insert prepped *decreased from 680seconds to > 380seconds* when I reverted the above commit. > Schema details: This HUDI dataset uses a large schema with 51 fields in the > record. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1509) Major performance degradation due to rewriting records with default values
[ https://issues.apache.org/jira/browse/HUDI-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259324#comment-17259324 ] Prashant Wason commented on HUDI-1509: -- So calling getCombinedFieldsToWrite() adds 275usec to the writing of each record to ParquetWriter, which itself takes approximately 200usec per write call. Everything else remaining the same, this nearly doubles the time to complete the ingestion. > Major performance degradation due to rewriting records with default values > -- > > Key: HUDI-1509 > URL: https://issues.apache.org/jira/browse/HUDI-1509 > Project: Apache Hudi > Issue Type: Bug >Reporter: Prashant Wason >Priority: Blocker > > During the in-house testing for 0.5x to 0.6x release upgrade, I have detected > a performance degradation for writes into HUDI. I have traced the issue due > to the changes in the following commit > [[HUDI-727]: Copy default values of fields if not present when rewriting > incoming record with new > schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253] > I wrote a unit test to reduce the scope of testing as follows: > # Take an existing parquet file from production dataset (size=690MB, > #records=960K) > # Read all the records from this parquet into a JavaRDD > # Time the call HoodieWriteClient.bulkInsertPrepped(). > (bulkInsertParallelism=1) > The above scenario is directly taken from our production pipelines where each > executor will ingest about a million record creating a single parquet file in > a COW dataset. This is bulk insert only dataset. > The time to complete the bulk insert prepped *decreased from 680seconds to > 380seconds* when I reverted the above commit. > Schema details: This HUDI dataset uses a large schema with 51 fields in the > record. -- This message was sent by Atlassian Jira (v8.3.4#803005)
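The measured per-record overhead is consistent with the end-to-end regression reported in the issue description (960K records, 680s with the commit vs 380s without it), which is a useful cross-check of the instrumentation:

```python
# Cross-check: per-record overhead times record count vs observed regression.
records = 960_000                 # records in the test parquet file
overhead_per_record_us = 275      # measured: 200usec + 75usec per record
total_overhead_s = records * overhead_per_record_us / 1e6
print(total_overhead_s)  # 264.0 seconds, close to the observed 680 - 380 = 300 s
```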
[jira] [Updated] (HUDI-1509) Major performance degradation due to rewriting records with default values
[ https://issues.apache.org/jira/browse/HUDI-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-1509: - Affects Version/s: 0.7.0 0.6.1 0.6.0 > Major performance degradation due to rewriting records with default values > -- > > Key: HUDI-1509 > URL: https://issues.apache.org/jira/browse/HUDI-1509 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.6.0, 0.6.1, 0.7.0 >Reporter: Prashant Wason >Priority: Blocker > Fix For: 0.7.0 > > > During the in-house testing for 0.5x to 0.6x release upgrade, I have detected > a performance degradation for writes into HUDI. I have traced the issue due > to the changes in the following commit > [[HUDI-727]: Copy default values of fields if not present when rewriting > incoming record with new > schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253] > I wrote a unit test to reduce the scope of testing as follows: > # Take an existing parquet file from production dataset (size=690MB, > #records=960K) > # Read all the records from this parquet into a JavaRDD > # Time the call HoodieWriteClient.bulkInsertPrepped(). > (bulkInsertParallelism=1) > The above scenario is directly taken from our production pipelines where each > executor will ingest about a million record creating a single parquet file in > a COW dataset. This is bulk insert only dataset. > The time to complete the bulk insert prepped *decreased from 680seconds to > 380seconds* when I reverted the above commit. > Schema details: This HUDI dataset uses a large schema with 51 fields in the > record. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1509) Major performance degradation due to rewriting records with default values
[ https://issues.apache.org/jira/browse/HUDI-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259332#comment-17259332 ] Nishith Agarwal commented on HUDI-1509: --- [~pwason] Thanks for digging into this and instrumenting the code in detail. So the difference between the hash lookup of `record.get(f.name())` vs `{ if (!allFields.contains(f) && !isMetadataField(f.name()))` is 197 usec? > Major performance degradation due to rewriting records with default values > -- > > Key: HUDI-1509 > URL: https://issues.apache.org/jira/browse/HUDI-1509 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.6.0, 0.6.1, 0.7.0 >Reporter: Prashant Wason >Priority: Blocker > Fix For: 0.7.0 > > > During the in-house testing for 0.5x to 0.6x release upgrade, I have detected > a performance degradation for writes into HUDI. I have traced the issue due > to the changes in the following commit > [[HUDI-727]: Copy default values of fields if not present when rewriting > incoming record with new > schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253] > I wrote a unit test to reduce the scope of testing as follows: > # Take an existing parquet file from production dataset (size=690MB, > #records=960K) > # Read all the records from this parquet into a JavaRDD > # Time the call HoodieWriteClient.bulkInsertPrepped(). > (bulkInsertParallelism=1) > The above scenario is directly taken from our production pipelines where each > executor will ingest about a million record creating a single parquet file in > a COW dataset. This is bulk insert only dataset. > The time to complete the bulk insert prepped *decreased from 680seconds to > 380seconds* when I reverted the above commit. > Schema details: This HUDI dataset uses a large schema with 51 fields in the > record. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] lw309637554 commented on a change in pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering
lw309637554 commented on a change in pull request #2379: URL: https://github.com/apache/hudi/pull/2379#discussion_r552329734 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java ## @@ -153,7 +161,12 @@ private int doSchedule(JavaSparkContext jsc) throws Exception { String schemaStr = getSchemaFromLatestInstant(); SparkRDDWriteClient client = UtilHelpers.createHoodieClient(jsc, cfg.basePath, schemaStr, cfg.parallelism, Option.empty(), props); -return client.scheduleClusteringAtInstant(cfg.clusteringInstantTime, Option.empty()) ? 0 : -1; +Option instantTime = client.scheduleClustering(Option.empty()); +if (instantTime.isPresent()) { + cfg.clusteringInstantTime = instantTime.get(); Review comment: done, also add a method for test This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] wosow opened a new issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables
wosow opened a new issue #2409: URL: https://github.com/apache/hudi/issues/2409 Spark Structured Streaming writes to Hudi and, when syncing to Hive, creates only the read-optimized table without creating the real-time table; no errors are reported. **Environment Description** * Hudi version : 0.6.0 * Spark version : 2.4.4 * Hive version : 2.3.7 * Hadoop version : 2.7.5 * Storage (HDFS/S3/GCS..) : HDFS * Running on Docker? (yes/no) : no Code as follows:

```
batchDF.write.format("org.apache.hudi")
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, "MERGE_ON_READ")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, "upsert")
  .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, "10")
  .option("hoodie.datasource.compaction.async.enable", "true")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "rec_id")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "modified")
  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "ads")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, hiveTableName)
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "dt")
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "dt")
  .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
  .option(HoodieWriteConfig.TABLE_NAME, hiveTableName)
  .option(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH, "true")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, "jdbc:hive2://0.0.0.0:1")
  .option(DataSourceWriteOptions.HIVE_USER_OPT_KEY, "")
  .option(DataSourceWriteOptions.HIVE_PASS_OPT_KEY, "")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, classOf[MultiPartKeysValueExtractor].getName)
  .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.GLOBAL_BLOOM.name())
  .option("hoodie.insert.shuffle.parallelism", "10")
  .option("hoodie.upsert.shuffle.parallelism", "10")
  .mode("append")
  .save("/data/mor/user")
```

Only the user_ro table is created; there is no user_rt. This is an automated message from
the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yanghua commented on pull request #2405: [HUDI-1506] Fix wrong exception thrown in HoodieAvroUtils
yanghua commented on pull request #2405: URL: https://github.com/apache/hudi/pull/2405#issuecomment-754997547 @wangxianghu Please check Travis again. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] lw309637554 commented on a change in pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering
lw309637554 commented on a change in pull request #2379: URL: https://github.com/apache/hudi/pull/2379#discussion_r552329605 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java ## @@ -109,6 +111,9 @@ public static void main(String[] args) { if (result == -1) { LOG.error(resultMsg + " failed"); } else { + if (cfg.runSchedule) { +System.out.println("The schedule instant time is " + cfg.clusteringInstantTime); Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] lw309637554 commented on a change in pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering
lw309637554 commented on a change in pull request #2379: URL: https://github.com/apache/hudi/pull/2379#discussion_r552318635 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java ## @@ -109,6 +111,9 @@ public static void main(String[] args) { if (result == -1) { LOG.error(resultMsg + " failed"); } else { + if (cfg.runSchedule) { +System.out.println("The schedule instant time is " + cfg.clusteringInstantTime); Review comment: I think System.out.println is better here: printing the instant time with System.out.println sends the string to the Spark stdout log in the UI, whereas LOG would by default send it to stderr along with many other log lines. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yanghua commented on pull request #2375: [HUDI-1332] Introduce FlinkHoodieBloomIndex to hudi-flink-client
yanghua commented on pull request #2375: URL: https://github.com/apache/hudi/pull/2375#issuecomment-755056830 > > > Hi @garyli1019. Maybe I think the current implementation is OK. Because even in a streaming job, we need to accumulate batch records in memory during the checkpoint cycle and upsert data into the hudi table when the checkpoint triggers. WDYT? > > > > > > @Nieal-Yang sorry about the late reply. IMO the batch mode will work but does not fully take advantage of Flink. But we can optimize this step by step. To ensure this PR will work, would you add some unit tests to this PR? > > okay. This index has been used in production in my company. We will optimize it gradually Hi @Nieal-Yang Would you please tell me what Flink version you are using? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] ivorzhou commented on pull request #2091: HUDI-1283 Fill missing columns with default value when spark dataframe save to hudi table
ivorzhou commented on pull request #2091: URL: https://github.com/apache/hudi/pull/2091#issuecomment-755121964 > @ivorzhou : is the requirement to set default value or value from previous version of the record? if previous version of the record, then guess we already have another PR for this #2106 I want to update a record's specified fields, like a SQL update: UPDATE col='value' WHERE id=1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-1510) Move HoodieEngineContext to hudi-common module
[ https://issues.apache.org/jira/browse/HUDI-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Udit Mehrotra updated HUDI-1510: Component/s: (was: Writer Core) (was: Common Core) Code Cleanup Fix Version/s: 0.7.0 > Move HoodieEngineContext to hudi-common module > -- > > Key: HUDI-1510 > URL: https://issues.apache.org/jira/browse/HUDI-1510 > Project: Apache Hudi > Issue Type: Bug > Components: Code Cleanup >Reporter: Udit Mehrotra >Assignee: Udit Mehrotra >Priority: Major > Fix For: 0.7.0 > > > We want to parallelize file system listings in Hudi, which after RFC-15 > implementation will be happening through class *FileSystemBackedMetadata* > which is in hudi-common. Same with FSUtils which sits in hudi-common. Thus, > we want to move HoodieEngineContext to hudi-common to be able to parallelize > the file/partition listing APIs using the engine. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] umehrot2 opened a new pull request #2410: [HUDI-1510] Move HoodieEngineContext and its dependencies to hudi-common
umehrot2 opened a new pull request #2410: URL: https://github.com/apache/hudi/pull/2410 ## What is the purpose of the pull request Moves the HoodieEngineContext class and its dependencies to hudi-common, so that we can parallelize fetching of files and partitions in FileSystemBackedTableMetadata and FSUtils. ## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
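The kind of parallel file/partition listing this PR enables can be pictured with a minimal, self-contained Java sketch. It uses a plain ExecutorService in place of the actual HoodieEngineContext API; the class and method names here are hypothetical, and only the general idea (fanning listing calls out across workers instead of looping serially) reflects the PR's stated goal.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelListing {
    // List the contents of many partition directories in parallel instead of
    // serially on the driver -- the pattern an engine context generalizes
    // across Spark/Flink/local engines.
    static Map<Path, List<Path>> listPartitions(List<Path> partitions, int parallelism)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            Map<Path, Future<List<Path>>> futures = new LinkedHashMap<>();
            for (Path partition : partitions) {
                futures.put(partition, pool.submit(() -> {
                    try (Stream<Path> files = Files.list(partition)) {
                        return files.collect(Collectors.toList());
                    }
                }));
            }
            Map<Path, List<Path>> result = new LinkedHashMap<>();
            for (Map.Entry<Path, Future<List<Path>>> e : futures.entrySet()) {
                result.put(e.getKey(), e.getValue().get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path root = Files.createTempDirectory("partitions");
        Path p1 = Files.createDirectory(root.resolve("2021-01-01"));
        Files.createFile(p1.resolve("file1.parquet"));
        Map<Path, List<Path>> listed = listPartitions(List.of(p1), 4);
        System.out.println(listed.get(p1).size());
    }
}
```

Moving the engine context into hudi-common means classes like FSUtils, which live there, can accept it as a parameter rather than depending on a specific engine module.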
[jira] [Updated] (HUDI-1510) Move HoodieEngineContext to hudi-common module
[ https://issues.apache.org/jira/browse/HUDI-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1510: - Labels: pull-request-available (was: ) > Move HoodieEngineContext to hudi-common module > -- > > Key: HUDI-1510 > URL: https://issues.apache.org/jira/browse/HUDI-1510 > Project: Apache Hudi > Issue Type: Bug > Components: Code Cleanup >Reporter: Udit Mehrotra >Assignee: Udit Mehrotra >Priority: Major > Labels: pull-request-available > Fix For: 0.7.0 > > > We want to parallelize file system listings in Hudi, which after RFC-15 > implementation will be happening through class *FileSystemBackedMetadata* > which is in hudi-common. Same with FSUtils which sits in hudi-common. Thus, > we want to move HoodieEngineContext to hudi-common to be able to parallelize > the file/partition listing APIs using the engine.
[GitHub] [hudi] codecov-io edited a comment on pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering
codecov-io edited a comment on pull request #2379: URL: https://github.com/apache/hudi/pull/2379#issuecomment-751244130 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=h1) Report > Merging [#2379](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=desc) (d53595e) into [master](https://codecov.io/gh/apache/hudi/commit/6cdf59d92b1c260abae82bba7d30d8ac280bddbf?el=desc) (6cdf59d) will **decrease** coverage by `42.57%`. > The diff coverage is `0.00%`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2379/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=tree)

```diff
@@              Coverage Diff              @@
##             master   #2379       +/-   ##
============================================
- Coverage     52.23%   9.65%    -42.58%
+ Complexity     2662      48      -2614
============================================
  Files           335      53       -282
  Lines         14981    1926     -13055
  Branches       1506     231      -1275
============================================
- Hits           7825     186      -7639
+ Misses         6533    1727      -4806
+ Partials        623      13       -610
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.65% <0.00%> (-60.00%)` | `0.00 <0.00> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...org/apache/hudi/utilities/HoodieClusteringJob.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZUNsdXN0ZXJpbmdKb2IuamF2YQ==) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | | | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `0.00% <0.00%> (-70.76%)` | `0.00 <0.00> (-49.00)` | | | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | ... and [312 more](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree-more)
[jira] [Created] (HUDI-1510) Move HoodieEngineContext to hudi-common module
Udit Mehrotra created HUDI-1510: --- Summary: Move HoodieEngineContext to hudi-common module Key: HUDI-1510 URL: https://issues.apache.org/jira/browse/HUDI-1510 Project: Apache Hudi Issue Type: Bug Components: Common Core, Writer Core Reporter: Udit Mehrotra Assignee: Udit Mehrotra We want to parallelize file system listings in Hudi, which after RFC-15 implementation will be happening through class *FileSystemBackedMetadata* which is in hudi-common. Same with FSUtils which sits in hudi-common. Thus, we want to move HoodieEngineContext to hudi-common to be able to parallelize the file/partition listing APIs using the engine.
[jira] [Updated] (HUDI-1510) Move HoodieEngineContext to hudi-common module
[ https://issues.apache.org/jira/browse/HUDI-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Udit Mehrotra updated HUDI-1510: Issue Type: Improvement (was: Bug) > Move HoodieEngineContext to hudi-common module > -- > > Key: HUDI-1510 > URL: https://issues.apache.org/jira/browse/HUDI-1510 > Project: Apache Hudi > Issue Type: Improvement > Components: Code Cleanup >Reporter: Udit Mehrotra >Assignee: Udit Mehrotra >Priority: Major > Labels: pull-request-available > Fix For: 0.7.0 > > > We want to parallelize file system listings in Hudi, which after RFC-15 > implementation will be happening through class *FileSystemBackedMetadata* > which is in hudi-common. Same with FSUtils which sits in hudi-common. Thus, > we want to move HoodieEngineContext to hudi-common to be able to parallelize > the file/partition listing APIs using the engine.
[GitHub] [hudi] codecov-io edited a comment on pull request #2405: [HUDI-1506] Fix wrong exception thrown in HoodieAvroUtils
codecov-io edited a comment on pull request #2405: URL: https://github.com/apache/hudi/pull/2405#issuecomment-754665459 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2405?src=pr=h1) Report > Merging [#2405](https://codecov.io/gh/apache/hudi/pull/2405?src=pr=desc) (9bded33) into [master](https://codecov.io/gh/apache/hudi/commit/698694a1571cdcc9848fc79aa34c8cbbf9662bc4?el=desc) (698694a) will **decrease** coverage by `40.19%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2405/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2405?src=pr=tree)

```diff
@@              Coverage Diff              @@
##             master   #2405       +/-   ##
============================================
- Coverage     50.23%  10.04%    -40.20%
+ Complexity     2985      48      -2937
============================================
  Files           410      52       -358
  Lines         18398    1852     -16546
  Branches       1884     223      -1661
============================================
- Hits           9242     186      -9056
+ Misses         8398    1653      -6745
+ Partials        758      13       -745
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `10.04% <ø> (-59.62%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2405?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | | | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) |
[GitHub] [hudi] Nieal-Yang commented on pull request #2375: [HUDI-1332] Introduce FlinkHoodieBloomIndex to hudi-flink-client
Nieal-Yang commented on pull request #2375: URL: https://github.com/apache/hudi/pull/2375#issuecomment-755103932 > > > > Hi @garyli1019. I think the current implementation is OK. Because even in a streaming job, we need to accumulate batch records in memory during the checkpoint cycle and upsert the data into the hudi table when the checkpoint triggers. WDYT? > > > > > > > > > @Nieal-Yang sorry about the late reply. IMO the batch mode will work but does not take full advantage of Flink. But we can optimize this step by step. To ensure this PR will work, would you add some unit tests to this PR? > > > > > > okay. This index has been used in production at my company. We will optimize it gradually > > Hi @Nieal-Yang Would you please tell me what Flink version you are using? @yanghua Yeah. The version we are using is flink-1.11.2.
[GitHub] [hudi] codecov-io edited a comment on pull request #2405: [HUDI-1506] Fix wrong exception thrown in HoodieAvroUtils
codecov-io edited a comment on pull request #2405: URL: https://github.com/apache/hudi/pull/2405#issuecomment-754665459