[GitHub] [hudi] hughfdjackson commented on issue #2265: Arrays with nulls in them result in broken parquet files

2021-01-05 Thread GitBox


hughfdjackson commented on issue #2265:
URL: https://github.com/apache/hudi/issues/2265#issuecomment-754502750


   @umehrot2 
   
   Thanks for looking into this - I'm taking a bit of hope from the error message in the code you linked ;) 
   
   
![image](https://user-images.githubusercontent.com/545689/103626620-5c474380-4f34-11eb-8ad9-a309b10b6d36.png)
   
   Have you had any further thoughts on this issue?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-913) Update docs about KeyGenerator

2021-01-05 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-913:
--
Status: Open  (was: New)

> Update docs about KeyGenerator
> --
>
> Key: HUDI-913
> URL: https://issues.apache.org/jira/browse/HUDI-913
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
>
> update default values about `KeyGenerator`
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-913) Update docs about KeyGenerator

2021-01-05 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-913:
--
Fix Version/s: 0.7.0

> Update docs about KeyGenerator
> --
>
> Key: HUDI-913
> URL: https://issues.apache.org/jira/browse/HUDI-913
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> update default values about `KeyGenerator`
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-913) Update docs about KeyGenerator

2021-01-05 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-913.
-
Resolution: Done

> Update docs about KeyGenerator
> --
>
> Key: HUDI-913
> URL: https://issues.apache.org/jira/browse/HUDI-913
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> update default values about `KeyGenerator`
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] codecov-io edited a comment on pull request #2374: [HUDI-845] Added locking capability to allow multiple writers

2021-01-05 Thread GitBox


codecov-io edited a comment on pull request #2374:
URL: https://github.com/apache/hudi/pull/2374#issuecomment-750782300


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=h1) Report
   > Merging 
[#2374](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=desc) (21792c6) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/698694a1571cdcc9848fc79aa34c8cbbf9662bc4?el=desc)
 (698694a) will **decrease** coverage by `40.19%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2374/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2374       +/-   ##
   =============================================
   - Coverage     50.23%    10.04%    -40.20%
   + Complexity     2985        48      -2937
   =============================================
     Files           410        52       -358
     Lines         18398      1852     -16546
     Branches       1884       223      -1661
   =============================================
   - Hits           9242       186      -9056
   + Misses         8398      1653      -6745
   + Partials        758        13       -745
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `10.04% <ø> (-59.62%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | 

[GitHub] [hudi] wangxianghu opened a new pull request #2405: [HUDI-1506] Fix wrong exception thrown in HoodieAvroUtils

2021-01-05 Thread GitBox


wangxianghu opened a new pull request #2405:
URL: https://github.com/apache/hudi/pull/2405


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1506) Fix wrong exception thrown in HoodieAvroUtils

2021-01-05 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1506:
--
Description: 
{code:java}
//
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 4 in stage 4.0 failed 4 times, most recent failure: Lost task 4.3 in stage 
4.0 (TID 24, al-prd-dtp-data-lake-10-0-88-26, executor 4): 
org.apache.hudi.exception.HoodieException: etlDatetime(Part -etlDatetime) field 
not found in record. Acceptable fields were :[vin, uuid, commercialType, 
businessType, vehicleNo, plateColor, vehicleColor, engineId, nextFixDate, 
feePrintId, transArea, createTime, updateTime, registerDate, curVehicleNo, 
reportVehicleNo, model, checkDate, certifyDateA, certifyDateB, certificate, 
transAgency, transAgencyNet, transDateStart, transDateStop, insurCom, insurNum, 
insurType, insurCount, insurEff, insurExp, insurCreateTime, insurUpdateTime, 
curVehicleCertno, reportVehicleCertno, seats, brand, vehicleType, fuelType, 
engineDisplace, photo, enginePower, gpsBrand, gpsModel, gpsImei, 
gpsInstallDate, curDriverUuid, reportDrivers, curTimeOn, curTimeOff, timeFrom, 
timeTo, ownerName, fixState, checkState, photoId, photoIdUrl, fareType, 
wheelBase, vehicleTec, vehicleSafe, lesseeName, lesseeCode, sdcOperationType, 
hivePartition, etlDatetime]
{code}
We can see that `etlDatetime` does exist; the exception is actually caused by a null value.
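A minimal sketch of the failure mode, under stated assumptions (the schema and the null check below are illustrative, not Hudi's actual lookup code): Avro's `GenericRecord.get()` returns null both for an absent field and for a present field whose value is null, so a lookup that only checks the returned value reports the misleading "field not found" error above.
{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class NullFieldVsMissingField {
  public static void main(String[] args) {
    Schema schema = SchemaBuilder.record("Vehicle").fields()
        .optionalString("etlDatetime")        // nullable union ["null", "string"]
        .endRecord();
    GenericRecord record = new GenericData.Record(schema);
    record.put("etlDatetime", null);          // the field exists; its value is null

    // Buggy pattern: a null value is indistinguishable from an absent field here,
    // so the caller reports "etlDatetime field not found in record".
    if (record.get("etlDatetime") == null) {
      System.out.println("misleading: field reported as not found");
    }
    // Safer pattern: consult the schema to tell the two cases apart.
    if (record.getSchema().getField("etlDatetime") != null) {
      System.out.println("field is present; its value just happens to be null");
    }
  }
}
{code}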

  was:Caused by: org.apache.spark.SparkException: Job aborted due to stage 
failure: Task 4 in stage 4.0 failed 4 times, most recent failure: Lost task 4.3 
in stage 4.0 (TID 24, al-prd-dtp-data-lake-10-0-88-26, executor 4): 
org.apache.hudi.exception.HoodieException: etlDatetime(Part -etlDatetime) field 
not found in record. Acceptable fields were :[vin, uuid, commercialType, 
businessType, vehicleNo, plateColor, vehicleColor, engineId, nextFixDate, 
feePrintId, transArea, createTime, updateTime, registerDate, curVehicleNo, 
reportVehicleNo, model, checkDate, certifyDateA, certifyDateB, certificate, 
transAgency, transAgencyNet, transDateStart, transDateStop, insurCom, insurNum, 
insurType, insurCount, insurEff, insurExp, insurCreateTime, insurUpdateTime, 
curVehicleCertno, reportVehicleCertno, seats, brand, vehicleType, fuelType, 
engineDisplace, photo, enginePower, gpsBrand, gpsModel, gpsImei, 
gpsInstallDate, curDriverUuid, reportDrivers, curTimeOn, curTimeOff, timeFrom, 
timeTo, ownerName, fixState, checkState, photoId, photoIdUrl, fareType, 
wheelBase, vehicleTec, vehicleSafe, lesseeName, lesseeCode, sdcOperationType, 
hivePartition, etlDatetime]


> Fix wrong exception thrown in HoodieAvroUtils
> -
>
> Key: HUDI-1506
> URL: https://issues.apache.org/jira/browse/HUDI-1506
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>
> {code:java}
> //
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 4 in stage 4.0 failed 4 times, most recent failure: Lost task 4.3 in 
> stage 4.0 (TID 24, al-prd-dtp-data-lake-10-0-88-26, executor 4): 
> org.apache.hudi.exception.HoodieException: etlDatetime(Part -etlDatetime) 
> field not found in record. Acceptable fields were :[vin, uuid, 
> commercialType, businessType, vehicleNo, plateColor, vehicleColor, engineId, 
> nextFixDate, feePrintId, transArea, createTime, updateTime, registerDate, 
> curVehicleNo, reportVehicleNo, model, checkDate, certifyDateA, certifyDateB, 
> certificate, transAgency, transAgencyNet, transDateStart, transDateStop, 
> insurCom, insurNum, insurType, insurCount, insurEff, insurExp, 
> insurCreateTime, insurUpdateTime, curVehicleCertno, reportVehicleCertno, 
> seats, brand, vehicleType, fuelType, engineDisplace, photo, enginePower, 
> gpsBrand, gpsModel, gpsImei, gpsInstallDate, curDriverUuid, reportDrivers, 
> curTimeOn, curTimeOff, timeFrom, timeTo, ownerName, fixState, checkState, 
> photoId, photoIdUrl, fareType, wheelBase, vehicleTec, vehicleSafe, 
> lesseeName, lesseeCode, sdcOperationType, hivePartition, etlDatetime]
> {code}
> We can see that `etlDatetime` does exist; the exception is actually caused by a null value.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] wangxianghu commented on pull request #2405: [HUDI-1506] Fix wrong exception thrown in HoodieAvroUtils

2021-01-05 Thread GitBox


wangxianghu commented on pull request #2405:
URL: https://github.com/apache/hudi/pull/2405#issuecomment-754644106


   @yanghua please take a look when you are free



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on pull request #2404: [MINOR] Add Jira URL and Mailing List

2021-01-05 Thread GitBox


wangxianghu commented on pull request #2404:
URL: https://github.com/apache/hudi/pull/2404#issuecomment-754643863


   @yanghua please take a look when you are free



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] SureshK-T2S opened a new issue #2406: [SUPPORT] Deltastreamer - Property hoodie.datasource.write.partitionpath.field not found

2021-01-05 Thread GitBox


SureshK-T2S opened a new issue #2406:
URL: https://github.com/apache/hudi/issues/2406


   I am attempting to create a Hudi table from a parquet file on S3. The 
motivation for this approach comes from this Hudi blog post: 
   
https://cwiki.apache.org/confluence/display/HUDI/2020/01/20/Change+Capture+Using+AWS+Database+Migration+Service+and+Hudi
   
   To first try Deltastreamer on a full initial batch load, I used the parquet 
files from an AWS blog post, located at 
s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1/
   
https://aws.amazon.com/blogs/aws/new-insert-update-delete-data-on-s3-with-amazon-emr-and-apache-hudi/
   
   First, I used the Spark shell on EMR to load the data into a dataframe and 
view it; this works with no issues:
   
   
![image](https://user-images.githubusercontent.com/17935082/103654795-59783d00-4f8c-11eb-86bd-a1f0cd7db3f3.png)
   
   I then attempted to use Hudi Deltastreamer, as per my understanding of the 
documentation; however, I ran into a couple of issues.
   
   
   Steps to reproduce the behavior:
   
   1. Ran the following:
   ```
   spark-submit --class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  \
 --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.4\
 --master yarn --deploy-mode client \
   /usr/lib/hudi/hudi-utilities-bundle.jar --table-type MERGE_ON_READ \
 --source-ordering-field request_timestamp \
 --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
 --target-base-path s3://mysqlcdc-stream-prod/hudi_tryout/hudi_aws_test 
--target-table hudi_aws_test \
   --hoodie-conf 
hoodie.datasource.write.recordkey.field=request_timestamp,hoodie.deltastreamer.source.dfs.root=s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1,hoodie.datasource.write.partitionpath.field=request_timestamp:TIMESTAMP
   ```
   Stacktrace:
   ```
   Exception in thread "main" java.io.IOException: Could not load key generator class org.apache.hudi.keygen.SimpleKeyGenerator
       at org.apache.hudi.DataSourceUtils.createKeyGenerator(DataSourceUtils.java:94)
       at org.apache.hudi.utilities.deltastreamer.DeltaSync.<init>(DeltaSync.java:190)
       at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:552)
       at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:129)
       at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:99)
       at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:464)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
       at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
       at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
       at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
       at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
       at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
       at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
       at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class 
       at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:89)
       at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:98)
       at org.apache.hudi.DataSourceUtils.createKeyGenerator(DataSourceUtils.java:92)
       ... 17 more
   Caused by: java.lang.reflect.InvocationTargetException
       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
       at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
       at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
       at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
       at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:87)
       ... 19 more
   Caused by: java.lang.IllegalArgumentException: Property hoodie.datasource.write.partitionpath.field not found
       at org.apache.hudi.common.config.TypedProperties.checkKey(TypedProperties.java:42)
       at org.apache.hudi.common.config.TypedProperties.getString(TypedProperties.java:47)
       at org.apache.hudi.keygen.SimpleKeyGenerator.<init>(SimpleKeyGenerator.java:36)
       ... 24 more
   ```
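   A likely cause, hedged: DeltaStreamer's `--hoodie-conf` flag does not split on commas (its parameter is declared with `IdentitySplitter`, which appears in the coverage listings above), so the three comma-joined settings most likely arrive as one malformed property and `hoodie.datasource.write.partitionpath.field` is never set. Passing each property as its own `--hoodie-conf` flag, or via a `--props` file, should avoid this. Below is a minimal sketch of the lookup that fails; the property keys come from the stacktrace, while the class and values are illustrative:
   ```java
   import org.apache.hudi.common.config.TypedProperties;

   public class KeyGenProps {
     public static void main(String[] args) {
       TypedProperties props = new TypedProperties();
       // Set each property separately -- do not comma-join several key=value
       // pairs into a single --hoodie-conf argument.
       props.setProperty("hoodie.datasource.write.recordkey.field", "request_timestamp");
       props.setProperty("hoodie.datasource.write.partitionpath.field", "request_timestamp");
       // TypedProperties.getString(key) throws IllegalArgumentException
       // ("Property ... not found") when the key is absent -- exactly the
       // stacktrace above, raised from SimpleKeyGenerator's constructor.
       System.out.println(props.getString("hoodie.datasource.write.partitionpath.field"));
     }
   }
   ```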
   
   

[GitHub] [hudi] codecov-io commented on pull request #2404: [MINOR] Add Jira URL and Mailing List

2021-01-05 Thread GitBox


codecov-io commented on pull request #2404:
URL: https://github.com/apache/hudi/pull/2404#issuecomment-754638146


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2404?src=pr=h1) Report
   > Merging 
[#2404](https://codecov.io/gh/apache/hudi/pull/2404?src=pr=desc) (fdeb851) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/698694a1571cdcc9848fc79aa34c8cbbf9662bc4?el=desc)
 (698694a) will **decrease** coverage by `40.19%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2404/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2404?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2404       +/-   ##
   =============================================
   - Coverage     50.23%    10.04%    -40.20%
   + Complexity     2985        48      -2937
   =============================================
     Files           410        52       -358
     Lines         18398      1852     -16546
     Branches       1884       223      -1661
   =============================================
   - Hits           9242       186      -9056
   + Misses         8398      1653      -6745
   + Partials        758        13       -745
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `10.04% <ø> (-59.62%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2404?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2404/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | `0.00% 

[GitHub] [hudi] codecov-io edited a comment on pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering

2021-01-05 Thread GitBox


codecov-io edited a comment on pull request #2379:
URL: https://github.com/apache/hudi/pull/2379#issuecomment-751244130


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=h1) Report
   > Merging 
[#2379](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=desc) (70ffbba) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/6cdf59d92b1c260abae82bba7d30d8ac280bddbf?el=desc)
 (6cdf59d) will **decrease** coverage by `42.58%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2379/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=tree)
   
   ```diff
   @@             Coverage Diff             @@
   ##             master    #2379       +/-   ##
   ============================================
   - Coverage     52.23%     9.65%    -42.59%
   + Complexity     2662        48      -2614
   ============================================
     Files           335        53       -282
     Lines         14981      1927     -13054
     Branches       1506       231      -1275
   ============================================
   - Hits           7825       186      -7639
   + Misses         6533      1728      -4805
   + Partials        623        13       -610
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.65% <0.00%> (-60.01%)` | `0.00 <0.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...org/apache/hudi/utilities/HoodieClusteringJob.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZUNsdXN0ZXJpbmdKb2IuamF2YQ==)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `0.00% <0.00%> (-70.76%)` | `0.00 <0.00> (-49.00)` | |
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | ... and [312 
more](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree-more) 

[jira] [Created] (HUDI-1506) Fix wrong exception thrown in HoodieAvroUtils

2021-01-05 Thread wangxianghu (Jira)
wangxianghu created HUDI-1506:
-

 Summary: Fix wrong exception thrown in HoodieAvroUtils
 Key: HUDI-1506
 URL: https://issues.apache.org/jira/browse/HUDI-1506
 Project: Apache Hudi
  Issue Type: Bug
Reporter: wangxianghu
Assignee: wangxianghu


Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 4 in stage 4.0 failed 4 times, most recent failure: Lost task 4.3 in stage 
4.0 (TID 24, al-prd-dtp-data-lake-10-0-88-26, executor 4): 
org.apache.hudi.exception.HoodieException: etlDatetime(Part -etlDatetime) field 
not found in record. Acceptable fields were :[vin, uuid, commercialType, 
businessType, vehicleNo, plateColor, vehicleColor, engineId, nextFixDate, 
feePrintId, transArea, createTime, updateTime, registerDate, curVehicleNo, 
reportVehicleNo, model, checkDate, certifyDateA, certifyDateB, certificate, 
transAgency, transAgencyNet, transDateStart, transDateStop, insurCom, insurNum, 
insurType, insurCount, insurEff, insurExp, insurCreateTime, insurUpdateTime, 
curVehicleCertno, reportVehicleCertno, seats, brand, vehicleType, fuelType, 
engineDisplace, photo, enginePower, gpsBrand, gpsModel, gpsImei, 
gpsInstallDate, curDriverUuid, reportDrivers, curTimeOn, curTimeOff, timeFrom, 
timeTo, ownerName, fixState, checkState, photoId, photoIdUrl, fareType, 
wheelBase, vehicleTec, vehicleSafe, lesseeName, lesseeCode, sdcOperationType, 
hivePartition, etlDatetime]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1506) Fix wrong exception thrown in HoodieAvroUtils

2021-01-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1506:
-
Labels: pull-request-available  (was: )

> Fix wrong exception thrown in HoodieAvroUtils
> -
>
> Key: HUDI-1506
> URL: https://issues.apache.org/jira/browse/HUDI-1506
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> //
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 4 in stage 4.0 failed 4 times, most recent failure: Lost task 4.3 in 
> stage 4.0 (TID 24, al-prd-dtp-data-lake-10-0-88-26, executor 4): 
> org.apache.hudi.exception.HoodieException: etlDatetime(Part -etlDatetime) 
> field not found in record. Acceptable fields were :[vin, uuid, 
> commercialType, businessType, vehicleNo, plateColor, vehicleColor, engineId, 
> nextFixDate, feePrintId, transArea, createTime, updateTime, registerDate, 
> curVehicleNo, reportVehicleNo, model, checkDate, certifyDateA, certifyDateB, 
> certificate, transAgency, transAgencyNet, transDateStart, transDateStop, 
> insurCom, insurNum, insurType, insurCount, insurEff, insurExp, 
> insurCreateTime, insurUpdateTime, curVehicleCertno, reportVehicleCertno, 
> seats, brand, vehicleType, fuelType, engineDisplace, photo, enginePower, 
> gpsBrand, gpsModel, gpsImei, gpsInstallDate, curDriverUuid, reportDrivers, 
> curTimeOn, curTimeOff, timeFrom, timeTo, ownerName, fixState, checkState, 
> photoId, photoIdUrl, fareType, wheelBase, vehicleTec, vehicleSafe, 
> lesseeName, lesseeCode, sdcOperationType, hivePartition, etlDatetime]
> {code}
> We can see that `etlDatetime` does exist; the exception is actually caused by a null value.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] wangxianghu opened a new pull request #2404: [MINOR] Add Jira URL and Mailing List

2021-01-05 Thread GitBox


wangxianghu opened a new pull request #2404:
URL: https://github.com/apache/hudi/pull/2404


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua merged pull request #2403: [HUDI-913] Update docs about KeyGenerator

2021-01-05 Thread GitBox


yanghua merged pull request #2403:
URL: https://github.com/apache/hudi/pull/2403


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch asf-site updated: [HUDI-913] Update docs about KeyGenerator (#2403)

2021-01-05 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new ee00bd6  [HUDI-913] Update docs about KeyGenerator (#2403)
ee00bd6 is described below

commit ee00bd6689f151b7a6a1a197aced8699d373dd79
Author: wangxianghu 
AuthorDate: Tue Jan 5 18:21:08 2021 +0800

[HUDI-913] Update docs about KeyGenerator (#2403)
---
 docs/_docs/2_4_configurations.cn.md | 2 +-
 docs/_docs/2_4_configurations.md| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/_docs/2_4_configurations.cn.md 
b/docs/_docs/2_4_configurations.cn.md
index 9436971..07e8c02 100644
--- a/docs/_docs/2_4_configurations.cn.md
+++ b/docs/_docs/2_4_configurations.cn.md
@@ -83,7 +83,7 @@ inputDF.write()
   如果设置为true,则生成基于Hive格式的partition目录:=
 
  KEYGENERATOR_CLASS_OPT_KEY {#KEYGENERATOR_CLASS_OPT_KEY}
-  属性:`hoodie.datasource.write.keygenerator.class`, 默认值:`org.apache.hudi.SimpleKeyGenerator`
+  属性:`hoodie.datasource.write.keygenerator.class`, 默认值:`org.apache.hudi.keygen.SimpleKeyGenerator`
   键生成器类,实现从输入的`Row`对象中提取键
   
  COMMIT_METADATA_KEYPREFIX_OPT_KEY {#COMMIT_METADATA_KEYPREFIX_OPT_KEY}
diff --git a/docs/_docs/2_4_configurations.md b/docs/_docs/2_4_configurations.md
index aa472dd..244688b 100644
--- a/docs/_docs/2_4_configurations.md
+++ b/docs/_docs/2_4_configurations.md
@@ -81,7 +81,7 @@ Actual value ontained by invoking .toString()
   When set to true, partition folder names follow the format of Hive partitions: =
 
 KEYGENERATOR_CLASS_OPT_KEY {#KEYGENERATOR_CLASS_OPT_KEY}
-  Property: `hoodie.datasource.write.keygenerator.class`, Default: `org.apache.hudi.SimpleKeyGenerator`
+  Property: `hoodie.datasource.write.keygenerator.class`, Default: `org.apache.hudi.keygen.SimpleKeyGenerator`
   Key generator class, that implements will extract the key out of incoming `Row` object
   
  COMMIT_METADATA_KEYPREFIX_OPT_KEY {#COMMIT_METADATA_KEYPREFIX_OPT_KEY}
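For context, a hedged usage sketch of the corrected default: the option key and key generator class come from the diff above, while `inputDF` (a `Dataset<Row>`, as used elsewhere in the configuration docs), the field names, and the save path are illustrative placeholders.

```java
inputDF.write()
    .format("org.apache.hudi")
    .option("hoodie.datasource.write.keygenerator.class",
        "org.apache.hudi.keygen.SimpleKeyGenerator")    // note the keygen package
    .option("hoodie.datasource.write.recordkey.field", "uuid")           // illustrative
    .option("hoodie.datasource.write.partitionpath.field", "partition")  // illustrative
    .mode("append")
    .save("/tmp/hudi_table");                                            // illustrative
```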



[GitHub] [hudi] liujinhui1994 closed pull request #2386: [HUDI-1160] Support update partial fields for CoW table

2021-01-05 Thread GitBox


liujinhui1994 closed pull request #2386:
URL: https://github.com/apache/hudi/pull/2386


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch asf-site updated: Travis CI build asf-site

2021-01-05 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new f7ca68a  Travis CI build asf-site
f7ca68a is described below

commit f7ca68af86ff254adf064bd4becb4aca02a001cf
Author: CI 
AuthorDate: Tue Jan 5 12:20:10 2021 +

Travis CI build asf-site
---
 content/cn/docs/configurations.html | 2 +-
 content/docs/configurations.html| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/cn/docs/configurations.html 
b/content/cn/docs/configurations.html
index 1456a71..57aa993 100644
--- a/content/cn/docs/configurations.html
+++ b/content/cn/docs/configurations.html
@@ -450,7 +450,7 @@
   如果设置为true,则生成基于Hive格式的partition目录:=
 
 KEYGENERATOR_CLASS_OPT_KEY
-属性:hoodie.datasource.write.keygenerator.class, 默认值:org.apache.hudi.SimpleKeyGenerator
+属性:hoodie.datasource.write.keygenerator.class, 默认值:org.apache.hudi.keygen.SimpleKeyGenerator
   键生成器类,实现从输入的Row对象中提取键
 
 COMMIT_METADATA_KEYPREFIX_OPT_KEY
diff --git a/content/docs/configurations.html b/content/docs/configurations.html
index c274063..ef5f099 100644
--- a/content/docs/configurations.html
+++ b/content/docs/configurations.html
@@ -459,7 +459,7 @@ Actual value ontained by invoking .toString()
   When set to true, partition folder names follow the format of Hive partitions: =
 
 KEYGENERATOR_CLASS_OPT_KEY
-Property: hoodie.datasource.write.keygenerator.class, Default: org.apache.hudi.SimpleKeyGenerator
+Property: hoodie.datasource.write.keygenerator.class, Default: org.apache.hudi.keygen.SimpleKeyGenerator
   Key generator class, that implements will extract the key out of incoming Row object
 
 COMMIT_METADATA_KEYPREFIX_OPT_KEY



[GitHub] [hudi] codecov-io commented on pull request #2405: [HUDI-1506] Fix wrong exception thrown in HoodieAvroUtils

2021-01-05 Thread GitBox


codecov-io commented on pull request #2405:
URL: https://github.com/apache/hudi/pull/2405#issuecomment-754665459


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2405?src=pr=h1) Report
   > Merging 
[#2405](https://codecov.io/gh/apache/hudi/pull/2405?src=pr=desc) (b51e61e) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/698694a1571cdcc9848fc79aa34c8cbbf9662bc4?el=desc)
 (698694a) will **decrease** coverage by `40.19%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2405/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2405?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2405       +/-   ##
   =============================================
   - Coverage     50.23%    10.04%    -40.20%
   + Complexity     2985        48      -2937
   =============================================
     Files           410        52       -358
     Lines         18398      1852     -16546
     Branches       1884       223      -1661
   =============================================
   - Hits           9242       186      -9056
   + Misses         8398      1653      -6745
   + Partials        758        13       -745
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `10.04% <ø> (-59.62%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2405?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | `0.00% 

[GitHub] [hudi] vinothchandar commented on a change in pull request #2359: [HUDI-1486] Remove inflight rollback in hoodie writer

2021-01-05 Thread GitBox


vinothchandar commented on a change in pull request #2359:
URL: https://github.com/apache/hudi/pull/2359#discussion_r551618729



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java
##
@@ -232,17 +250,18 @@ void emitCommitMetrics(String instantTime, 
HoodieCommitMetadata metadata, String
* Main API to run bootstrap to hudi.
*/
   public void bootstrap(Option<Map<String, String>> extraMetadata) {
-if (rollbackPending) {
-  rollBackInflightBootstrap();
-}
+// TODO : MULTIWRITER -> check if failed bootstrap files can be cleaned 
later
 HoodieTable table = 
getTableAndInitCtx(WriteOperationType.UPSERT, 
HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS);
+if (isEager(config.getFailedWritesCleanPolicy())) {

Review comment:
   `config` is a member variable, right? Why do we pass it in to the checks? 
Can we just do `eagerCleanFailedWrites()`, which does the if block and the call 
to `rollbackFailedBootstrap()`?
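   A self-contained sketch of the refactor this comment suggests; every name except the idea itself is illustrative, not Hudi's actual API:
   ```java
   // Since the clean policy lives on a member, the policy check plus the
   // bootstrap rollback can collapse into one private helper, keeping call
   // sites to a single line.
   class WriteClientSketch {
     enum CleanPolicy { EAGER, LAZY }

     private final CleanPolicy failedWritesCleanPolicy;

     WriteClientSketch(CleanPolicy policy) {
       this.failedWritesCleanPolicy = policy;
     }

     void bootstrap() {
       eagerCleanFailedWrites();   // one-line call site, as the review asks
       // ... proceed with bootstrap ...
     }

     private void eagerCleanFailedWrites() {
       if (failedWritesCleanPolicy == CleanPolicy.EAGER) {
         rollbackFailedBootstrap();
       }
     }

     private void rollbackFailedBootstrap() {
       System.out.println("rolling back failed bootstrap (illustrative no-op)");
     }

     public static void main(String[] args) {
       new WriteClientSketch(CleanPolicy.EAGER).bootstrap();
     }
   }
   ```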

##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieClient.java
##
@@ -70,6 +72,8 @@ protected AbstractHoodieClient(HoodieEngineContext context, 
HoodieWriteConfig cl
 this.config = clientConfig;
 this.timelineServer = timelineServer;
 shouldStopTimelineServer = !timelineServer.isPresent();
+this.heartbeatClient = new HoodieHeartbeatClient(this.fs, this.basePath,
+clientConfig.getHoodieClientHeartbeatIntervalInMs(), 
clientConfig.getHoodieClientHeartbeatTolerableMisses());

Review comment:
   rename to be shorter? drop the `HoodieClient` or `Hoodie` part in the 
names?
   

##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java
##
@@ -707,24 +739,51 @@ public void rollbackInflightCompaction(HoodieInstant 
inflightInstant, HoodieTabl
   }
 
   /**
-   * Cleanup all pending commits.
+   * Rollback all failed commits.
*/
-  private void rollbackPendingCommits() {
+  public void rollbackFailedCommits() {
 HoodieTable table = createTable(config, hadoopConf);
-HoodieTimeline inflightTimeline = 
table.getMetaClient().getCommitsTimeline().filterPendingExcludingCompaction();
-List<String> commits = 
inflightTimeline.getReverseOrderedInstants().map(HoodieInstant::getTimestamp)
-.collect(Collectors.toList());
-for (String commit : commits) {
-  if (HoodieTimeline.compareTimestamps(commit, 
HoodieTimeline.LESSER_THAN_OR_EQUALS,
+List<String> instantsToRollback = getInstantsToRollback(table);
+for (String instant : instantsToRollback) {
+  if (HoodieTimeline.compareTimestamps(instant, 
HoodieTimeline.LESSER_THAN_OR_EQUALS,
   HoodieTimeline.FULL_BOOTSTRAP_INSTANT_TS)) {
-rollBackInflightBootstrap();
+rollbackFailedBootstrap();
 break;
   } else {
-rollback(commit);
+rollback(instant);
+  }
+  // Delete the heartbeats from DFS
+  if (!isEager(config.getFailedWritesCleanPolicy())) {
+try {
+  this.heartbeatClient.delete(instant);
+} catch (IOException io) {
+  LOG.error(io);
+}
   }
 }
   }
 
+  private List<String> getInstantsToRollback(HoodieTable table) {
+if (isEager(config.getFailedWritesCleanPolicy())) {
+  HoodieTimeline inflightTimeline = 
table.getMetaClient().getCommitsTimeline().filterPendingExcludingCompaction();
+  return 
inflightTimeline.getReverseOrderedInstants().map(HoodieInstant::getTimestamp)
+  .collect(Collectors.toList());
+} else if (!isEager(config.getFailedWritesCleanPolicy())) {
+  return table.getMetaClient().getActiveTimeline()
+  
.getCommitsTimeline().filterInflights().getReverseOrderedInstants().filter(instant
 -> {
+try {
+  return 
heartbeatClient.isHeartbeatExpired(instant.getTimestamp(), 
System.currentTimeMillis());

Review comment:
   drop the second argument to `isHeartbeatExpired()` and have it get the 
current time epoch inside?
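   A minimal sketch of that suggestion, with illustrative names only (this is not Hudi's actual HoodieHeartbeatClient API): the clock read moves inside the expiry check, so callers pass just the last-heartbeat timestamp.
   ```java
   import java.util.concurrent.TimeUnit;

   class HeartbeatSketch {
     private final long intervalMs;
     private final int tolerableMisses;

     HeartbeatSketch(long intervalMs, int tolerableMisses) {
       this.intervalMs = intervalMs;
       this.tolerableMisses = tolerableMisses;
     }

     boolean isHeartbeatExpired(long lastHeartbeatEpochMs) {
       long now = System.currentTimeMillis();   // fetched here, not passed in
       return now - lastHeartbeatEpochMs > intervalMs * tolerableMisses;
     }

     public static void main(String[] args) {
       HeartbeatSketch hb = new HeartbeatSketch(TimeUnit.MINUTES.toMillis(1), 3);
       long fiveMinutesAgo = System.currentTimeMillis() - TimeUnit.MINUTES.toMillis(5);
       System.out.println(hb.isHeartbeatExpired(fiveMinutesAgo));   // true
     }
   }
   ```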

##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java
##
@@ -707,24 +739,51 @@ public void rollbackInflightCompaction(HoodieInstant 
inflightInstant, HoodieTabl
   }
 
   /**
-   * Cleanup all pending commits.
+   * Rollback all failed commits.
*/
-  private void rollbackPendingCommits() {
+  public void rollbackFailedCommits() {
 HoodieTable table = createTable(config, hadoopConf);
-HoodieTimeline inflightTimeline = 
table.getMetaClient().getCommitsTimeline().filterPendingExcludingCompaction();
-List commits = 
inflightTimeline.getReverseOrderedInstants().map(HoodieInstant::getTimestamp)
-.collect(Collectors.toList());
-for (String commit : commits) {
-  if (HoodieTimeline.compareTimestamps(commit, 
HoodieTimeline.LESSER_THAN_OR_EQUALS,
+List<String> instantsToRollback = getInstantsToRollback(table);
+

[jira] [Updated] (HUDI-1479) Replace FSUtils.getAllPartitionPaths() with HoodieTableMetadata#getAllPartitionPaths()

2021-01-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1479:
-
Description: 
*Change #1*
{code:java}
public static List<String> getAllPartitionPaths(FileSystem fs, String 
basePathStr, boolean useFileListingFromMetadata, boolean verifyListings,
  boolean 
assumeDatePartitioning) throws IOException {
if (assumeDatePartitioning) {
  return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr);
} else {
  HoodieTableMetadata tableMetadata = 
HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/", 
useFileListingFromMetadata,
  verifyListings, false, false);
  return tableMetadata.getAllPartitionPaths();
}
 }
{code}
is the current implementation, where `HoodieTableMetadata.create()` always 
creates `HoodieBackedTableMetadata`. Instead, we should create 
`FileSystemBackedTableMetadata` when useFileListingFromMetadata==false anyway.

*Change #2*

On master, we have the `HoodieEngineContext` abstraction, which allows for 
parallel execution. We should consider moving it to `hudi-common` (it's doable) 
and then rework `FileSystemBackedTableMetadata` so that it can do 
parallelized listings using the passed-in engine, either 
HoodieSparkEngineContext or HoodieJavaEngineContext. 
HoodieBackedTableMetadata#getPartitionsToFilesMapping has some parallelized 
code; we should take one pass and see if that can be redone a bit as well.

 

*Change #3*

There are places where we call fs.listStatus() directly. We should make them 
go through the HoodieTable.getMetadata()... route as well. Essentially, all 
listing should be concentrated in `FileSystemBackedTableMetadata`.

!image-2021-01-05-10-00-35-187.png!
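A minimal sketch of Change #1 under stated assumptions: a factory that returns a plain file-system lister when the metadata table is disabled, instead of always building the metadata-table-backed reader. The class names below are illustrative stand-ins for HoodieTableMetadata, HoodieBackedTableMetadata, and FileSystemBackedTableMetadata; only the branching is the point.
{code:java}
import java.util.Arrays;
import java.util.List;

interface TableMetadata {
  List<String> getAllPartitionPaths();
}

class MetadataTableBacked implements TableMetadata {
  public List<String> getAllPartitionPaths() {
    return Arrays.asList("2021/01/05");   // would read the metadata table
  }
}

class FileSystemBacked implements TableMetadata {
  public List<String> getAllPartitionPaths() {
    return Arrays.asList("2021/01/05");   // would list the file system directly
  }
}

public class MetadataFactory {
  // Change #1: only build the metadata-table-backed reader when it is enabled.
  static TableMetadata create(boolean useFileListingFromMetadata) {
    return useFileListingFromMetadata
        ? new MetadataTableBacked()
        : new FileSystemBacked();
  }

  public static void main(String[] args) {
    System.out.println(MetadataFactory.create(false).getAllPartitionPaths());
  }
}
{code}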

  was:
- Can be done once rfc-15 is merged into master. 

 - We will also use HoodieEngineContext and perform parallelized listing as 
needed. 
 - HoodieBackedTableMetadata#getPartitionsToFilesMapping has some parallelized 
code
 - Chase down direct usages of  FileSystemBackedTableMetadata and parallelize 
them as well


> Replace FSUtils.getAllPartitionPaths() with 
> HoodieTableMetadata#getAllPartitionPaths()
> --
>
> Key: HUDI-1479
> URL: https://issues.apache.org/jira/browse/HUDI-1479
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.7.0
>
> Attachments: image-2021-01-05-10-00-35-187.png
>
>
> *Change #1*
> {code:java}
> public static List<String> getAllPartitionPaths(FileSystem fs, String 
> basePathStr, boolean useFileListingFromMetadata, boolean verifyListings,
>   boolean 
> assumeDatePartitioning) throws IOException {
> if (assumeDatePartitioning) {
>   return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr);
> } else {
>   HoodieTableMetadata tableMetadata = 
> HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/", 
> useFileListingFromMetadata,
>   verifyListings, false, false);
>   return tableMetadata.getAllPartitionPaths();
> }
>  }
> {code}
> is the current implementation, where `HoodieTableMetadata.create()` always 
> creates `HoodieBackedTableMetadata`. Instead, we should create 
> `FileSystemBackedTableMetadata` when useFileListingFromMetadata==false anyway.
> *Change #2*
> On master, we have the `HoodieEngineContext` abstraction, which allows for 
> parallel execution. We should consider moving it to `hudi-common` (it's 
> doable) and then rework `FileSystemBackedTableMetadata` so that it can 
> do parallelized listings using the passed-in engine, either 
> HoodieSparkEngineContext or HoodieJavaEngineContext. 
> HoodieBackedTableMetadata#getPartitionsToFilesMapping has some parallelized 
> code; we should take one pass and see if that can be redone a bit as well.  
>  
> *Change #3*
> There are places where we call fs.listStatus() directly. We should make them 
> go through the HoodieTable.getMetadata()... route as well. Essentially, all 
> listing should be concentrated in `FileSystemBackedTableMetadata`
> !image-2021-01-05-10-00-35-187.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1479) Replace FSUtils.getAllPartitionPaths() with HoodieTableMetadata#getAllPartitionPaths()

2021-01-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1479:
-
Attachment: image-2021-01-05-10-00-35-187.png

> Replace FSUtils.getAllPartitionPaths() with 
> HoodieTableMetadata#getAllPartitionPaths()
> --
>
> Key: HUDI-1479
> URL: https://issues.apache.org/jira/browse/HUDI-1479
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.7.0
>
> Attachments: image-2021-01-05-10-00-35-187.png
>
>
> - Can be done once rfc-15 is merged into master. 
>  - We will also use HoodieEngineContext and perform parallelized listing as 
> needed. 
>  - HoodieBackedTableMetadata#getPartitionsToFilesMapping has some 
> parallelized code
>  - Chase down direct usages of  FileSystemBackedTableMetadata and parallelize 
> them as well



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan commented on a change in pull request #2400: [WIP] Some fixes to test suite framework. Adding clustering node

2021-01-05 Thread GitBox


nsivabalan commented on a change in pull request #2400:
URL: https://github.com/apache/hudi/pull/2400#discussion_r552072132



##
File path: docker/demo/config/test-suite/complex-dag-cow.yaml
##
@@ -14,41 +14,47 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 dag_name: cow-long-running-example.yaml
-dag_rounds: 2
+dag_rounds: 1

Review comment:
   ignore reviewing these two files for now:
   complex-dag-cow.yaml
   and complex-dag-mor.yaml
   I am yet to fix these. 

##
File path: 
hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/dag/nodes/ValidateDatasetNode.java
##
@@ -111,17 +111,19 @@ public void execute(ExecutionContext context) throws 
Exception {
   throw new AssertionError("Hudi contents does not match contents input 
data. ");
 }
 
-String database = 
context.getWriterContext().getProps().getString(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY());
-String tableName = 
context.getWriterContext().getProps().getString(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY());
-log.warn("Validating hive table with db : " + database + " and table : " + 
tableName);
-Dataset<Row> cowDf = session.sql("SELECT * FROM " + database + "." + 
tableName);
-Dataset<Row> trimmedCowDf = 
cowDf.drop(HoodieRecord.COMMIT_TIME_METADATA_FIELD).drop(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD).drop(HoodieRecord.RECORD_KEY_METADATA_FIELD)
-
.drop(HoodieRecord.PARTITION_PATH_METADATA_FIELD).drop(HoodieRecord.FILENAME_METADATA_FIELD);
-intersectionDf = inputSnapshotDf.intersect(trimmedDf);
-// the intersected df should be same as inputDf. if not, there is some 
mismatch.
-if (inputSnapshotDf.except(intersectionDf).count() != 0) {
-  log.error("Data set validation failed for COW hive table. Total count in 
hudi " + trimmedCowDf.count() + ", input df count " + inputSnapshotDf.count());
-  throw new AssertionError("Hudi hive table contents does not match 
contents input data. ");
+if (config.isValidateHive()) {

Review comment:
   no changes except adding this if condition

##
File path: 
hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/FlexibleSchemaRecordGenerationIterator.java
##
@@ -41,21 +46,17 @@
   private GenericRecord lastRecord;
   // Partition path field name
  private Set<String> partitionPathFieldNames;
-  private String firstPartitionPathField;
 
   public FlexibleSchemaRecordGenerationIterator(long maxEntriesToProduce, 
String schema) {
 this(maxEntriesToProduce, 
GenericRecordFullPayloadGenerator.DEFAULT_PAYLOAD_SIZE, schema, null, 0);
   }
 
   public FlexibleSchemaRecordGenerationIterator(long maxEntriesToProduce, int 
minPayloadSize, String schemaStr,
-  List<String> partitionPathFieldNames, int numPartitions) {
+  List<String> partitionPathFieldNames, int partitionIndex) {

Review comment:
   all changes in this file are part of reverting 
e33a8f733c4a9a94479c166ad13ae9d53142cd3f

##
File path: 
hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/DeltaGenerator.java
##
@@ -127,9 +127,13 @@ public DeltaGenerator(DFSDeltaConfig deltaOutputConfig, 
JavaSparkContext jsc, Sp
 return ws;
   }
 
+  public int getBatchId() {
+return this.batchId;
+  }
+
  public JavaRDD<GenericRecord> generateInserts(Config operation) {
 int numPartitions = operation.getNumInsertPartitions();
-long recordsPerPartition = operation.getNumRecordsInsert();
+long recordsPerPartition = operation.getNumRecordsInsert() / numPartitions;

Review comment:
   these are part of reverting e33a8f733c4a9a94479c166ad13ae9d53142cd3f

##
File path: 
hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/DeltaGenerator.java
##
@@ -140,7 +144,7 @@ public DeltaGenerator(DFSDeltaConfig deltaOutputConfig, 
JavaSparkContext jsc, Sp
 JavaRDD<GenericRecord> inputBatch = jsc.parallelize(partitionIndexes, 
numPartitions)
 .mapPartitionsWithIndex((index, p) -> {
   return new LazyRecordGeneratorIterator(new 
FlexibleSchemaRecordGenerationIterator(recordsPerPartition,
-  minPayloadSize, schemaStr, partitionPathFieldNames, 
numPartitions));
+minPayloadSize, schemaStr, partitionPathFieldNames, 
(Integer)index));

Review comment:
   these are part of reverting e33a8f733c4a9a94479c166ad13ae9d53142cd3f

##
File path: 
hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/GenericRecordFullPayloadGenerator.java
##
@@ -46,9 +46,9 @@
  */
 public class GenericRecordFullPayloadGenerator implements Serializable {
 
-  private static final Logger LOG = 
LoggerFactory.getLogger(GenericRecordFullPayloadGenerator.class);
+  private static Logger LOG = 
LoggerFactory.getLogger(GenericRecordFullPayloadGenerator.class);

Review comment:
   all changes in this file are part of reverting 
e33a8f733c4a9a94479c166ad13ae9d53142cd3f

##
File path: 

[GitHub] [hudi] nsivabalan commented on pull request #2400: Some fixes and enhancements to test suite framework

2021-01-05 Thread GitBox


nsivabalan commented on pull request #2400:
URL: https://github.com/apache/hudi/pull/2400#issuecomment-754812328


   @n3nash : Patch is ready for review. 
   @satishkotha : I have added the clustering node. Do check it out. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1459) Support for handling of REPLACE instants

2021-01-05 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259074#comment-17259074
 ] 

Vinoth Chandar commented on HUDI-1459:
--

[~pwason] [~satishkotha] 

several users are reporting this when trying to use SaveMode.Overwrite on the 
Spark datasource with the metadata table turned on

{code}
org.apache.hudi.exception.HoodieException: Unknown type of action replacecommit
  at 
org.apache.hudi.metadata.HoodieTableMetadataUtil.convertInstantToMetaRecords(HoodieTableMetadataUtil.java:96)
  at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.syncFromInstants(HoodieBackedTableMetadataWriter.java:372)
  at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.(HoodieBackedTableMetadataWriter.java:120)
  at 
org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.(SparkHoodieBackedTableMetadataWriter.java:62)
  at 
org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.create(SparkHoodieBackedTableMetadataWriter.java:58)
  at 
org.apache.hudi.client.SparkRDDWriteClient.syncTableMetadata(SparkRDDWriteClient.java:406)
  at 
org.apache.hudi.client.AbstractHoodieWriteClient.postCommit(AbstractHoodieWriteClient.java:417)
  at 
org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:189)
  at 
org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:108)
  at 
org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:439)
  at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:224)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:129)
  at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:173)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:169)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:197)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:194)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:169)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:114)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:112)
  at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
  at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
  at 
org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$executeQuery$1(SQLExecution.scala:83)
  at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1$$anonfun$apply$1.apply(SQLExecution.scala:94)
  at 
org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
  at 
org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$withMetrics(SQLExecution.scala:178)
  at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:93)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:200)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
  at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:249)
  ... 57 elided
{code}
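
For reference, a minimal sketch of what the missing branch in 
HoodieTableMetadataUtil.convertInstantToMetaRecords could look like, written by 
analogy with the existing commit handling. HoodieReplaceCommitMetadata's 
deserialization and the convertMetadataToRecords helper are assumptions here, 
not the merged fix:

{code:java}
// Sketch only: extend the per-action switch with a replacecommit branch.
// Variable names (timeline, instant, records) follow the surrounding method.
case HoodieTimeline.REPLACE_COMMIT_ACTION: {
  HoodieReplaceCommitMetadata replaceMetadata = HoodieReplaceCommitMetadata.fromBytes(
      timeline.getInstantDetails(instant).get(), HoodieReplaceCommitMetadata.class);
  // Index the newly written files exactly like a commit; removal of the
  // replaced file groups is left to the cleaner.
  records = Option.of(convertMetadataToRecords(replaceMetadata, instant.getTimestamp()));
  break;
}
{code}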

> Support for handling of REPLACE instants
> 
>
> Key: HUDI-1459
> URL: https://issues.apache.org/jira/browse/HUDI-1459
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Prashant Wason
>Priority: Blocker
> Fix For: 0.7.0
>
>
> Once we rebase to master, we need to handle replace instants as well, as they 
> show up on the timeline. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1479) Replace FSUtils.getAllPartitionPaths() with HoodieTableMetadata#getAllPartitionPaths()

2021-01-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1479:
-
Description: 
*Change #1*
{code:java}
public static List<String> getAllPartitionPaths(FileSystem fs, String 
basePathStr, boolean useFileListingFromMetadata, boolean verifyListings,
  boolean 
assumeDatePartitioning) throws IOException {
if (assumeDatePartitioning) {
  return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr);
} else {
  HoodieTableMetadata tableMetadata = 
HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/", 
useFileListingFromMetadata,
  verifyListings, false, false);
  return tableMetadata.getAllPartitionPaths();
}
 }
{code}
is the current implementation, where `HoodieTableMetadata.create()` always 
creates `HoodieBackedTableMetadata`. Instead we should create 
`FileSystemBackedTableMetadata` if useFileListingFromMetadata==false anyways. 
This helps address https://github.com/apache/hudi/pull/2398/files#r550709687

*Change #2*

On master, we have the `HoodieEngineContext` abstraction, which allows for 
parallel execution. We should consider moving it to `hudi-common` (its doable) 
and then have `FileSystemBackedTableMetadata` redone such that it can do 
parallelized listings using the passed in engine. either 
HoodieSparkEngineContext or HoodieJavaEngineContext. 
HoodieBackedTableMetadata#getPartitionsToFilesMapping has some parallelized 
code. We should take one pass and see if that can be redone a bit as well.  
Food for thought: https://github.com/apache/hudi/pull/2398#discussion_r550711216

 

*Change #3*

There are places, where we call fs.listStatus() directly. We should make them 
go through the HoodieTable.getMetadata()... route as well. Essentially, all 
listing should be concentrated to `FileSystemBackedTableMetadata`

!image-2021-01-05-10-00-35-187.png!

  was:
*Change #1*
{code:java}
public static List<String> getAllPartitionPaths(FileSystem fs, String 
basePathStr, boolean useFileListingFromMetadata, boolean verifyListings,
  boolean 
assumeDatePartitioning) throws IOException {
if (assumeDatePartitioning) {
  return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr);
} else {
  HoodieTableMetadata tableMetadata = 
HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/", 
useFileListingFromMetadata,
  verifyListings, false, false);
  return tableMetadata.getAllPartitionPaths();
}
 }
{code}
is the current implementation, where `HoodieTableMetadata.create()` always 
creates `HoodieBackedTableMetadata`. Instead we should create 
`FileSystemBackedTableMetadata` if useFileListingFromMetadata==false anyways

*Change #2*

On master, we have the `HoodieEngineContext` abstraction, which allows for 
parallel execution. We should consider moving it to `hudi-common` (its doable) 
and then have `FileSystemBackedTableMetadata` redone such that it can do 
parallelized listings using the passed in engine. either 
HoodieSparkEngineContext or HoodieJavaEngineContext. 
HoodieBackedTableMetadata#getPartitionsToFilesMapping has some parallelized 
code. We should take one pass and see if that can be redone a bit as well.  

 

*Change #3*

There are places, where we call fs.listStatus() directly. We should make them 
go through the HoodieTable.getMetadata()... route as well. Essentially, all 
listing should be concentrated to `FileSystemBackedTableMetadata`

!image-2021-01-05-10-00-35-187.png!


> Replace FSUtils.getAllPartitionPaths() with 
> HoodieTableMetadata#getAllPartitionPaths()
> --
>
> Key: HUDI-1479
> URL: https://issues.apache.org/jira/browse/HUDI-1479
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.7.0
>
> Attachments: image-2021-01-05-10-00-35-187.png
>
>
> *Change #1*
> {code:java}
> public static List<String> getAllPartitionPaths(FileSystem fs, String 
> basePathStr, boolean useFileListingFromMetadata, boolean verifyListings,
>   boolean 
> assumeDatePartitioning) throws IOException {
> if (assumeDatePartitioning) {
>   return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr);
> } else {
>   HoodieTableMetadata tableMetadata = 
> HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/", 
> useFileListingFromMetadata,
>   verifyListings, false, false);
>   return tableMetadata.getAllPartitionPaths();
> }
>  }
> {code}
> is the current implementation, where `HoodieTableMetadata.create()` always 
> creates `HoodieBackedTableMetadata`. 

[jira] [Commented] (HUDI-1308) Issues found during testing RFC-15

2021-01-05 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259072#comment-17259072
 ] 

Vinoth Chandar commented on HUDI-1308:
--

More testing on S3 from [~vbalaji]

{code}
Caused by: org.apache.hudi.exception.HoodieIOException: getFileStatus on 
s3://robinhood-encrypted-hudi-data-cove/dummy/balaji/sickle/public/client_ledger_clientledgerbalance/test5_v4/.hoodie/.aux/.bootstrap/.partitions/-----0_1-0-1_01.hfile:
 com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout 
waiting for connection from pool
{code}
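
A common mitigation for this kind of connection-pool exhaustion, assuming the 
Hadoop S3A connector is in play (the EMRFS s3:// client uses different keys), 
is to raise the connector's pool limits:

{code}
# Illustrative values only; tune to the job's listing parallelism.
fs.s3a.connection.maximum=200
fs.s3a.threads.max=100
{code}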

> Issues found during testing RFC-15
> --
>
> Key: HUDI-1308
> URL: https://issues.apache.org/jira/browse/HUDI-1308
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.7.0
>
>
> THis is an umbrella ticket containing all the issues found during testing 
> RFC-15



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1479) Replace FSUtils.getAllPartitionPaths() with HoodieTableMetadata#getAllPartitionPaths()

2021-01-05 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259090#comment-17259090
 ] 

Vinoth Chandar commented on HUDI-1479:
--

[~uditme] I have updated the description with detailed steps. I think this 
will be a super impactful change. Let me know if you are able to pick this up.

> Replace FSUtils.getAllPartitionPaths() with 
> HoodieTableMetadata#getAllPartitionPaths()
> --
>
> Key: HUDI-1479
> URL: https://issues.apache.org/jira/browse/HUDI-1479
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.7.0
>
> Attachments: image-2021-01-05-10-00-35-187.png
>
>
> *Change #1*
> {code:java}
> public static List<String> getAllPartitionPaths(FileSystem fs, String 
> basePathStr, boolean useFileListingFromMetadata, boolean verifyListings,
>   boolean 
> assumeDatePartitioning) throws IOException {
> if (assumeDatePartitioning) {
>   return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr);
> } else {
>   HoodieTableMetadata tableMetadata = 
> HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/", 
> useFileListingFromMetadata,
>   verifyListings, false, false);
>   return tableMetadata.getAllPartitionPaths();
> }
>  }
> {code}
> is the current implementation, where `HoodieTableMetadata.create()` always 
> creates `HoodieBackedTableMetadata`. Instead we should create 
> `FileSystemBackedTableMetadata` if useFileListingFromMetadata==false anyways. 
> This helps address https://github.com/apache/hudi/pull/2398/files#r550709687
> *Change #2*
> On master, we have the `HoodieEngineContext` abstraction, which allows for 
> parallel execution. We should consider moving it to `hudi-common` (its 
> doable) and then have `FileSystemBackedTableMetadata` redone such that it can 
> do parallelized listings using the passed in engine. either 
> HoodieSparkEngineContext or HoodieJavaEngineContext. 
> HoodieBackedTableMetadata#getPartitionsToFilesMapping has some parallelized 
> code. We should take one pass and see if that can be redone a bit as well.  
> Food for thought: 
> https://github.com/apache/hudi/pull/2398#discussion_r550711216
>  
> *Change #3*
> There are places, where we call fs.listStatus() directly. We should make them 
> go through the HoodieTable.getMetadata()... route as well. Essentially, all 
> listing should be concentrated to `FileSystemBackedTableMetadata`
> !image-2021-01-05-10-00-35-187.png!
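
To make Change #1 concrete, here is a minimal sketch of the proposed dispatch. 
The FileSystemBackedTableMetadata constructor arguments are an assumption, not 
a final signature:

{code:java}
// Sketch: only build the metadata-table-backed listing when it is enabled;
// otherwise fall back to plain file-system listing.
if (useFileListingFromMetadata) {
  return HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/",
      true, verifyListings, false, false).getAllPartitionPaths();
}
return new FileSystemBackedTableMetadata(new SerializableConfiguration(fs.getConf()),
    basePathStr).getAllPartitionPaths();
{code}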



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] codecov-io edited a comment on pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering

2021-01-05 Thread GitBox


codecov-io edited a comment on pull request #2379:
URL: https://github.com/apache/hudi/pull/2379#issuecomment-751244130







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #2405: [HUDI-1506] Fix wrong exception thrown in HoodieAvroUtils

2021-01-05 Thread GitBox


yanghua commented on a change in pull request #2405:
URL: https://github.com/apache/hudi/pull/2405#discussion_r551988234



##
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##
@@ -428,10 +429,14 @@ public static Object getNestedFieldVal(GenericRecord 
record, String fieldName, b
 
 if (returnNullIfNotFound) {
   return null;
-} else {
+}
+
+if (recordSchema.getField(parts[i]) == null) {

Review comment:
   Can it be `else if`?

##
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##
@@ -428,10 +429,14 @@ public static Object getNestedFieldVal(GenericRecord 
record, String fieldName, b
 
 if (returnNullIfNotFound) {
   return null;
-} else {
+}
+
+if (recordSchema.getField(parts[i]) == null) {
   throw new HoodieException(
   fieldName + "(Part -" + parts[i] + ") field not found in record. 
Acceptable fields were :"
   + 
valueNode.getSchema().getFields().stream().map(Field::name).collect(Collectors.toList()));

Review comment:
   Can we reuse `recordSchema` here?
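
   Putting the two suggestions together, the reviewed block would read roughly 
as follows (a sketch of the reviewer's proposal, not the merged code):

   ```java
   if (returnNullIfNotFound) {
     return null;
   } else if (recordSchema.getField(parts[i]) == null) {
     throw new HoodieException(
         fieldName + "(Part -" + parts[i] + ") field not found in record. Acceptable fields were :"
             + recordSchema.getFields().stream().map(Field::name).collect(Collectors.toList()));
   }
   ```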





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering

2021-01-05 Thread GitBox


codecov-io edited a comment on pull request #2379:
URL: https://github.com/apache/hudi/pull/2379#issuecomment-751244130


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=h1) Report
   > Merging 
[#2379](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=desc) (4cc4b35) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/6cdf59d92b1c260abae82bba7d30d8ac280bddbf?el=desc)
 (6cdf59d) will **decrease** coverage by `42.58%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2379/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2379       +/-   ##
   =============================================
   - Coverage     52.23%    9.65%    -42.59%     
   + Complexity     2662       48      -2614     
   =============================================
     Files           335       53       -282     
     Lines         14981     1927     -13054     
     Branches       1506      231      -1275     
   =============================================
   - Hits           7825      186      -7639     
   + Misses         6533     1728      -4805     
   + Partials        623       13       -610     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.65% <0.00%> (-60.01%)` | `0.00 <0.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...org/apache/hudi/utilities/HoodieClusteringJob.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZUNsdXN0ZXJpbmdKb2IuamF2YQ==)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `0.00% <0.00%> (-70.76%)` | `0.00 <0.00> (-49.00)` | |
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | ... and [312 
more](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree-more) 

[GitHub] [hudi] afilipchik commented on a change in pull request #2380: [Hudi 73] Adding support for vanilla AvroKafkaSource

2021-01-05 Thread GitBox


afilipchik commented on a change in pull request #2380:
URL: https://github.com/apache/hudi/pull/2380#discussion_r552037619



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/serde/AbstractHoodieKafkaAvroDeserializer.java
##
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.serde;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import 
org.apache.hudi.utilities.serde.config.HoodieKafkaAvroDeserializationConfig;
+
+import kafka.utils.VerifiableProperties;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericDatumReader;
+import org.apache.avro.io.DatumReader;
+import org.apache.avro.io.DecoderFactory;
+import org.apache.avro.specific.SpecificDatumReader;
+import org.apache.kafka.common.errors.SerializationException;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Properties;
+
+import static 
org.apache.hudi.utilities.serde.config.HoodieKafkaAvroDeserializationConfig.SCHEMA_PROVIDER_CLASS_PROP;
+
+public class AbstractHoodieKafkaAvroDeserializer {
+
+  private final DecoderFactory decoderFactory = DecoderFactory.get();
+  private boolean useSpecificAvroReader = false;
+  private Schema sourceSchema;
+  private Schema targetSchema;
+
+  public AbstractHoodieKafkaAvroDeserializer(VerifiableProperties properties) {
+// this.sourceSchema = new 
Schema.Parser().parse(properties.props().getProperty(FilebasedSchemaProvider.Config.SOURCE_SCHEMA_PROP));
+TypedProperties typedProperties = new TypedProperties();
+copyProperties(typedProperties, properties.props());
+try {
+  SchemaProvider schemaProvider = UtilHelpers.createSchemaProvider(
+  typedProperties.getString(SCHEMA_PROVIDER_CLASS_PROP), 
typedProperties, null);
+  this.sourceSchema = 
Objects.requireNonNull(schemaProvider).getSourceSchema();
+  this.targetSchema = 
Objects.requireNonNull(schemaProvider).getTargetSchema();
+} catch (IOException e) {
+  throw new RuntimeException(e);
+}
+  }
+
+  private void copyProperties(TypedProperties typedProperties, Properties 
properties) {
+for (Map.Entry<Object, Object> entry : properties.entrySet()) {
+  typedProperties.put(entry.getKey(), entry.getValue());
+}
+  }
+
+  protected void configure(HoodieKafkaAvroDeserializationConfig config) {
+useSpecificAvroReader = config
+
.getBoolean(HoodieKafkaAvroDeserializationConfig.SPECIFIC_AVRO_READER_CONFIG);
+  }
+
+  protected Object deserialize(byte[] payload) throws SerializationException {
+return deserialize(null, null, payload, targetSchema);
+  }
+
+  /**
+   * Just like single-parameter version but accepts an Avro schema to use for 
reading.
+   *
+   * @param payload serialized data
+   * @param readerSchema schema to use for Avro read (optional, enables Avro 
projection)
+   * @return the deserialized object
+   */
+  protected Object deserialize(byte[] payload, Schema readerSchema) throws 
SerializationException {
+return deserialize(null, null, payload, readerSchema);
+  }
+
+  protected Object deserialize(String topic, Boolean isKey, byte[] payload, 
Schema readerSchema) {

Review comment:
   It should be coming from the configured schema provider. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1507) Hive sync having issues w/ Clustering

2021-01-05 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-1507:
-

 Summary: Hive sync having issues w/ Clustering
 Key: HUDI-1507
 URL: https://issues.apache.org/jira/browse/HUDI-1507
 Project: Apache Hudi
  Issue Type: Bug
  Components: Storage Management
Affects Versions: 0.7.0
Reporter: sivabalan narayanan


I was trying out clustering w/ test suite job and ran into hive sync issues.

 

21/01/05 16:45:05 WARN DagNode: Executing ClusteringNode node 
5522853c-653b-4d92-acf4-d299c263a77f

21/01/05 16:45:05 WARN AbstractHoodieWriteClient: Scheduling clustering at 
instant time :20210105164505 clustering strategy 
org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy,
 clustering sort cols : _row_key, target partitions for clustering :: 0, inline 
cluster max commit : 1

21/01/05 16:45:05 WARN HoodieTestSuiteWriter: Clustering instant :: 
20210105164505

21/01/05 16:45:22 WARN DagScheduler: Executing node "second_hive_sync" :: 
\{"queue_name":"adhoc","engine":"mr","name":"80325009-bb92-4df5-8c34-71bd75d001b8","config":"second_hive_sync"}

21/01/05 16:45:22 ERROR HiveSyncTool: Got runtime exception when hive syncing

org.apache.hudi.exception.HoodieIOException: unknown action in timeline 
replacecommit

 at 
org.apache.hudi.common.table.timeline.TimelineUtils.lambda$getAffectedPartitions$1(TimelineUtils.java:99)

 at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)

 at 
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)

 at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)

 at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)

 at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)

 at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)

 at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)

 at 
org.apache.hudi.common.table.timeline.TimelineUtils.getAffectedPartitions(TimelineUtils.java:102)

 at 
org.apache.hudi.common.table.timeline.TimelineUtils.getPartitionsWritten(TimelineUtils.java:50)

 at 
org.apache.hudi.sync.common.AbstractSyncHoodieClient.getPartitionsWrittenToSince(AbstractSyncHoodieClient.java:136)

 at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145)

 at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)

 at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncHive(DeltaSync.java:589)

 at 
org.apache.hudi.integ.testsuite.helpers.HiveServiceProvider.syncToLocalHiveIfNeeded(HiveServiceProvider.java:53)

 at 
org.apache.hudi.integ.testsuite.dag.nodes.HiveSyncNode.execute(HiveSyncNode.java:41)

 at 
org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139)

 at 
org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105)

 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

 at java.util.concurrent.FutureTask.run(FutureTask.java:266)

 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

 at java.lang.Thread.run(Thread.java:748)
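
The trace points at TimelineUtils.getAffectedPartitions not recognizing the 
replacecommit action. A hedged sketch of the missing branch, written by analogy 
with the commit/deltacommit handling (the metadata shape and variable names are 
assumptions, not the actual fix):

{code:java}
} else if (HoodieTimeline.REPLACE_COMMIT_ACTION.equals(s.getAction())) {
  try {
    // Replace commits record the partitions they wrote to, like regular commits.
    HoodieReplaceCommitMetadata replaceMetadata = HoodieReplaceCommitMetadata.fromBytes(
        timeline.getInstantDetails(s).get(), HoodieReplaceCommitMetadata.class);
    return replaceMetadata.getPartitionToWriteStats().keySet().stream();
  } catch (IOException e) {
    throw new HoodieIOException("Failed to get partitions written at " + s, e);
  }
}
{code}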



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] afilipchik commented on a change in pull request #2380: [Hudi 73] Adding support for vanilla AvroKafkaSource

2021-01-05 Thread GitBox


afilipchik commented on a change in pull request #2380:
URL: https://github.com/apache/hudi/pull/2380#discussion_r552040233



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/serde/AbstractHoodieKafkaAvroDeserializer.java
##
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.serde;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import 
org.apache.hudi.utilities.serde.config.HoodieKafkaAvroDeserializationConfig;
+
+import kafka.utils.VerifiableProperties;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericDatumReader;
+import org.apache.avro.io.DatumReader;
+import org.apache.avro.io.DecoderFactory;
+import org.apache.avro.specific.SpecificDatumReader;
+import org.apache.kafka.common.errors.SerializationException;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Properties;
+
+import static 
org.apache.hudi.utilities.serde.config.HoodieKafkaAvroDeserializationConfig.SCHEMA_PROVIDER_CLASS_PROP;
+
+public class AbstractHoodieKafkaAvroDeserializer {
+
+  private final DecoderFactory decoderFactory = DecoderFactory.get();
+  private boolean useSpecificAvroReader = false;
+  private Schema sourceSchema;
+  private Schema targetSchema;
+
+  public AbstractHoodieKafkaAvroDeserializer(VerifiableProperties properties) {
+// this.sourceSchema = new 
Schema.Parser().parse(properties.props().getProperty(FilebasedSchemaProvider.Config.SOURCE_SCHEMA_PROP));
+TypedProperties typedProperties = new TypedProperties();
+copyProperties(typedProperties, properties.props());
+try {
+  SchemaProvider schemaProvider = UtilHelpers.createSchemaProvider(
+  typedProperties.getString(SCHEMA_PROVIDER_CLASS_PROP), 
typedProperties, null);
+  this.sourceSchema = 
Objects.requireNonNull(schemaProvider).getSourceSchema();
+  this.targetSchema = 
Objects.requireNonNull(schemaProvider).getTargetSchema();
+} catch (IOException e) {
+  throw new RuntimeException(e);
+}
+  }
+
+  private void copyProperties(TypedProperties typedProperties, Properties 
properties) {
+for (Map.Entry<Object, Object> entry : properties.entrySet()) {
+  typedProperties.put(entry.getKey(), entry.getValue());
+}
+  }
+
+  protected void configure(HoodieKafkaAvroDeserializationConfig config) {
+useSpecificAvroReader = config
+
.getBoolean(HoodieKafkaAvroDeserializationConfig.SPECIFIC_AVRO_READER_CONFIG);
+  }
+
+  protected Object deserialize(byte[] payload) throws SerializationException {
+return deserialize(null, null, payload, targetSchema);
+  }
+
+  /**
+   * Just like single-parameter version but accepts an Avro schema to use for 
reading.
+   *
+   * @param payload serialized data
+   * @param readerSchema schema to use for Avro read (optional, enables Avro 
projection)
+   * @return the deserialized object
+   */
+  protected Object deserialize(byte[] payload, Schema readerSchema) throws 
SerializationException {
+return deserialize(null, null, payload, readerSchema);
+  }
+
+  protected Object deserialize(String topic, Boolean isKey, byte[] payload, 
Schema readerSchema) {
+try {
+  ByteBuffer buffer = this.getByteBuffer(payload);
+  int id = buffer.getInt();

Review comment:
   This assumes the message starts with the schema version (the code looks like 
it was taken from the Confluent deserializer). It doesn't belong in 
AbstractHoodieKafkaAvroDeserializer.java. 
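
   For context, the framing being objected to is the Confluent wire format: one 
0x0 magic byte, then a 4-byte big-endian schema id, then the Avro-encoded body. 
A small sketch making that assumption explicit (illustrative, not part of this 
PR):

   ```java
   private ByteBuffer getByteBuffer(byte[] payload) {
     ByteBuffer buffer = ByteBuffer.wrap(payload);
     // Confluent wire format: byte 0 is the 0x0 magic byte, bytes 1-4 the schema id.
     if (buffer.get() != 0x0) {
       throw new SerializationException("Unknown magic byte!");
     }
     return buffer; // the caller reads buffer.getInt() for the schema id next
   }
   ```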





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #2402: [HUDI-1383] Fixing sorting of partition vals for hive sync computation

2021-01-05 Thread GitBox


nsivabalan commented on a change in pull request #2402:
URL: https://github.com/apache/hudi/pull/2402#discussion_r552046190



##
File path: 
hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java
##
@@ -21,10 +21,10 @@
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.testutils.SchemaTestUtil;
 import org.apache.hudi.common.util.Option;
-import org.apache.hudi.sync.common.AbstractSyncHoodieClient.PartitionEvent;

Review comment:
   I did the usual reformatting from within IntelliJ. I didn't explicitly 
fix any of these. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1507) Hive sync having issues w/ Clustering

2021-01-05 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259040#comment-17259040
 ] 

sivabalan narayanan commented on HUDI-1507:
---

CC : [~satish] 

> Hive sync having issues w/ Clustering
> -
>
> Key: HUDI-1507
> URL: https://issues.apache.org/jira/browse/HUDI-1507
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Storage Management
>Affects Versions: 0.7.0
>Reporter: sivabalan narayanan
>Priority: Major
>
> I was trying out clustering w/ test suite job and ran into hive sync issues.
>  
> 21/01/05 16:45:05 WARN DagNode: Executing ClusteringNode node 
> 5522853c-653b-4d92-acf4-d299c263a77f
> 21/01/05 16:45:05 WARN AbstractHoodieWriteClient: Scheduling clustering at 
> instant time :20210105164505 clustering strategy 
> org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy,
>  clustering sort cols : _row_key, target partitions for clustering :: 0, 
> inline cluster max commit : 1
> 21/01/05 16:45:05 WARN HoodieTestSuiteWriter: Clustering instant :: 
> 20210105164505
> 21/01/05 16:45:22 WARN DagScheduler: Executing node "second_hive_sync" :: 
> \{"queue_name":"adhoc","engine":"mr","name":"80325009-bb92-4df5-8c34-71bd75d001b8","config":"second_hive_sync"}
> 21/01/05 16:45:22 ERROR HiveSyncTool: Got runtime exception when hive syncing
> org.apache.hudi.exception.HoodieIOException: unknown action in timeline 
> replacecommit
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.lambda$getAffectedPartitions$1(TimelineUtils.java:99)
>  at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>  at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>  at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>  at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.getAffectedPartitions(TimelineUtils.java:102)
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.getPartitionsWritten(TimelineUtils.java:50)
>  at 
> org.apache.hudi.sync.common.AbstractSyncHoodieClient.getPartitionsWrittenToSince(AbstractSyncHoodieClient.java:136)
>  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145)
>  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncHive(DeltaSync.java:589)
>  at 
> org.apache.hudi.integ.testsuite.helpers.HiveServiceProvider.syncToLocalHiveIfNeeded(HiveServiceProvider.java:53)
>  at 
> org.apache.hudi.integ.testsuite.dag.nodes.HiveSyncNode.execute(HiveSyncNode.java:41)
>  at 
> org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139)
>  at 
> org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] lw309637554 commented on a change in pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering

2021-01-05 Thread GitBox


lw309637554 commented on a change in pull request #2379:
URL: https://github.com/apache/hudi/pull/2379#discussion_r551984532



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java
##
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import org.apache.avro.Schema;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.client.SparkRDDWriteClient;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.TableSchemaResolver;
+import org.apache.hudi.common.util.Option;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+
+public class HoodieClusteringJob {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieClusteringJob.class);
+  private final Config cfg;
+  private transient FileSystem fs;
+  private TypedProperties props;
+  private final JavaSparkContext jsc;
+
+  public HoodieClusteringJob(JavaSparkContext jsc, Config cfg) {
+this.cfg = cfg;
+this.jsc = jsc;
+this.props = cfg.propsFilePath == null
+? UtilHelpers.buildProperties(cfg.configs)
+: readConfigFromFileSystem(jsc, cfg);
+  }
+
+  private TypedProperties readConfigFromFileSystem(JavaSparkContext jsc, 
Config cfg) {
+final FileSystem fs = FSUtils.getFs(cfg.basePath, 
jsc.hadoopConfiguration());
+
+return UtilHelpers
+.readConfig(fs, new Path(cfg.propsFilePath), cfg.configs)
+.getConfig();
+  }
+
+  public static class Config implements Serializable {
+@Parameter(names = {"--base-path", "-sp"}, description = "Base path for 
the table", required = true)
+public String basePath = null;
+@Parameter(names = {"--table-name", "-tn"}, description = "Table name", 
required = true)
+public String tableName = null;
+@Parameter(names = {"--instant-time", "-it"}, description = "Clustering 
Instant time", required = true)
+public String clusteringInstantTime = null;
+@Parameter(names = {"--parallelism", "-pl"}, description = "Parallelism 
for hoodie insert", required = false)
+public int parallelism = 1;
+@Parameter(names = {"--spark-master", "-ms"}, description = "Spark 
master", required = false)
+public String sparkMaster = null;
+@Parameter(names = {"--spark-memory", "-sm"}, description = "spark memory 
to use", required = true)
+public String sparkMemory = null;
+@Parameter(names = {"--retry", "-rt"}, description = "number of retries", 
required = false)
+public int retry = 0;
+
+@Parameter(names = {"--schedule", "-sc"}, description = "Schedule 
clustering")
+public Boolean runSchedule = false;
+@Parameter(names = {"--help", "-h"}, help = true)
+public Boolean help = false;
+
+@Parameter(names = {"--props"}, description = "path to properties file on 
localfs or dfs, with configurations for "
++ "hoodie client for clustering")
+public String propsFilePath = null;
+
+@Parameter(names = {"--hoodie-conf"}, description = "Any configuration 
that can be set in the properties file "
++ "(using the CLI parameter \"--props\") can also be passed command 
line using this parameter. This can be repeated",
+splitter = IdentitySplitter.class)
+public List<String> configs = new ArrayList<>();
+  }
+
+  public static void main(String[] args) {
+final Config cfg = new Config();
+JCommander cmd = new JCommander(cfg, null, args);
+if (cfg.help || args.length == 0) {
+  cmd.usage();
+  System.exit(1);
+}
+final JavaSparkContext jsc = UtilHelpers.buildSparkContext("clustering-" + 
cfg.tableName, cfg.sparkMaster, cfg.sparkMemory);
+HoodieClusteringJob clusteringJob = new HoodieClusteringJob(jsc, cfg);

[GitHub] [hudi] lw309637554 commented on a change in pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering

2021-01-05 Thread GitBox


lw309637554 commented on a change in pull request #2379:
URL: https://github.com/apache/hudi/pull/2379#discussion_r551984722



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java
##
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import org.apache.avro.Schema;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.client.SparkRDDWriteClient;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.TableSchemaResolver;
+import org.apache.hudi.common.util.Option;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+
+public class HoodieClusteringJob {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieClusteringJob.class);
+  private final Config cfg;
+  private transient FileSystem fs;
+  private TypedProperties props;
+  private final JavaSparkContext jsc;
+
+  public HoodieClusteringJob(JavaSparkContext jsc, Config cfg) {
+this.cfg = cfg;
+this.jsc = jsc;
+this.props = cfg.propsFilePath == null
+? UtilHelpers.buildProperties(cfg.configs)
+: readConfigFromFileSystem(jsc, cfg);
+  }
+
+  private TypedProperties readConfigFromFileSystem(JavaSparkContext jsc, 
Config cfg) {
+final FileSystem fs = FSUtils.getFs(cfg.basePath, 
jsc.hadoopConfiguration());
+
+return UtilHelpers
+.readConfig(fs, new Path(cfg.propsFilePath), cfg.configs)
+.getConfig();
+  }
+
+  public static class Config implements Serializable {
+@Parameter(names = {"--base-path", "-sp"}, description = "Base path for 
the table", required = true)
+public String basePath = null;
+@Parameter(names = {"--table-name", "-tn"}, description = "Table name", 
required = true)
+public String tableName = null;
+@Parameter(names = {"--instant-time", "-it"}, description = "Clustering 
Instant time", required = true)

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on a change in pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering

2021-01-05 Thread GitBox


lw309637554 commented on a change in pull request #2379:
URL: https://github.com/apache/hudi/pull/2379#discussion_r551984380



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java
##
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import org.apache.avro.Schema;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.client.SparkRDDWriteClient;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.TableSchemaResolver;
+import org.apache.hudi.common.util.Option;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+
+public class HoodieClusteringJob {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieClusteringJob.class);
+  private final Config cfg;
+  private transient FileSystem fs;
+  private TypedProperties props;
+  private final JavaSparkContext jsc;
+
+  public HoodieClusteringJob(JavaSparkContext jsc, Config cfg) {
+this.cfg = cfg;
+this.jsc = jsc;
+this.props = cfg.propsFilePath == null
+? UtilHelpers.buildProperties(cfg.configs)
+: readConfigFromFileSystem(jsc, cfg);
+  }
+
+  private TypedProperties readConfigFromFileSystem(JavaSparkContext jsc, 
Config cfg) {
+final FileSystem fs = FSUtils.getFs(cfg.basePath, 
jsc.hadoopConfiguration());
+
+return UtilHelpers
+.readConfig(fs, new Path(cfg.propsFilePath), cfg.configs)
+.getConfig();
+  }
+
+  public static class Config implements Serializable {
+@Parameter(names = {"--base-path", "-sp"}, description = "Base path for 
the table", required = true)
+public String basePath = null;
+@Parameter(names = {"--table-name", "-tn"}, description = "Table name", 
required = true)
+public String tableName = null;
+@Parameter(names = {"--instant-time", "-it"}, description = "Clustering 
Instant time", required = true)
+public String clusteringInstantTime = null;
+@Parameter(names = {"--parallelism", "-pl"}, description = "Parallelism 
for hoodie insert", required = false)
+public int parallelism = 1;
+@Parameter(names = {"--spark-master", "-ms"}, description = "Spark 
master", required = false)
+public String sparkMaster = null;
+@Parameter(names = {"--spark-memory", "-sm"}, description = "spark memory 
to use", required = true)
+public String sparkMemory = null;
+@Parameter(names = {"--retry", "-rt"}, description = "number of retries", 
required = false)
+public int retry = 0;
+
+@Parameter(names = {"--schedule", "-sc"}, description = "Schedule 
clustering")
+public Boolean runSchedule = false;
+@Parameter(names = {"--help", "-h"}, help = true)
+public Boolean help = false;
+
+@Parameter(names = {"--props"}, description = "path to properties file on 
localfs or dfs, with configurations for "
++ "hoodie client for clustering")
+public String propsFilePath = null;
+
+@Parameter(names = {"--hoodie-conf"}, description = "Any configuration 
that can be set in the properties file "
++ "(using the CLI parameter \"--props\") can also be passed command 
line using this parameter. This can be repeated",
+splitter = IdentitySplitter.class)
+public List<String> configs = new ArrayList<>();
+  }
+
+  public static void main(String[] args) {
+final Config cfg = new Config();
+JCommander cmd = new JCommander(cfg, null, args);
+if (cfg.help || args.length == 0) {
+  cmd.usage();
+  System.exit(1);
+}
+final JavaSparkContext jsc = UtilHelpers.buildSparkContext("clustering-" + 
cfg.tableName, cfg.sparkMaster, cfg.sparkMemory);
+HoodieClusteringJob clusteringJob = new HoodieClusteringJob(jsc, cfg);

[GitHub] [hudi] yanghua commented on pull request #2405: [HUDI-1506] Fix wrong exception thrown in HoodieAvroUtils

2021-01-05 Thread GitBox


yanghua commented on pull request #2405:
URL: https://github.com/apache/hudi/pull/2405#issuecomment-754694548


   @wangxianghu Also, Travis failed; please check what's wrong...



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #2402: [HUDI-1383] Fixing sorting of partition vals for hive sync computation

2021-01-05 Thread GitBox


nsivabalan commented on a change in pull request #2402:
URL: https://github.com/apache/hudi/pull/2402#discussion_r55204



##
File path: 
hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java
##
@@ -56,7 +56,7 @@
   }
 
   private static Iterable<Object[]> useJdbcAndSchemaFromCommitMetadata() {
-return Arrays.asList(new Object[][] { { true, true }, { true, false }, { 
false, true }, { false, false } });
+return Arrays.asList(new Object[][] {{true, true}, {true, false}, {false, 
true}, {false, false}});

Review comment:
   @n3nash : same comment as above. Can you go to this file in one of your 
branches, apply reformatting, and let me know if you see similar changes? If 
not, I will look into why my reformatting differs.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on a change in pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering

2021-01-05 Thread GitBox


lw309637554 commented on a change in pull request #2379:
URL: https://github.com/apache/hudi/pull/2379#discussion_r551986565



##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java
##
@@ -682,6 +693,58 @@ public void testInlineClustering() throws Exception {
 });
   }
 
+  private HoodieClusteringJob.Config buildHoodieClusteringUtilConfig(String 
basePath,
+  String 
clusteringInstantTime, boolean runSchedule) {
+HoodieClusteringJob.Config config = new HoodieClusteringJob.Config();
+config.basePath = basePath;
+config.clusteringInstantTime = clusteringInstantTime;
+config.runSchedule = runSchedule;
+config.propsFilePath = dfsBasePath + "/clusteringjob.properties";
+return config;
+  }
+
+  @Test
+  public void testHoodieAsyncClusteringJob() throws Exception {
+String tableBasePath = dfsBasePath + "/asyncClustering";
+// Keep it higher than batch-size to test continuous mode
+int totalRecords = 3000;
+
+// Initial bulk insert
+HoodieDeltaStreamer.Config cfg = TestHelpers.makeConfig(tableBasePath, 
WriteOperationType.UPSERT);
+cfg.continuousMode = true;
+cfg.tableType = HoodieTableType.COPY_ON_WRITE.name();
+cfg.configs.add(String.format("%s=%d", 
SourceConfigs.MAX_UNIQUE_RECORDS_PROP, totalRecords));
+cfg.configs.add(String.format("%s=false", 
HoodieCompactionConfig.AUTO_CLEAN_PROP));
+cfg.configs.add(String.format("%s=true", 
HoodieClusteringConfig.ASYNC_CLUSTERING_ENABLE_OPT_KEY));
+HoodieDeltaStreamer ds = new HoodieDeltaStreamer(cfg, jsc);
+deltaStreamerTestRunner(ds, cfg, (r) -> {
+  TestHelpers.assertAtLeastNCommits(2, tableBasePath, dfs);
+  // to avoid conflicting with the delta streamer commit, just add 3600s
+  String clusterInstantTime = HoodieActiveTimeline.COMMIT_FORMATTER
+  .format(new Date(System.currentTimeMillis() + 3600 * 1000));
+  LOG.info("Cluster instant time " + clusterInstantTime);
+  HoodieClusteringJob.Config scheduleClusteringConfig = 
buildHoodieClusteringUtilConfig(tableBasePath,
+  clusterInstantTime, true);
+  HoodieClusteringJob scheduleClusteringJob = new HoodieClusteringJob(jsc, 
scheduleClusteringConfig);
+  int scheduleClusteringResult = 
scheduleClusteringJob.cluster(scheduleClusteringConfig.retry);
+  if (scheduleClusteringResult == 0) {
+LOG.info("Schedule clustering success, now cluster");
+HoodieClusteringJob.Config clusterClusteringConfig = 
buildHoodieClusteringUtilConfig(tableBasePath,
+clusterInstantTime, false);
+HoodieClusteringJob clusterClusteringJob = new 
HoodieClusteringJob(jsc, clusterClusteringConfig);
+clusterClusteringJob.cluster(clusterClusteringConfig.retry);
+LOG.info("Cluster success");
+  } else {
+LOG.warn("Schedule clustering failed");
+  }
+  HoodieTableMetaClient metaClient =  new 
HoodieTableMetaClient(this.dfs.getConf(), tableBasePath, true);
+  int pendingReplaceSize = 
metaClient.getActiveTimeline().filterPendingReplaceTimeline().getInstants().toArray().length;
+  int completeReplaceSize = 
metaClient.getActiveTimeline().getCompletedReplaceTimeline().getInstants().toArray().length;
+  System.out.println("PendingReplaceSize=" + pendingReplaceSize + 
",completeReplaceSize = " + completeReplaceSize);
+  return completeReplaceSize > 0;

Review comment:
   Because if completeReplaceSize is always <= 0, the runner will throw a 
timeout exception. Now I have added an assert for completeReplaceSize == 1. As 
the unit test mainly tests the async clustering schedule-and-cluster flow, 
asserting on completeReplaceSize should be enough; record-level checks are 
covered by the clustering unit tests. 
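
   (Sketch of the added assertion, assuming JUnit's assertEquals as used elsewhere in this test class: `assertEquals(1, completeReplaceSize);`)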





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on pull request #2404: [MINOR] Add Jira URL and Mailing List

2021-01-05 Thread GitBox


yanghua commented on pull request #2404:
URL: https://github.com/apache/hudi/pull/2404#issuecomment-754682021


   @vinothchandar Do you agree with this change?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] afilipchik commented on pull request #2380: [Hudi 73] Adding support for vanilla AvroKafkaSource

2021-01-05 Thread GitBox


afilipchik commented on pull request #2380:
URL: https://github.com/apache/hudi/pull/2380#issuecomment-754744538


   On making AbstractHoodieKafkaAvroDeserializer abstract - it looks like a 
modified Confluent deserializer, so I believe it should be named like that. 
If we want to support the Confluent schema registry we need to use the schema 
id to acquire the writer's schema; otherwise schema evolution will be a pain.
   
   I.e., to deserialize an Avro message we need two things: the schema it was 
written with (it can come from properties, but with Confluent the id at the 
beginning of the message tells you the exact version and can be used to fetch 
it from the schema registry) and the reader schema (the schema on the reader 
side, which comes from the schema provider).
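
   For illustration, a minimal sketch of that two-schema decode, assuming the 5-byte Confluent wire framing (one magic byte plus a 4-byte schema id) and a hypothetical fetchWriterSchema registry lookup:

       import java.io.IOException;
       import java.nio.ByteBuffer;
       import org.apache.avro.Schema;
       import org.apache.avro.generic.GenericDatumReader;
       import org.apache.avro.generic.GenericRecord;
       import org.apache.avro.io.BinaryDecoder;
       import org.apache.avro.io.DecoderFactory;

       public class TwoSchemaAvroDecode {
         // Hypothetical lookup; for Confluent this would query the schema registry by id.
         static Schema fetchWriterSchema(int schemaId) {
           throw new UnsupportedOperationException("schema registry lookup goes here");
         }

         static GenericRecord decode(byte[] message, Schema readerSchema) throws IOException {
           ByteBuffer buf = ByteBuffer.wrap(message);
           if (buf.get() != 0x0) { // Confluent magic byte
             throw new IllegalArgumentException("not a Confluent-framed message");
           }
           // The 4-byte id identifies the exact writer schema version.
           Schema writerSchema = fetchWriterSchema(buf.getInt());
           BinaryDecoder decoder = DecoderFactory.get()
               .binaryDecoder(message, buf.position(), message.length - buf.position(), null);
           // Avro resolves writerSchema -> readerSchema, which is what makes evolution safe.
           return new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, decoder);
         }
       }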



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1508) Partition update with global index in MOR tables resulting in duplicate values during read optimized queries

2021-01-05 Thread Ryan Pifer (Jira)
Ryan Pifer created HUDI-1508:


 Summary: Partition update with global index in MOR tables 
resulting in duplicate values during read optimized queries
 Key: HUDI-1508
 URL: https://issues.apache.org/jira/browse/HUDI-1508
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Ryan Pifer


The way Hudi handles updating the partition path is by locating the existing 
record, performing a delete in the previous partition, and performing an insert 
in the new partition. In the case of Merge-on-Read tables the delete operation, 
like any update operation, is added as a log file. However, since the insert 
occurs in the new partition, the record is added in a parquet file. Querying 
using `QUERY_TYPE_READ_OPTIMIZED_OPT_VAL` fetches only parquet files, and now 
we have a case where two records for a given primary key are present.
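
A minimal repro sketch under these assumptions (table name, paths, and field 
names are hypothetical; options as in the Spark datasource writer):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class PartitionUpdateRepro {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("hudi-1508-repro").getOrCreate();

        // Updates carry the same record keys but a new partition-path value.
        Dataset<Row> updates = spark.read().parquet("/tmp/updates");
        updates.write().format("hudi")
            .option("hoodie.table.name", "repro_mor")
            .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
            .option("hoodie.datasource.write.recordkey.field", "id")
            .option("hoodie.datasource.write.precombine.field", "ts")
            .option("hoodie.datasource.write.partitionpath.field", "dt")
            .option("hoodie.index.type", "GLOBAL_BLOOM")
            // delete in the old partition (a log file) + insert in the new one (a parquet file)
            .option("hoodie.bloom.index.update.partition.path", "true")
            .mode(SaveMode.Append)
            .save("/tmp/repro_mor");

        // The read-optimized view reads base parquet files only, so the log-based
        // delete in the old partition is not applied and the key shows up twice.
        Dataset<Row> ro = spark.read().format("hudi")
            .option("hoodie.datasource.query.type", "read_optimized")
            .load("/tmp/repro_mor/*/*");
        ro.groupBy("id").count().filter("count > 1").show();
      }
    }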



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1507) Hive sync having issues w/ Clustering

2021-01-05 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish reassigned HUDI-1507:


Assignee: satish

> Hive sync having issues w/ Clustering
> -
>
> Key: HUDI-1507
> URL: https://issues.apache.org/jira/browse/HUDI-1507
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Storage Management
>Affects Versions: 0.7.0
>Reporter: sivabalan narayanan
>Assignee: satish
>Priority: Major
>
> I was trying out clustering w/ test suite job and ran into hive sync issues.
>  
> 21/01/05 16:45:05 WARN DagNode: Executing ClusteringNode node 
> 5522853c-653b-4d92-acf4-d299c263a77f
> 21/01/05 16:45:05 WARN AbstractHoodieWriteClient: Scheduling clustering at 
> instant time :20210105164505 clustering strategy 
> org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy,
>  clustering sort cols : _row_key, target partitions for clustering :: 0, 
> inline cluster max commit : 1
> 21/01/05 16:45:05 WARN HoodieTestSuiteWriter: Clustering instant :: 
> 20210105164505
> 21/01/05 16:45:22 WARN DagScheduler: Executing node "second_hive_sync" :: 
> \{"queue_name":"adhoc","engine":"mr","name":"80325009-bb92-4df5-8c34-71bd75d001b8","config":"second_hive_sync"}
> 21/01/05 16:45:22 ERROR HiveSyncTool: Got runtime exception when hive syncing
> org.apache.hudi.exception.HoodieIOException: unknown action in timeline 
> replacecommit
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.lambda$getAffectedPartitions$1(TimelineUtils.java:99)
>  at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>  at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>  at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>  at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.getAffectedPartitions(TimelineUtils.java:102)
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.getPartitionsWritten(TimelineUtils.java:50)
>  at 
> org.apache.hudi.sync.common.AbstractSyncHoodieClient.getPartitionsWrittenToSince(AbstractSyncHoodieClient.java:136)
>  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145)
>  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncHive(DeltaSync.java:589)
>  at 
> org.apache.hudi.integ.testsuite.helpers.HiveServiceProvider.syncToLocalHiveIfNeeded(HiveServiceProvider.java:53)
>  at 
> org.apache.hudi.integ.testsuite.dag.nodes.HiveSyncNode.execute(HiveSyncNode.java:41)
>  at 
> org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139)
>  at 
> org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1399) support an independent clustering spark job for asynchronous clustering

2021-01-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1399:
-
Status: Patch Available  (was: In Progress)

> support an independent clustering spark job for asynchronous clustering 
> 
>
> Key: HUDI-1399
> URL: https://issues.apache.org/jira/browse/HUDI-1399
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] codecov-io edited a comment on pull request #2400: Some fixes and enhancements to test suite framework

2021-01-05 Thread GitBox


codecov-io edited a comment on pull request #2400:
URL: https://github.com/apache/hudi/pull/2400#issuecomment-753557036


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2400?src=pr=h1) Report
   > Merging 
[#2400](https://codecov.io/gh/apache/hudi/pull/2400?src=pr=desc) (ab40bd6) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/698694a1571cdcc9848fc79aa34c8cbbf9662bc4?el=desc)
 (698694a) will **decrease** coverage by `40.20%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2400/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2400?src=pr=tree)
   
   ```diff
   @@  Coverage Diff  @@
   ## master#2400   +/-   ##
   =
   - Coverage 50.23%   10.03%   -40.21% 
   + Complexity 2985   48 -2937 
   =
 Files   410   52  -358 
 Lines 18398 1854-16544 
 Branches   1884  224 -1660 
   =
   - Hits   9242  186 -9056 
   + Misses 8398 1655 -6743 
   + Partials758   13  -745 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `10.03% <0.00%> (-59.63%)` | `0.00 <0.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2400?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2400/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `0.00% <0.00%> (-70.76%)` | `0.00 <0.00> (-49.00)` | |
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2400/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2400/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2400/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2400/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2400/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2400/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2400/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2400/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2400/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> 

[jira] [Updated] (HUDI-1507) Hive sync having issues w/ Clustering

2021-01-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1507:
-
Labels: pull-request-available  (was: )

> Hive sync having issues w/ Clustering
> -
>
> Key: HUDI-1507
> URL: https://issues.apache.org/jira/browse/HUDI-1507
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Storage Management
>Affects Versions: 0.7.0
>Reporter: sivabalan narayanan
>Assignee: satish
>Priority: Major
>  Labels: pull-request-available
>
> I was trying out clustering w/ test suite job and ran into hive sync issues.
>  
> 21/01/05 16:45:05 WARN DagNode: Executing ClusteringNode node 
> 5522853c-653b-4d92-acf4-d299c263a77f
> 21/01/05 16:45:05 WARN AbstractHoodieWriteClient: Scheduling clustering at 
> instant time :20210105164505 clustering strategy 
> org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy,
>  clustering sort cols : _row_key, target partitions for clustering :: 0, 
> inline cluster max commit : 1
> 21/01/05 16:45:05 WARN HoodieTestSuiteWriter: Clustering instant :: 
> 20210105164505
> 21/01/05 16:45:22 WARN DagScheduler: Executing node "second_hive_sync" :: 
> \{"queue_name":"adhoc","engine":"mr","name":"80325009-bb92-4df5-8c34-71bd75d001b8","config":"second_hive_sync"}
> 21/01/05 16:45:22 ERROR HiveSyncTool: Got runtime exception when hive syncing
> org.apache.hudi.exception.HoodieIOException: unknown action in timeline 
> replacecommit
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.lambda$getAffectedPartitions$1(TimelineUtils.java:99)
>  at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>  at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>  at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>  at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.getAffectedPartitions(TimelineUtils.java:102)
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.getPartitionsWritten(TimelineUtils.java:50)
>  at 
> org.apache.hudi.sync.common.AbstractSyncHoodieClient.getPartitionsWrittenToSince(AbstractSyncHoodieClient.java:136)
>  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145)
>  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncHive(DeltaSync.java:589)
>  at 
> org.apache.hudi.integ.testsuite.helpers.HiveServiceProvider.syncToLocalHiveIfNeeded(HiveServiceProvider.java:53)
>  at 
> org.apache.hudi.integ.testsuite.dag.nodes.HiveSyncNode.execute(HiveSyncNode.java:41)
>  at 
> org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139)
>  at 
> org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] satishkotha opened a new pull request #2407: [HUDI-1507] Change timeline utils to support reading replacecommit

2021-01-05 Thread GitBox


satishkotha opened a new pull request #2407:
URL: https://github.com/apache/hudi/pull/2407


   ## What is the purpose of the pull request

   Change timeline utils to support reading replacecommit metadata
   
   ## Brief change log
   
   HiveSync uses TimelineUtils to get modified partitions. Add support for 
replacecommit in TimelineUtils.
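
   A rough sketch of the shape of the change, as a standalone helper (timeline 
constants as in hudi-common; treating replacecommit like the other commit 
actions for partition extraction is the assumption here):

       import org.apache.hudi.common.table.timeline.HoodieInstant;
       import org.apache.hudi.common.table.timeline.HoodieTimeline;

       final class ReplaceCommitAware {
         // True for every action whose commit metadata records written partitions,
         // including replacecommit, so callers stop hitting "unknown action in timeline".
         static boolean hasPartitionMetadata(HoodieInstant instant) {
           String action = instant.getAction();
           return HoodieTimeline.COMMIT_ACTION.equals(action)
               || HoodieTimeline.DELTA_COMMIT_ACTION.equals(action)
               || HoodieTimeline.REPLACE_COMMIT_ACTION.equals(action);
         }
       }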
   
   ## Verify this pull request
   This change added tests
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] WTa-hash commented on issue #2229: [SUPPORT] UpsertPartitioner performance

2021-01-05 Thread GitBox


WTa-hash commented on issue #2229:
URL: https://github.com/apache/hudi/issues/2229#issuecomment-754894794


   @bvaradar - I would like to understand a little bit more about what's going 
on here with the spark stage "Getting small files from partitions" from the 
screenshot.
   
   
![img](https://user-images.githubusercontent.com/64644025/103697708-e956c080-4f65-11eb-9fdf-1e36d2165d5e.PNG)
   
   In the executor logs, I see the following:
   `2021-01-05 20:26:52,176 INFO [dispatcher-event-loop-1] 
org.apache.spark.executor.CoarseGrainedExecutorBackend:Got assigned task 4686
   2021-01-05 20:26:52,176 INFO [Executor task launch worker for task 4686] 
org.apache.spark.executor.Executor:Running task 0.0 in stage 701.0 (TID 4686)
   2021-01-05 20:26:52,176 INFO [Executor task launch worker for task 4686] 
org.apache.spark.MapOutputTrackerWorker:Updating epoch to 202 and clearing cache
   2021-01-05 20:26:52,177 INFO [Executor task launch worker for task 4686] 
org.apache.spark.broadcast.TorrentBroadcast:Started reading broadcast variable 
502
   2021-01-05 20:26:52,178 INFO [Executor task launch worker for task 4686] 
org.apache.spark.storage.memory.MemoryStore:Block broadcast_502_piece0 stored 
as bytes in memory (estimated size 196.8 KB, free 4.3 GB)
   2021-01-05 20:26:52,178 INFO [Executor task launch worker for task 4686] 
org.apache.spark.broadcast.TorrentBroadcast:Reading broadcast variable 502 took 
1 ms
   2021-01-05 20:26:52,180 INFO [Executor task launch worker for task 4686] 
org.apache.spark.storage.memory.MemoryStore:Block broadcast_502 stored as 
values in memory (estimated size 637.1 KB, free 4.3 GB)
   2021-01-05 20:26:52,198 INFO [Executor task launch worker for task 4686] 
org.apache.spark.MapOutputTrackerWorker:Don't have map outputs for shuffle 201, 
fetching them
   2021-01-05 20:26:52,198 INFO [Executor task launch worker for task 4686] 
org.apache.spark.MapOutputTrackerWorker:Doing the fetch; tracker endpoint = 
NettyRpcEndpointRef(spark://MapOutputTracker@ip-xxx-xx-xxx-xxx:35039)
   2021-01-05 20:26:52,199 INFO [Executor task launch worker for task 4686] 
org.apache.spark.MapOutputTrackerWorker:Got the output locations
   2021-01-05 20:26:52,199 INFO [Executor task launch worker for task 4686] 
org.apache.spark.storage.ShuffleBlockFetcherIterator:Getting 18 non-empty 
blocks including 6 local blocks and 12 remote blocks
   2021-01-05 20:26:52,199 INFO [Executor task launch worker for task 4686] 
org.apache.spark.storage.ShuffleBlockFetcherIterator:Started 2 remote fetches 
in 0 ms
   2021-01-05 20:26:53,287 INFO [Executor task launch worker for task 4686] 
com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream:close closed:false 
s3://...table/.hoodie/.temp/20210105202638/1900-01-01/6ac65ea6-5378-4022-9a54-dfda75d6b53d-0_0-701-4686_20210105202638.parquet.marker.MERGE
   2021-01-05 20:26:53,384 INFO [pool-22-thread-1] 
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem:Opening 
's3://...table/1900-01-01/6ac65ea6-5378-4022-9a54-dfda75d6b53d-0_0-659-4407_20210105202508.parquet'
 for reading
   2021-01-05 20:27:40,841 INFO [Executor task launch worker for task 4686] 
com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream:close closed:false 
s3://...table/1900-01-01/6ac65ea6-5378-4022-9a54-dfda75d6b53d-0_0-701-4686_20210105202638.parquet
   2021-01-05 20:27:41,708 INFO [Executor task launch worker for task 4686] 
org.apache.spark.storage.memory.MemoryStore:Block rdd_2026_0 stored as values 
in memory (estimated size 390.0 B, free 4.3 GB)
   2021-01-05 20:27:41,713 INFO [Executor task launch worker for task 4686] 
org.apache.spark.executor.Executor:Finished task 0.0 in stage 701.0 (TID 4686). 
3169 bytes result sent to driver`
   
   Does this mean the 50 seconds for this task is used to create a SINGLE new 
parquet file with the new data using the "small parquet file" as its base? If 
my thoughts are correct here, is there a configuration to split the records 
evenly into each executor so that you can have multiple writes occurring in 
parallel?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on a change in pull request #2379: [HUDI-1399] support an independent clustering spark job for asynchronous clustering

2021-01-05 Thread GitBox


satishkotha commented on a change in pull request #2379:
URL: https://github.com/apache/hudi/pull/2379#discussion_r552133598



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java
##
@@ -109,6 +111,9 @@ public static void main(String[] args) {
 if (result == -1) {
   LOG.error(resultMsg + " failed");
 } else {
+  if (cfg.runSchedule) {
+System.out.println("The schedule instant time is " + 
cfg.clusteringInstantTime);

Review comment:
   Can you change this to a LOG message?
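
   e.g. (a sketch of the suggested change): `LOG.info("The schedule instant time is " + cfg.clusteringInstantTime);`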

##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java
##
@@ -153,7 +161,12 @@ private int doSchedule(JavaSparkContext jsc) throws 
Exception {
 String schemaStr = getSchemaFromLatestInstant();
 SparkRDDWriteClient client =
 UtilHelpers.createHoodieClient(jsc, cfg.basePath, schemaStr, 
cfg.parallelism, Option.empty(), props);
-return client.scheduleClusteringAtInstant(cfg.clusteringInstantTime, 
Option.empty()) ? 0 : -1;
+Option<String> instantTime = client.scheduleClustering(Option.empty());
+if (instantTime.isPresent()) {
+  cfg.clusteringInstantTime = instantTime.get();

Review comment:
   changing the config at this stage seems a little awkward. Do you think it's 
better to return Option<String> instantTime from this method? Or maybe just add 
the log line here with instantTime?
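
   A sketch of that alternative, assuming Hudi's Option.ifPresent:

       Option<String> instantTime = client.scheduleClustering(Option.empty());
       instantTime.ifPresent(t -> LOG.info("The schedule instant time is " + t));
       return instantTime.isPresent() ? 0 : -1;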





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1509) Major performance degradation due to rewriting records with default values

2021-01-05 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1509:
-
Description: 
During the in-house testing for the 0.5x to 0.6x release upgrade, I have detected a 
performance degradation for writes into HUDI. I have traced the issue to the 
changes in the following commit

[[HUDI-727]: Copy default values of fields if not present when rewriting 
incoming record with new 
schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253]

I wrote a unit test to reduce the scope of testing as follows:
# Take an existing parquet file from a production dataset (size=690MB, #records=960K)
# Read all the records from this parquet into a JavaRDD
# Time the call HoodieWriteClient.bulkInsertPrepped(). (bulkInsertParallelism=1)

The above scenario is directly taken from our production pipelines, where each 
executor ingests about a million records, creating a single parquet file in a 
COW dataset. This is a bulk-insert-only dataset.

The time to complete the prepped bulk insert *decreased from 680 seconds to 
380 seconds* when I reverted the above commit.


Schema details: This HUDI dataset uses a large schema with 51 fields in the 
record.

  was:
During the in-house testing for 0.5x to 0.6x release upgrade, I have detected a 
performance degradation for writes into HUDI. I have traced the issue due to 
the changes in the following commit

[[HUDI-727]: Copy default values of fields if not present when rewriting 
incoming record with new 
schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253]

I wrote a unit test to reduce the scope of testing as follows:
# Take an existing parquet file from production dataset (size=690MB, 
#records=960K)
# Read all the records from this parquet into a JavaRDD
# Time the call HoodieWriteClient.bulkInsertPrepped(). (bulkInsertParallelism=1)

The above scenario is directly taken from our production pipelines where each 
executor will ingest about a million record creating a single parquet file in a 
COW dataset. This is bulk insert only dataset.

The time to complete the bulk insert prepped *decreased from 680seconds to 
380seconds* when I reverted the above commit. 



> Major performance degradation due to rewriting records with default values
> --
>
> Key: HUDI-1509
> URL: https://issues.apache.org/jira/browse/HUDI-1509
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Prashant Wason
>Priority: Blocker
>
> During the in-house testing for 0.5x to 0.6x release upgrade, I have detected 
> a performance degradation for writes into HUDI. I have traced the issue due 
> to the changes in the following commit
> [[HUDI-727]: Copy default values of fields if not present when rewriting 
> incoming record with new 
> schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253]
> I wrote a unit test to reduce the scope of testing as follows:
> # Take an existing parquet file from production dataset (size=690MB, 
> #records=960K)
> # Read all the records from this parquet into a JavaRDD
> # Time the call HoodieWriteClient.bulkInsertPrepped(). 
> (bulkInsertParallelism=1)
> The above scenario is directly taken from our production pipelines where each 
> executor will ingest about a million record creating a single parquet file in 
> a COW dataset. This is bulk insert only dataset.
> The time to complete the bulk insert prepped *decreased from 680seconds to 
> 380seconds* when I reverted the above commit. 
> Schema details: This HUDI dataset uses a large schema with 51 fields in the 
> record.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1509) Major performance degradation due to rewriting records with default values

2021-01-05 Thread Prashant Wason (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259323#comment-17259323
 ] 

Prashant Wason commented on HUDI-1509:
--

I timed the various code fragments involved in the above commit. The timings 
are as follows:

  private static LinkedHashSet<Schema.Field> getCombinedFieldsToWrite(Schema oldSchema, Schema newSchema) {
    LinkedHashSet<Schema.Field> allFields = new LinkedHashSet<>(oldSchema.getFields()); // 75usec average for this line

    // 200usec average for the lines below
    for (Schema.Field f : newSchema.getFields()) {
      if (!allFields.contains(f) && !isMetadataField(f.name())) {
        allFields.add(f);
      }
    }
    return allFields;
  }

  private static GenericRecord rewrite(GenericRecord record, LinkedHashSet<Schema.Field> fieldsToWrite, Schema newSchema) {
    GenericRecord newRecord = new GenericData.Record(newSchema);
    for (Schema.Field f : fieldsToWrite) {
      if (record.get(f.name()) == null) {
        if (f.defaultVal() instanceof JsonProperties.Null) {
          newRecord.put(f.name(), null);
        } else {
          newRecord.put(f.name(), f.defaultVal());
        }
      } else {
        newRecord.put(f.name(), record.get(f.name()));
      }
    }
    // 3usec for the code above

    // 75usec for the code below
    if (!GenericData.get().validate(newSchema, newRecord)) {
      throw new SchemaCompatibilityException(
          "Unable to validate the rewritten record " + record + " against schema " + newSchema);
    }
    return newRecord;
  }
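
(For context, a minimal sketch of how per-fragment averages like these can be 
collected; 'records', 'writerSchema', and 'newSchema' are hypothetical 
stand-ins for the test inputs:)

    // Hypothetical micro-timing harness around the rewrite path; 'records' would
    // come from the production parquet file mentioned in the description.
    long rewriteNanos = 0L;
    int n = 0;
    for (GenericRecord record : records) {
      long start = System.nanoTime();
      rewrite(record, getCombinedFieldsToWrite(writerSchema, newSchema), newSchema);
      rewriteNanos += System.nanoTime() - start;
      n++;
    }
    System.out.printf("avg rewrite cost: %.1f usec%n", rewriteNanos / 1000.0 / n);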

> Major performance degradation due to rewriting records with default values
> --
>
> Key: HUDI-1509
> URL: https://issues.apache.org/jira/browse/HUDI-1509
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Prashant Wason
>Priority: Blocker
>
> During the in-house testing for 0.5x to 0.6x release upgrade, I have detected 
> a performance degradation for writes into HUDI. I have traced the issue due 
> to the changes in the following commit
> [[HUDI-727]: Copy default values of fields if not present when rewriting 
> incoming record with new 
> schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253]
> I wrote a unit test to reduce the scope of testing as follows:
> # Take an existing parquet file from production dataset (size=690MB, 
> #records=960K)
> # Read all the records from this parquet into a JavaRDD
> # Time the call HoodieWriteClient.bulkInsertPrepped(). 
> (bulkInsertParallelism=1)
> The above scenario is directly taken from our production pipelines where each 
> executor will ingest about a million record creating a single parquet file in 
> a COW dataset. This is bulk insert only dataset.
> The time to complete the bulk insert prepped *decreased from 680seconds to 
> 380seconds* when I reverted the above commit. 
> Schema details: This HUDI dataset uses a large schema with 51 fields in the 
> record.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1509) Major performance degradation due to rewriting records with default values

2021-01-05 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-1509:


 Summary: Major performance degradation due to rewriting records 
with default values
 Key: HUDI-1509
 URL: https://issues.apache.org/jira/browse/HUDI-1509
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Prashant Wason


During the in-house testing for the 0.5x to 0.6x release upgrade, I have detected a 
performance degradation for writes into HUDI. I have traced the issue to the 
changes in the following commit

[[HUDI-727]: Copy default values of fields if not present when rewriting 
incoming record with new 
schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253]

I wrote a unit test to reduce the scope of testing as follows:
# Take an existing parquet file from a production dataset (size=690MB, #records=960K)
# Read all the records from this parquet into a JavaRDD
# Time the call HoodieWriteClient.bulkInsertPrepped(). (bulkInsertParallelism=1)

The above scenario is directly taken from our production pipelines, where each 
executor ingests about a million records, creating a single parquet file in a 
COW dataset. This is a bulk-insert-only dataset.

The time to complete the prepped bulk insert *decreased from 680 seconds to 
380 seconds* when I reverted the above commit.
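
For reference, a minimal sketch of the timed portion of such a test; the 
client setup, the record source, and the exact bulk-insert-prepped entry point 
are assumptions here:

    // Hypothetical benchmark body: time one prepped bulk insert at parallelism 1.
    // 'client' is a configured HoodieWriteClient and 'records' holds the ~960K
    // HoodieRecords read from the production parquet file.
    String instantTime = client.startCommit();
    long start = System.nanoTime();
    client.bulkInsertPreppedRecords(records, instantTime, Option.empty()).count(); // force the write
    System.out.println("bulkInsertPrepped took " + (System.nanoTime() - start) / 1e9 + " s");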




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] codecov-io commented on pull request #2407: [HUDI-1507] Change timeline utils to support reading replacecommit

2021-01-05 Thread GitBox


codecov-io commented on pull request #2407:
URL: https://github.com/apache/hudi/pull/2407#issuecomment-754918776


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2407?src=pr=h1) Report
   > Merging 
[#2407](https://codecov.io/gh/apache/hudi/pull/2407?src=pr=desc) (88ff431) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/698694a1571cdcc9848fc79aa34c8cbbf9662bc4?el=desc)
 (698694a) will **decrease** coverage by `40.19%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2407/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2407?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2407      +/-   ##
   =============================================
   - Coverage     50.23%   10.04%   -40.20%
   + Complexity     2985       48     -2937
   =============================================
     Files           410       52      -358
     Lines         18398     1852    -16546
     Branches       1884      223     -1661
   =============================================
   - Hits           9242      186     -9056
   + Misses         8398     1653     -6745
   + Partials        758       13      -745
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `10.04% <ø> (-59.62%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2407?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2407/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 

[jira] [Updated] (HUDI-1507) Hive sync having issues w/ Clustering

2021-01-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1507:
-
Labels: pull-request-available release-blocker  (was: release-blocker)

> Hive sync having issues w/ Clustering
> -
>
> Key: HUDI-1507
> URL: https://issues.apache.org/jira/browse/HUDI-1507
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Storage Management
>Affects Versions: 0.7.0
>Reporter: sivabalan narayanan
>Assignee: satish
>Priority: Major
>  Labels: pull-request-available, release-blocker
>
> I was trying out clustering w/ test suite job and ran into hive sync issues.
>  
> 21/01/05 16:45:05 WARN DagNode: Executing ClusteringNode node 
> 5522853c-653b-4d92-acf4-d299c263a77f
> 21/01/05 16:45:05 WARN AbstractHoodieWriteClient: Scheduling clustering at 
> instant time :20210105164505 clustering strategy 
> org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy,
>  clustering sort cols : _row_key, target partitions for clustering :: 0, 
> inline cluster max commit : 1
> 21/01/05 16:45:05 WARN HoodieTestSuiteWriter: Clustering instant :: 
> 20210105164505
> 21/01/05 16:45:22 WARN DagScheduler: Executing node "second_hive_sync" :: 
> \{"queue_name":"adhoc","engine":"mr","name":"80325009-bb92-4df5-8c34-71bd75d001b8","config":"second_hive_sync"}
> 21/01/05 16:45:22 ERROR HiveSyncTool: Got runtime exception when hive syncing
> org.apache.hudi.exception.HoodieIOException: unknown action in timeline 
> replacecommit
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.lambda$getAffectedPartitions$1(TimelineUtils.java:99)
>  at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>  at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>  at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>  at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.getAffectedPartitions(TimelineUtils.java:102)
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.getPartitionsWritten(TimelineUtils.java:50)
>  at 
> org.apache.hudi.sync.common.AbstractSyncHoodieClient.getPartitionsWrittenToSince(AbstractSyncHoodieClient.java:136)
>  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145)
>  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncHive(DeltaSync.java:589)
>  at 
> org.apache.hudi.integ.testsuite.helpers.HiveServiceProvider.syncToLocalHiveIfNeeded(HiveServiceProvider.java:53)
>  at 
> org.apache.hudi.integ.testsuite.dag.nodes.HiveSyncNode.execute(HiveSyncNode.java:41)
>  at 
> org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139)
>  at 
> org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
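
For illustration, a minimal sketch of the replacecommit handling that would stop getAffectedPartitions from throwing here. The metadata class and accessors mirror Hudi's clustering metadata, but treat the exact shape as an assumption rather than the fix that was merged:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.table.timeline.HoodieTimeline;
import org.apache.hudi.exception.HoodieIOException;

class ReplaceCommitPartitions {
  // Partitions affected by a replacecommit = partitions receiving new writes
  // plus partitions whose existing file groups were replaced (e.g. by clustering).
  static Set<String> affectedPartitions(HoodieTimeline timeline, HoodieInstant instant) {
    try {
      HoodieReplaceCommitMetadata metadata = HoodieReplaceCommitMetadata.fromBytes(
          timeline.getInstantDetails(instant).get(), HoodieReplaceCommitMetadata.class);
      Set<String> partitions = new HashSet<>(metadata.getPartitionToWriteStats().keySet());
      partitions.addAll(metadata.getPartitionToReplaceFileIds().keySet());
      return partitions;
    } catch (IOException e) {
      throw new HoodieIOException("Failed to parse replacecommit metadata for " + instant, e);
    }
  }
}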



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1507) Hive sync having issues w/ Clustering

2021-01-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1507:
--
Labels: release-blocker  (was: pull-request-available)

> Hive sync having issues w/ Clustering
> -
>
> Key: HUDI-1507
> URL: https://issues.apache.org/jira/browse/HUDI-1507
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Storage Management
>Affects Versions: 0.7.0
>Reporter: sivabalan narayanan
>Assignee: satish
>Priority: Major
>  Labels: release-blocker
>
> I was trying out clustering w/ test suite job and ran into hive sync issues.
>  
> 21/01/05 16:45:05 WARN DagNode: Executing ClusteringNode node 
> 5522853c-653b-4d92-acf4-d299c263a77f
> 21/01/05 16:45:05 WARN AbstractHoodieWriteClient: Scheduling clustering at 
> instant time :20210105164505 clustering strategy 
> org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy,
>  clustering sort cols : _row_key, target partitions for clustering :: 0, 
> inline cluster max commit : 1
> 21/01/05 16:45:05 WARN HoodieTestSuiteWriter: Clustering instant :: 
> 20210105164505
> 21/01/05 16:45:22 WARN DagScheduler: Executing node "second_hive_sync" :: 
> \{"queue_name":"adhoc","engine":"mr","name":"80325009-bb92-4df5-8c34-71bd75d001b8","config":"second_hive_sync"}
> 21/01/05 16:45:22 ERROR HiveSyncTool: Got runtime exception when hive syncing
> org.apache.hudi.exception.HoodieIOException: unknown action in timeline 
> replacecommit
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.lambda$getAffectedPartitions$1(TimelineUtils.java:99)
>  at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>  at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>  at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>  at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.getAffectedPartitions(TimelineUtils.java:102)
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.getPartitionsWritten(TimelineUtils.java:50)
>  at 
> org.apache.hudi.sync.common.AbstractSyncHoodieClient.getPartitionsWrittenToSince(AbstractSyncHoodieClient.java:136)
>  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145)
>  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncHive(DeltaSync.java:589)
>  at 
> org.apache.hudi.integ.testsuite.helpers.HiveServiceProvider.syncToLocalHiveIfNeeded(HiveServiceProvider.java:53)
>  at 
> org.apache.hudi.integ.testsuite.dag.nodes.HiveSyncNode.execute(HiveSyncNode.java:41)
>  at 
> org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139)
>  at 
> org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] jtmzheng opened a new issue #2408: [SUPPORT] OutOfMemory on upserting into MOR dataset

2021-01-05 Thread GitBox


jtmzheng opened a new issue #2408:
URL: https://github.com/apache/hudi/issues/2408


   **Describe the problem you faced**
   
   We have a Spark Streaming application running on EMR 5.31.0 that reads from 
a Kinesis stream (batch interval of 30 minutes) and upserts to a MOR dataset 
that is partitioned by date. This dataset is ~ 2.2 TB in size and ~ 6 billion 
records. Our problem is the application is now persistently crashing on an 
OutOfMemory error regardless of the batch input size (stacktrace attached below 
is for an input of ~ 1 million records and size ~ 250 mb). For debugging we've 
tried replacing the Hudi upsert with a simple count and afterwards there seems 
to be minimal memory usage by the application based on the Spark UI.
   
   ```
   df.write.format("org.apache.hudi").options(**hudi_options).mode("append").save(OUTPUT_PATH)
   ```
   
   This seems similar to 
https://github.com/apache/hudi/issues/1491#issuecomment-610626104 since it 
seems to be running out of memory on reading old records based on the 
stacktrace. There are a lot of large log files in the dataset:
   
   ```
   ...
   2021-01-01 00:28:08  910487106 
hudi/production/transactions_v9/2020/12/30/.0ef05182-e0ad-44b5-b52d-0412a6b44a01-1_20210101033018.log.1_3774-34-62086
   2021-01-01 21:03:26  910490317 
hudi/production/transactions_v9/2020/12/30/.0ef05182-e0ad-44b5-b52d-0412a6b44a01-1_20210101033018.log.11_3774-34-62109
   2021-01-01 16:52:39  910495970 
hudi/production/transactions_v9/2020/12/30/.0ef05182-e0ad-44b5-b52d-0412a6b44a01-1_20210101033018.log.9_3774-34-62083
   ```
   
   Our Hudi configs are:
   
   ```
   hudi_options = {
   "hoodie.table.name": "transactions",
   "hoodie.datasource.write.recordkey.field": "id.value",
   "hoodie.datasource.write.keygenerator.class": 
"org.apache.hudi.keygen.ComplexKeyGenerator",
   "hoodie.datasource.write.partitionpath.field": "year,month,day",
   "hoodie.datasource.write.table.name": "transactions",
   "hoodie.datasource.write.table.type": "MERGE_ON_READ",
   "hoodie.datasource.write.operation": "upsert",
   "hoodie.consistency.check.enabled": "true",
   "hoodie.datasource.write.precombine.field": "publishedAtUnixNano",
   "hoodie.compact.inline": True,
   "hoodie.compact.inline.max.delta.commits": 10,
   "hoodie.cleaner.commits.retained": 1,
   }
   ```
   
   Our Spark configs are (CORES_PER_EXECUTOR = 5, NUM_EXECUTORS = 30) for a 
cluster running on 10 r5.4xlarge instances:
   
   ```
   "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
   "spark.sql.hive.convertMetastoreParquet": "false",
   # Recommended from 
https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
   "spark.executor.cores": CORES_PER_EXECUTOR,
   "spark.executor.memory": "36g",
   "spark.yarn.executor.memoryOverhead": "4g",
   "spark.driver.cores": CORES_PER_EXECUTOR,
   "spark.driver.memory": "36g",
   "spark.executor.instances": NUM_EXECUTORS,
   "spark.default.parallelism": NUM_EXECUTORS * CORES_PER_EXECUTOR 
* 2,
   "spark.dynamicAllocation.enabled": "false",
   "spark.streaming.dynamicAllocation.enabled": "false",
   "spark.streaming.backpressure.enabled": "true",
   # Set max rate limit per receiver to limit memory usage
   "spark.streaming.receiver.maxRate": "10",
   # Shutdown gracefully on JVM shutdown
   "spark.streaming.stopGracefullyOnShutdown": "true",
   # GC options 
(https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html)
   "spark.executor.defaultJavaOptions": "-XX:+UseG1GC 
-XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
-XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark 
-XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12",
   "spark.driver.defaultJavaOptions": "-XX:+UseG1GC 
-XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
-XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark 
-XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12",
   "spark.streaming.kinesis.retry.waitTime": "1000ms",  # default 
is 100 ms
   "spark.streaming.kinesis.retry.maxAttempts": "10",  # default is 
10
   # Set max retries on S3 rate limits
   "spark.hadoop.fs.s3.maxRetries": "20",
   ```
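   
   One direction worth checking for this log-file-driven OOM is to cap Hudi's merge memory and log file sizes explicitly, so the spillable map used when merging log records spills to disk instead of growing on heap. A minimal Java sketch; the option keys are Hudi configs (HoodieMemoryConfig / HoodieStorageConfig), but the sizes are illustrative assumptions, not tuned recommendations:
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   
   class BoundedMemoryWrite {
     // df is the incoming micro-batch and outputPath the table base path (assumed
     // names); the hudi_options from above still apply, only the additions are shown.
     static void writeBatch(Dataset<Row> df, String outputPath) {
       df.write().format("org.apache.hudi")
           // Explicit budget (bytes) for the spillable map used when merging log records.
           .option("hoodie.memory.merge.max.size", String.valueOf(2L * 1024 * 1024 * 1024))
           .option("hoodie.memory.compaction.max.size", String.valueOf(2L * 1024 * 1024 * 1024))
           // Roll log files at ~256 MB instead of letting them approach 1 GB.
           .option("hoodie.logfile.max.size", String.valueOf(256L * 1024 * 1024))
           .mode("append")
           .save(outputPath);
     }
   }
   ```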
   
   **Environment Description**
   
   * Hudi version : 0.6.0
   
   * Spark version : 2.4.6 (EMR 5.31.0)
   
   * Hive version : Hive 2.3.7
   
   * Hadoop version : Amazon 2.10.0
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running 

[jira] [Updated] (HUDI-1509) Major performance degradation due to rewriting records with default values

2021-01-05 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1509:
-
Fix Version/s: 0.7.0

> Major performance degradation due to rewriting records with default values
> --
>
> Key: HUDI-1509
> URL: https://issues.apache.org/jira/browse/HUDI-1509
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Prashant Wason
>Priority: Blocker
> Fix For: 0.7.0
>
>
> During the in-house testing for 0.5x to 0.6x release upgrade, I have detected 
> a performance degradation for writes into HUDI. I have traced the issue due 
> to the changes in the following commit
> [[HUDI-727]: Copy default values of fields if not present when rewriting 
> incoming record with new 
> schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253]
> I wrote a unit test to reduce the scope of testing as follows:
> # Take an existing parquet file from production dataset (size=690MB, 
> #records=960K)
> # Read all the records from this parquet into a JavaRDD
> # Time the call HoodieWriteClient.bulkInsertPrepped(). 
> (bulkInsertParallelism=1)
> The above scenario is directly taken from our production pipelines where each 
> executor will ingest about a million record creating a single parquet file in 
> a COW dataset. This is bulk insert only dataset.
> The time to complete the bulk insert prepped *decreased from 680seconds to 
> 380seconds* when I reverted the above commit. 
> Schema details: This HUDI dataset uses a large schema with 51 fields in the 
> record.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1509) Major performance degradation due to rewriting records with default values

2021-01-05 Thread Prashant Wason (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259324#comment-17259324
 ] 

Prashant Wason commented on HUDI-1509:
--

So calling getCombinedFieldsToWrite() adds about 275usec to the write of each
record, while the ParquetWriter write call itself takes approximately 200usec.
With everything else remaining the same, this nearly doubles the time to
complete the ingestion.

 

> Major performance degradation due to rewriting records with default values
> --
>
> Key: HUDI-1509
> URL: https://issues.apache.org/jira/browse/HUDI-1509
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Prashant Wason
>Priority: Blocker
>
> During the in-house testing for 0.5x to 0.6x release upgrade, I have detected 
> a performance degradation for writes into HUDI. I have traced the issue due 
> to the changes in the following commit
> [[HUDI-727]: Copy default values of fields if not present when rewriting 
> incoming record with new 
> schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253]
> I wrote a unit test to reduce the scope of testing as follows:
> # Take an existing parquet file from production dataset (size=690MB, 
> #records=960K)
> # Read all the records from this parquet into a JavaRDD
> # Time the call HoodieWriteClient.bulkInsertPrepped(). 
> (bulkInsertParallelism=1)
> The above scenario is directly taken from our production pipelines where each 
> executor will ingest about a million record creating a single parquet file in 
> a COW dataset. This is bulk insert only dataset.
> The time to complete the bulk insert prepped *decreased from 680seconds to 
> 380seconds* when I reverted the above commit. 
> Schema details: This HUDI dataset uses a large schema with 51 fields in the 
> record.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1509) Major performance degradation due to rewriting records with default values

2021-01-05 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1509:
-
Affects Version/s: 0.7.0
   0.6.1
   0.6.0

> Major performance degradation due to rewriting records with default values
> --
>
> Key: HUDI-1509
> URL: https://issues.apache.org/jira/browse/HUDI-1509
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.6.0, 0.6.1, 0.7.0
>Reporter: Prashant Wason
>Priority: Blocker
> Fix For: 0.7.0
>
>
> During the in-house testing for 0.5x to 0.6x release upgrade, I have detected 
> a performance degradation for writes into HUDI. I have traced the issue due 
> to the changes in the following commit
> [[HUDI-727]: Copy default values of fields if not present when rewriting 
> incoming record with new 
> schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253]
> I wrote a unit test to reduce the scope of testing as follows:
> # Take an existing parquet file from production dataset (size=690MB, 
> #records=960K)
> # Read all the records from this parquet into a JavaRDD
> # Time the call HoodieWriteClient.bulkInsertPrepped(). 
> (bulkInsertParallelism=1)
> The above scenario is directly taken from our production pipelines where each 
> executor will ingest about a million record creating a single parquet file in 
> a COW dataset. This is bulk insert only dataset.
> The time to complete the bulk insert prepped *decreased from 680seconds to 
> 380seconds* when I reverted the above commit. 
> Schema details: This HUDI dataset uses a large schema with 51 fields in the 
> record.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1509) Major performance degradation due to rewriting records with default values

2021-01-05 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259332#comment-17259332
 ] 

Nishith Agarwal commented on HUDI-1509:
---

[~pwason] Thanks for digging into this and instrumenting the code in detail.

So the difference between the hash lookup in `record.get(f.name())` and the
check `if (!allFields.contains(f) && !isMetadataField(f.name()))` is 197 usec?

> Major performance degradation due to rewriting records with default values
> --
>
> Key: HUDI-1509
> URL: https://issues.apache.org/jira/browse/HUDI-1509
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.6.0, 0.6.1, 0.7.0
>Reporter: Prashant Wason
>Priority: Blocker
> Fix For: 0.7.0
>
>
> During the in-house testing for 0.5x to 0.6x release upgrade, I have detected 
> a performance degradation for writes into HUDI. I have traced the issue due 
> to the changes in the following commit
> [[HUDI-727]: Copy default values of fields if not present when rewriting 
> incoming record with new 
> schema|https://github.com/apache/hudi/commit/6d7ca2cf7e441ad19d32d7a25739e454f39ed253]
> I wrote a unit test to reduce the scope of testing as follows:
> # Take an existing parquet file from production dataset (size=690MB, 
> #records=960K)
> # Read all the records from this parquet into a JavaRDD
> # Time the call HoodieWriteClient.bulkInsertPrepped(). 
> (bulkInsertParallelism=1)
> The above scenario is directly taken from our production pipelines where each 
> executor will ingest about a million record creating a single parquet file in 
> a COW dataset. This is bulk insert only dataset.
> The time to complete the bulk insert prepped *decreased from 680seconds to 
> 380seconds* when I reverted the above commit. 
> Schema details: This HUDI dataset uses a large schema with 51 fields in the 
> record.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] lw309637554 commented on a change in pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering

2021-01-05 Thread GitBox


lw309637554 commented on a change in pull request #2379:
URL: https://github.com/apache/hudi/pull/2379#discussion_r552329734



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java
##
@@ -153,7 +161,12 @@ private int doSchedule(JavaSparkContext jsc) throws 
Exception {
 String schemaStr = getSchemaFromLatestInstant();
 SparkRDDWriteClient client =
 UtilHelpers.createHoodieClient(jsc, cfg.basePath, schemaStr, 
cfg.parallelism, Option.empty(), props);
-return client.scheduleClusteringAtInstant(cfg.clusteringInstantTime, 
Option.empty()) ? 0 : -1;
+Option instantTime = client.scheduleClustering(Option.empty());
+if (instantTime.isPresent()) {
+  cfg.clusteringInstantTime = instantTime.get();

Review comment:
   done, also add a method for test
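   
   Putting the two fragments of this review together, the resulting doSchedule flow looks roughly like this (a sketch of the discussed change, not the merged code):
   
   ```java
   private int doSchedule(JavaSparkContext jsc) throws Exception {
     String schemaStr = getSchemaFromLatestInstant();
     SparkRDDWriteClient client =
         UtilHelpers.createHoodieClient(jsc, cfg.basePath, schemaStr, cfg.parallelism, Option.empty(), props);
     // Let the client generate the instant time instead of requiring one up front.
     Option<String> instantTime = client.scheduleClustering(Option.empty());
     if (instantTime.isPresent()) {
       // Surface the generated instant so a later execute-only run can pick it up.
       cfg.clusteringInstantTime = instantTime.get();
       return 0;
     }
     return -1;
   }
   ```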





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wosow opened a new issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

2021-01-05 Thread GitBox


wosow opened a new issue #2409:
URL: https://github.com/apache/hudi/issues/2409


   
Spark Structured Streaming writes to Hudi and syncs to Hive, but only the
read-optimized table is created; the real-time table is never created, and no
errors are reported.
   
   
   **Environment Description**
   
   * Hudi version :0.6.0
   
   * Spark version : 2.4.4
   
   * Hive version : 2.3.7
   
   * Hadoop version : 2.7.5
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   code as follows:
   ```
   batchDF.write.format("org.apache.hudi")
     .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, "MERGE_ON_READ")
     .option(DataSourceWriteOptions.OPERATION_OPT_KEY, "upsert")
     .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, "10")
     .option("hoodie.datasource.compaction.async.enable", "true")
     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "rec_id")
     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "modified")
     .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "ads")
     .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, hiveTableName)
     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "dt")
     .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "dt")
     .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
     .option(HoodieWriteConfig.TABLE_NAME, hiveTableName)
     .option(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH, "true")
     .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
     .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, "jdbc:hive2://0.0.0.0:1")
     .option(DataSourceWriteOptions.HIVE_USER_OPT_KEY, "")
     .option(DataSourceWriteOptions.HIVE_PASS_OPT_KEY, "")
     .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, classOf[MultiPartKeysValueExtractor].getName)
     .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.GLOBAL_BLOOM.name())
     .option("hoodie.insert.shuffle.parallelism", "10")
     .option("hoodie.upsert.shuffle.parallelism", "10")
     .mode("append")
     .save("/data/mor/user")
   ```
   
   Only the user_ro table is created; there is no user_rt table.
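   
   For a MERGE_ON_READ table, Hive sync should register both the read-optimized (_ro) and real-time (_rt) tables, so one way to isolate whether the writer or the sync step is at fault is to run the sync standalone. A minimal sketch, assuming HiveSyncConfig's 0.6.0 field names and a hypothetical Hive port:
   
   ```java
   import java.util.Collections;
   
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.hive.conf.HiveConf;
   import org.apache.hudi.hive.HiveSyncConfig;
   import org.apache.hudi.hive.HiveSyncTool;
   import org.apache.hudi.hive.MultiPartKeysValueExtractor;
   
   public class StandaloneHiveSync {
     public static void main(String[] args) throws Exception {
       HiveSyncConfig cfg = new HiveSyncConfig();
       cfg.databaseName = "ads";
       cfg.tableName = "user";                     // MOR should yield user_ro and user_rt
       cfg.basePath = "/data/mor/user";
       cfg.jdbcUrl = "jdbc:hive2://0.0.0.0:10000"; // port is a placeholder assumption
       cfg.partitionFields = Collections.singletonList("dt");
       cfg.partitionValueExtractorClass = MultiPartKeysValueExtractor.class.getName();
   
       HiveConf hiveConf = new HiveConf();
       FileSystem fs = new Path(cfg.basePath).getFileSystem(hiveConf);
       new HiveSyncTool(cfg, hiveConf, fs).syncHoodieTable();
     }
   }
   ```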
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on pull request #2405: [HUDI-1506] Fix wrong exception thrown in HoodieAvroUtils

2021-01-05 Thread GitBox


yanghua commented on pull request #2405:
URL: https://github.com/apache/hudi/pull/2405#issuecomment-754997547


   @wangxianghu Please check Travis again.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on a change in pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering

2021-01-05 Thread GitBox


lw309637554 commented on a change in pull request #2379:
URL: https://github.com/apache/hudi/pull/2379#discussion_r552329605



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java
##
@@ -109,6 +111,9 @@ public static void main(String[] args) {
 if (result == -1) {
   LOG.error(resultMsg + " failed");
 } else {
+  if (cfg.runSchedule) {
+System.out.println("The schedule instant time is " + 
cfg.clusteringInstantTime);

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on a change in pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering

2021-01-05 Thread GitBox


lw309637554 commented on a change in pull request #2379:
URL: https://github.com/apache/hudi/pull/2379#discussion_r552318635



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java
##
@@ -109,6 +111,9 @@ public static void main(String[] args) {
 if (result == -1) {
   LOG.error(resultMsg + " failed");
 } else {
+  if (cfg.runSchedule) {
+System.out.println("The schedule instant time is " + 
cfg.clusteringInstantTime);

Review comment:
   I think System.out.println will be better: printing the instant time with 
System.out.println sends the string to the Spark stdout shown in the UI, 
whereas LOG would, by default, write it to stderr along with many other log 
lines.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on pull request #2375: [HUDI-1332] Introduce FlinkHoodieBloomIndex to hudi-flink-client

2021-01-05 Thread GitBox


yanghua commented on pull request #2375:
URL: https://github.com/apache/hudi/pull/2375#issuecomment-755056830


   > > > Hi @garyli1019. Maybe I think the current implementation is OK. 
Because even in a streaming job, we need to accumulate batched records in 
memory during the checkpoint cycle and upsert the data into the hudi-table when 
the checkpoint triggers. WDYT?
   > > 
   > > 
   > > @Nieal-Yang sorry about the late reply. IMO the batch mode will work but 
does not fully take advantage of Flink. But we can optimize this step by step. 
To ensure this PR will work, would you add some unit tests to this PR?
   > 
   > okay. This index has been used in production in my company. We will 
optimize it gradually.
   
   Hi @Nieal-Yang Would you please tell me which Flink version you are 
using?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ivorzhou commented on pull request #2091: HUDI-1283 Fill missing columns with default value when spark dataframe save to hudi table

2021-01-05 Thread GitBox


ivorzhou commented on pull request #2091:
URL: https://github.com/apache/hudi/pull/2091#issuecomment-755121964


   > @ivorzhou : is the requirement to set default value or value from previous 
version of the record? if previous version of the record, then guess we already 
have another PR for this #2106
   
   I want to update a record by setting only the specified fields, like a SQL update:  
   UPDATE t SET col='value' WHERE id=1   
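   
   If the goal is a SQL-style partial update, one existing avenue (independent of this PR) is a payload class that merges the incoming record with the stored one. A minimal sketch; OverwriteNonDefaultsWithLatestAvroPayload is an existing Hudi payload that keeps stored values wherever the incoming field equals its Avro default, but verify those merge semantics fit the use case:
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   
   class PartialUpdateWrite {
     // updates carries the record key plus only the columns to change; fields
     // left at their Avro defaults keep the stored values under this payload.
     static void partialUpdate(Dataset<Row> updates, String basePath) {
       updates.write().format("org.apache.hudi")
           .option("hoodie.datasource.write.payload.class",
               "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload")
           .option("hoodie.datasource.write.operation", "upsert")
           // record key, precombine, and table-name options omitted for brevity
           .mode("append")
           .save(basePath);
     }
   }
   ```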



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1510) Move HoodieEngineContext to hudi-common module

2021-01-05 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1510:

  Component/s: (was: Writer Core)
   (was: Common Core)
   Code Cleanup
Fix Version/s: 0.7.0

> Move HoodieEngineContext to hudi-common module
> --
>
> Key: HUDI-1510
> URL: https://issues.apache.org/jira/browse/HUDI-1510
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Code Cleanup
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Major
> Fix For: 0.7.0
>
>
> We want to parallelize file system listings in Hudi, which after RFC-15 
> implementation will be happening through class *FileSystemBackedMetadata* 
> which is in hudi-common. Same with FSUtils which sits in hudi-common. Thus, 
> we want to move HoodieEngineContext to hudi-common to be able to parallelize 
> the file/partition listing APIs using the engine.
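
For illustration, a minimal sketch of the engine-driven parallel listing this move enables. The map signature mirrors HoodieEngineContext, but treat the exact API shape as an assumption; real Spark usage would also need a serializable wrapper around the Hadoop Configuration:

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hudi.common.engine.HoodieEngineContext;

class ParallelListing {
  // List several partition paths in parallel through the engine context:
  // a Spark engine distributes the closures, a local engine simply loops.
  static List<FileStatus[]> listPartitions(HoodieEngineContext context, Configuration conf,
                                           List<String> partitionPaths) {
    return context.map(partitionPaths, partitionPath -> {
      Path path = new Path(partitionPath);
      return path.getFileSystem(conf).listStatus(path);
    }, partitionPaths.size());
  }
}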



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] umehrot2 opened a new pull request #2410: [HUDI-1510] Move HoodieEngineContext and its dependencies to hudi-common

2021-01-05 Thread GitBox


umehrot2 opened a new pull request #2410:
URL: https://github.com/apache/hudi/pull/2410


   ## What is the purpose of the pull request
   
   Moves HoodieEngineContext class and its dependencies to hudi-common, so that 
we can parallelize fetching of files and partitions in 
FileSystemBackedTableMetadata and FSUtils.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1510) Move HoodieEngineContext to hudi-common module

2021-01-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1510:
-
Labels: pull-request-available  (was: )

> Move HoodieEngineContext to hudi-common module
> --
>
> Key: HUDI-1510
> URL: https://issues.apache.org/jira/browse/HUDI-1510
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Code Cleanup
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> We want to parallelize file system listings in Hudi, which after RFC-15 
> implementation will be happening through class *FileSystemBackedMetadata* 
> which is in hudi-common. Same with FSUtils which sits in hudi-common. Thus, 
> we want to move HoodieEngineContext to hudi-common to be able to parallelize 
> the file/partition listing APIs using the engine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] codecov-io edited a comment on pull request #2379: [HUDI-1399] support a independent clustering spark job to asynchronously clustering

2021-01-05 Thread GitBox


codecov-io edited a comment on pull request #2379:
URL: https://github.com/apache/hudi/pull/2379#issuecomment-751244130


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=h1) Report
   > Merging 
[#2379](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=desc) (d53595e) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/6cdf59d92b1c260abae82bba7d30d8ac280bddbf?el=desc)
 (6cdf59d) will **decrease** coverage by `42.57%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2379/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=tree)
   
   ```diff
   @@             Coverage Diff             @@
   ##             master    #2379     +/-   ##
   ============================================
   - Coverage     52.23%    9.65%   -42.58%
   + Complexity     2662       48     -2614
   ============================================
     Files           335       53      -282
     Lines         14981     1926    -13055
     Branches       1506      231     -1275
   ============================================
   - Hits           7825      186     -7639
   + Misses         6533     1727     -4806
   + Partials        623       13      -610
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.65% <0.00%> (-60.00%)` | `0.00 <0.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2379?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...org/apache/hudi/utilities/HoodieClusteringJob.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZUNsdXN0ZXJpbmdKb2IuamF2YQ==)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `0.00% <0.00%> (-70.76%)` | `0.00 <0.00> (-49.00)` | |
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | ... and [312 
more](https://codecov.io/gh/apache/hudi/pull/2379/diff?src=pr=tree-more) 

[jira] [Created] (HUDI-1510) Move HoodieEngineContext to hudi-common module

2021-01-05 Thread Udit Mehrotra (Jira)
Udit Mehrotra created HUDI-1510:
---

 Summary: Move HoodieEngineContext to hudi-common module
 Key: HUDI-1510
 URL: https://issues.apache.org/jira/browse/HUDI-1510
 Project: Apache Hudi
  Issue Type: Bug
  Components: Common Core, Writer Core
Reporter: Udit Mehrotra
Assignee: Udit Mehrotra


We want to parallelize file system listings in Hudi, which after RFC-15 
implementation will be happening through class *FileSystemBackedMetadata* which 
is in hudi-common. Same with FSUtils which sits in hudi-common. Thus, we want 
to move HoodieEngineContext to hudi-common to be able to parallelize the 
file/partition listing APIs using the engine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1510) Move HoodieEngineContext to hudi-common module

2021-01-05 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1510:

Issue Type: Improvement  (was: Bug)

> Move HoodieEngineContext to hudi-common module
> --
>
> Key: HUDI-1510
> URL: https://issues.apache.org/jira/browse/HUDI-1510
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> We want to parallelize file system listings in Hudi, which after RFC-15 
> implementation will be happening through class *FileSystemBackedMetadata* 
> which is in hudi-common. Same with FSUtils which sits in hudi-common. Thus, 
> we want to move HoodieEngineContext to hudi-common to be able to parallelize 
> the file/partition listing APIs using the engine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] codecov-io edited a comment on pull request #2405: [HUDI-1506] Fix wrong exception thrown in HoodieAvroUtils

2021-01-05 Thread GitBox


codecov-io edited a comment on pull request #2405:
URL: https://github.com/apache/hudi/pull/2405#issuecomment-754665459


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2405?src=pr=h1) Report
   > Merging 
[#2405](https://codecov.io/gh/apache/hudi/pull/2405?src=pr=desc) (9bded33) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/698694a1571cdcc9848fc79aa34c8cbbf9662bc4?el=desc)
 (698694a) will **decrease** coverage by `40.19%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2405/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2405?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2405      +/-   ##
   =============================================
   - Coverage     50.23%   10.04%   -40.20%
   + Complexity     2985       48     -2937
   =============================================
     Files           410       52      -358
     Lines         18398     1852    -16546
     Branches       1884      223     -1661
   =============================================
   - Hits           9242      186     -9056
   + Misses         8398     1653     -6745
   + Partials        758       13      -745
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `10.04% <ø> (-59.62%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2405?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2405/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | 

[GitHub] [hudi] Nieal-Yang commented on pull request #2375: [HUDI-1332] Introduce FlinkHoodieBloomIndex to hudi-flink-client

2021-01-05 Thread GitBox


Nieal-Yang commented on pull request #2375:
URL: https://github.com/apache/hudi/pull/2375#issuecomment-755103932


   > > > > Hi @garyli1019. Maybe I think the current implementation is OK. 
Because even in a streaming job, we need to accumulate batched records in 
memory during the checkpoint cycle and upsert the data into the hudi-table when 
the checkpoint triggers. WDYT?
   > > > 
   > > > 
   > > > @Nieal-Yang sorry about the late reply. IMO the batch mode will work 
but does not fully take advantage of Flink. But we can optimize this step by 
step. To ensure this PR will work, would you add some unit tests to this PR?
   > > 
   > > 
   > > okay. This index has been used in production in my company. We will 
optimize it gradually.
   > 
   > Hi @Nieal-Yang Would you please tell me which Flink version you are 
using?
   
   @yanghua Yeah. The version we are using is flink-1.11.2.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2405: [HUDI-1506] Fix wrong exception thrown in HoodieAvroUtils

2021-01-05 Thread GitBox


codecov-io edited a comment on pull request #2405:
URL: https://github.com/apache/hudi/pull/2405#issuecomment-754665459







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org