[GitHub] [hudi] nsivabalan edited a comment on issue #2839: HIVE_SKIP_RO_SUFFIX config not working

2021-04-16 Thread GitBox


nsivabalan edited a comment on issue #2839:
URL: https://github.com/apache/hudi/issues/2839#issuecomment-821765589


   thanks for reporting it. Looks like it's a bug; I have filed a 
[jira](https://issues.apache.org/jira/browse/HUDI-1806). Feel free to put in a 
patch if interested :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan closed issue #2839: HIVE_SKIP_RO_SUFFIX config not working

2021-04-16 Thread GitBox


nsivabalan closed issue #2839:
URL: https://github.com/apache/hudi/issues/2839


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2839: HIVE_SKIP_RO_SUFFIX config not working

2021-04-16 Thread GitBox


nsivabalan commented on issue #2839:
URL: https://github.com/apache/hudi/issues/2839#issuecomment-821765589


   Looks like it's a bug; I have filed a 
[jira](https://issues.apache.org/jira/browse/HUDI-1806). Feel free to put in a 
patch if interested :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (HUDI-1805) Honor "skipROSuffix" in spark ds

2021-04-16 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-1805.
-
Resolution: Duplicate

> Honor "skipROSuffix" in spark ds
> 
>
> Key: HUDI-1805
> URL: https://issues.apache.org/jira/browse/HUDI-1805
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: sev:high
> Fix For: 0.9.0
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> In HoodieSparkSqlWriter#buildSyncConfig(), we don't set skipROSuffix based on 
> configs. This needs fixing. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1806) Honor "skipROSuffix" in spark ds

2021-04-16 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-1806:
-

 Summary: Honor "skipROSuffix" in spark ds
 Key: HUDI-1806
 URL: https://issues.apache.org/jira/browse/HUDI-1806
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Reporter: sivabalan narayanan
 Fix For: 0.9.0


In HoodieSparkSqlWriter#buildSyncConfig(), we don't set skipROSuffix based on 
configs. This needs fixing. 
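
A hypothetical, self-contained sketch of the intended fix, i.e. reading the flag from the 
write options instead of always leaving it at its default (`hoodie.datasource.hive_sync.skip_ro_suffix` 
is believed to be the corresponding option key; everything else below is a placeholder, not the 
actual HoodieSparkSqlWriter code):

```scala
// Illustrative only: derive the skip-RO-suffix setting from the supplied write options.
final case class SyncConfigSketch(skipROSuffix: Boolean)

def buildSyncConfigSketch(options: Map[String, String]): SyncConfigSketch =
  SyncConfigSketch(
    skipROSuffix = options
      .getOrElse("hoodie.datasource.hive_sync.skip_ro_suffix", "false")
      .toBoolean
  )
```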



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] rubenssoto commented on issue #2588: [SUPPORT] Cannot create hive connection

2021-04-16 Thread GitBox


rubenssoto commented on issue #2588:
URL: https://github.com/apache/hudi/issues/2588#issuecomment-821765451


   @nsivabalan 
   I really tried; our migration to Hudi has been delayed by more than 2 months, we 
have had a ticket open with AWS for 2 months, and no solution was given to us.
   
   We will create a file in the table folder with all table columns; when the new 
dataframe has columns that differ from that file, we will enable Hudi Hive sync. I 
think it could solve the problem for now, until AWS gives us a better solution.
   
   Another approach that we want to try is to sync the Hive table through the 
metastore, disabling JDBC; a rough sketch of that configuration follows.
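   A minimal sketch, assuming the standard `hoodie.datasource.hive_sync.*` write options 
(please verify the exact keys against your Hudi version):

```scala
// Illustrative only: hive-sync options for syncing through the Hive metastore
// client instead of a JDBC connection. Keys are assumed; values are placeholders.
val hiveSyncOptions: Map[String, String] = Map(
  "hoodie.datasource.hive_sync.enable"   -> "true",
  "hoodie.datasource.hive_sync.use_jdbc" -> "false" // use the metastore client instead of JDBC
)
```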


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1805) Honor "skipROSuffix" in spark ds

2021-04-16 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-1805:
-

 Summary: Honor "skipROSuffix" in spark ds
 Key: HUDI-1805
 URL: https://issues.apache.org/jira/browse/HUDI-1805
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Reporter: sivabalan narayanan
 Fix For: 0.9.0


In HoodieSparkSqlWriter#buildSyncConfig(), we don't set skipROSuffix based on 
configs. This needs fixing. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan commented on issue #2838: [SUPPORT]User class threw exception: org.apache.hudi.exception.HoodieMetadataException: Error syncing to metadata table

2021-04-16 Thread GitBox


nsivabalan commented on issue #2838:
URL: https://github.com/apache/hudi/issues/2838#issuecomment-821764890


   @prashantwason : can you assist here. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2834: [SUPPORT] Help~~~org.apache.hudi.exception.TableNotFoundException

2021-04-16 Thread GitBox


nsivabalan commented on issue #2834:
URL: https://github.com/apache/hudi/issues/2834#issuecomment-821764795


   @yanghua: this is related to Flink. Can you assist here, or loop in someone who can?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan edited a comment on issue #2830: [SUPPORT]same _hoodie_record_key has duplicates data

2021-04-16 Thread GitBox


nsivabalan edited a comment on issue #2830:
URL: https://github.com/apache/hudi/issues/2830#issuecomment-821764581


   @wsxGit: Is it possible to give us a reproducible code snippet? Also, 
can you clarify whether you are seeing duplicates in both Spark DS/Spark SQL and 
Hive, or just in Hive? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2830: [SUPPORT]same _hoodie_record_key has duplicates data

2021-04-16 Thread GitBox


nsivabalan commented on issue #2830:
URL: https://github.com/apache/hudi/issues/2830#issuecomment-821764581


   Is it possible to give us a reproducible code snippet? Also, can you 
clarify whether you are seeing duplicates in both Spark DS/Spark SQL and Hive, or 
just in Hive? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2829: Getting an Exception Property hoodie.deltastreamer.schemaprovider.registry.baseUrl not found

2021-04-16 Thread GitBox


nsivabalan commented on issue #2829:
URL: https://github.com/apache/hudi/issues/2829#issuecomment-821764391


   @pratyakshsharma: can you take this up? It is related to 
HoodieMultiTableDeltaStreamer.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2588: [SUPPORT] Cannot create hive connection

2021-04-16 Thread GitBox


nsivabalan commented on issue #2588:
URL: https://github.com/apache/hudi/issues/2588#issuecomment-821761181


   And w.r.t. your point about syncing Hive only when required:
   we might have to establish a Hive connection to get the schema, compare it w/ the 
latest Hudi schema, and find differences in partitions. So I am not sure how we 
can avoid creating Hive connections even when no changes have been made to the 
schema and partitions. 
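   To make that concrete, a rough sketch of the "sync only when something changed" check 
being discussed (purely illustrative; this is not the actual HiveSyncTool logic, and all 
names below are placeholders):

```scala
// Deciding whether a sync is needed already requires reading the current Hive
// state (schema + partitions), which is why a Hive connection cannot simply be
// skipped even when nothing has changed.
def needsSync(hiveSchema: Map[String, String], hudiSchema: Map[String, String],
              hivePartitions: Set[String], hudiPartitions: Set[String]): Boolean =
  hiveSchema != hudiSchema || hivePartitions != hudiPartitions
```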


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2588: [SUPPORT] Cannot create hive connection

2021-04-16 Thread GitBox


nsivabalan commented on issue #2588:
URL: https://github.com/apache/hudi/issues/2588#issuecomment-821760500


   Did you file a ticket w/ EMR then? If you know the actual root cause and the 
fix, it would be good to bring it up w/ the EMR folks so that EMR can patch the fix 
and others can benefit. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1769) websites updates for 0.8.0 release

2021-04-16 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1769:
--
Description: 
Dumping all changes required to be done in our website wrt 0.8.0 release.
 # fix min spark supported version. 2.4.3 and 3.*
 # talk about diff bundles we have (Hudi-spark-bundles) and when to use which 
one.
 # configurations
 ## bulk insert row writer. 
 ## [https://github.com/apache/hudi/issues/2760]
 ## hoodie.merge.allow.duplicate.on.inserts
 # migration guide
 # doc updates : moving all docs to 0.8.0 dir etc. regular release guide. 

  was:
Dumping all changes required to be done in our website wrt 0.8.0 release.
 # fix min spark supported version. 2.4.3 and 3.*
 # talk about diff bundles we have (Hudi-spark-bundles) and when to use which 
one.
 # configurations
 ## bulk insert row writer. 
 ## https://github.com/apache/hudi/issues/2760
 # migration guide
 # doc updates : moving all docs to 0.8.0 dir etc. regular release guide. 


> websites updates for 0.8.0 release
> --
>
> Key: HUDI-1769
> URL: https://issues.apache.org/jira/browse/HUDI-1769
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: sivabalan narayanan
>Assignee: Gary Li
>Priority: Major
>
> Dumping all changes required to be done in our website wrt 0.8.0 release.
>  # fix min spark supported version. 2.4.3 and 3.*
>  # talk about diff bundles we have (Hudi-spark-bundles) and when to use which 
> one.
>  # configurations
>  ## bulk insert row writer. 
>  ## [https://github.com/apache/hudi/issues/2760]
>  ## hoodie.merge.allow.duplicate.on.inserts
>  # migration guide
>  # doc updates : moving all docs to 0.8.0 dir etc. regular release guide. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan closed issue #2656: HUDI insert operation is working same as upsert

2021-04-16 Thread GitBox


nsivabalan closed issue #2656:
URL: https://github.com/apache/hudi/issues/2656


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2656: HUDI insert operation is working same as upsert

2021-04-16 Thread GitBox


nsivabalan commented on issue #2656:
URL: https://github.com/apache/hudi/issues/2656#issuecomment-821758611


   My bad. In the latest release we added support for allowing duplicates w/ 
INSERTs, btw; I forgot about it when I posted earlier. 
   Ensure you set "hoodie.merge.allow.duplicate.on.inserts" to true and your 
operation is "insert". You should see duplicates if you try to ingest the same 
batch of records again. 
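   A minimal, self-contained sketch of such a write, assuming the hudi-spark bundle is on 
the classpath (the two options named above are the documented ones; the table name, 
key/precombine fields, and path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: insert a batch with duplicates allowed on insert; running the
// same write twice should then yield duplicate records for the same keys.
val spark = SparkSession.builder().appName("hudi-insert-duplicates").getOrCreate()
val df = spark.range(0, 10)
  .selectExpr("id", "cast(id as long) as ts", "concat('name_', cast(id as string)) as name")

df.write.format("hudi")
  .option("hoodie.table.name", "demo_table")                   // placeholder
  .option("hoodie.datasource.write.operation", "insert")
  .option("hoodie.datasource.write.recordkey.field", "id")     // placeholder
  .option("hoodie.datasource.write.precombine.field", "ts")    // placeholder
  .option("hoodie.merge.allow.duplicate.on.inserts", "true")
  .mode("append")
  .save("/tmp/demo_table")                                     // placeholder path
```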


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2656: HUDI insert operation is working same as upsert

2021-04-16 Thread GitBox


nsivabalan commented on issue #2656:
URL: https://github.com/apache/hudi/issues/2656#issuecomment-821757914


   I guess I understand what's happening. In COW, when creating a new data 
file, Hudi reads the existing data and merges it w/ the incoming data. From the 
merging standpoint, (partition path, record key) pairs are considered unique, and so 
even if we insert the same batch again, the new data file will not have duplicated 
data. One option is to create unique keys for every record, as suggested by 
@pengzhiwei2018. 
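   One lightweight way to do that (illustrative only; the column and app names used here 
are placeholders, and the record key option is the standard write option, not something 
specific to this issue):

```scala
import java.util.UUID
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Tag every record with a generated key so repeated inserts of the same batch are
// not collapsed by Hudi's (partition path, record key) uniqueness.
val spark = SparkSession.builder().appName("unique-key-example").getOrCreate()
val genKey = udf(() => UUID.randomUUID().toString)
val dfWithKey = spark.range(0, 5).withColumn("uuid_key", genKey())
// then write with .option("hoodie.datasource.write.recordkey.field", "uuid_key")
```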
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2787: [SUPPORT] Error upserting bucketType UPDATE for partition

2021-04-16 Thread GitBox


nsivabalan commented on issue #2787:
URL: https://github.com/apache/hudi/issues/2787#issuecomment-821755784


   Can you try enabling schema validation and we can go from there. 
   ```
   hoodie.avro.schema.validate
   ```
   set this property to true. 
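   For example, through the Spark datasource write path (illustrative helper only; `df` 
and `basePath` are placeholders and the other required Hudi write options are omitted):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Illustrative only: write with Avro schema validation enabled.
def writeWithSchemaValidation(df: DataFrame, basePath: String): Unit =
  df.write.format("hudi")
    .option("hoodie.avro.schema.validate", "true")
    // ...plus the usual table name / record key / precombine options...
    .mode(SaveMode.Append)
    .save(basePath)
```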


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter edited a comment on pull request #2843: [HUDI-1804] Continue to write when Flink write task restart because o…

2021-04-16 Thread GitBox


codecov-commenter edited a comment on pull request #2843:
URL: https://github.com/apache/hudi/pull/2843#issuecomment-821748199


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/2843?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#2843](https://codecov.io/gh/apache/hudi/pull/2843?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (dc3e5c2) into 
[master](https://codecov.io/gh/apache/hudi/commit/b6d949b48a649acac27d5d9b91677bf2e25e9342?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (b6d949b) will **decrease** coverage by `0.01%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2843/graphs/tree.svg?width=650=150=pr=VTTXabwbs2_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2843?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2843      +/-   ##
   ============================================
   - Coverage     52.60%   52.58%   -0.02%
   + Complexity     3709     3708       -1
   ============================================
     Files           485      485
     Lines         23224    23227       +3
     Branches       2465     2466       +1
   ============================================
   - Hits          12216    12214       -2
   - Misses         9929     9934       +5
     Partials       1079     1079
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `40.29% <ø> (ø)` | `215.00 <ø> (ø)` | |
   | hudiclient | `∅ <ø> (∅)` | `0.00 <ø> (ø)` | |
   | hudicommon | `50.66% <ø> (-0.03%)` | `1976.00 <ø> (-1.00)` | |
   | hudiflink | `56.51% <ø> (-0.04%)` | `516.00 <ø> (ø)` | |
   | hudihadoopmr | `33.33% <ø> (ø)` | `198.00 <ø> (ø)` | |
   | hudisparkdatasource | `72.06% <ø> (ø)` | `237.00 <ø> (ø)` | |
   | hudisync | `45.70% <ø> (ø)` | `131.00 <ø> (ø)` | |
   | huditimelineservice | `64.36% <ø> (ø)` | `62.00 <ø> (ø)` | |
   | hudiutilities | `69.79% <ø> (ø)` | `373.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2843?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...ache/hudi/common/fs/inline/InMemoryFileSystem.java](https://codecov.io/gh/apache/hudi/pull/2843/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9Jbk1lbW9yeUZpbGVTeXN0ZW0uamF2YQ==)
 | `79.31% <0.00%> (-10.35%)` | `15.00% <0.00%> (-1.00%)` | |
   | 
[.../hudi/table/format/cow/CopyOnWriteInputFormat.java](https://codecov.io/gh/apache/hudi/pull/2843/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9mb3JtYXQvY293L0NvcHlPbldyaXRlSW5wdXRGb3JtYXQuamF2YQ==)
 | `55.33% <0.00%> (-0.75%)` | `20.00% <0.00%> (ø%)` | |
   | 
[...java/org/apache/hudi/common/fs/StorageSchemes.java](https://codecov.io/gh/apache/hudi/pull/2843/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL1N0b3JhZ2VTY2hlbWVzLmphdmE=)
 | `100.00% <0.00%> (ø)` | `10.00% <0.00%> (ø%)` | |
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter edited a comment on pull request #2843: [HUDI-1804] Continue to write when Flink write task restart because o…

2021-04-16 Thread GitBox


codecov-commenter edited a comment on pull request #2843:
URL: https://github.com/apache/hudi/pull/2843#issuecomment-821748199






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter commented on pull request #2843: [HUDI-1804] Continue to write when Flink write task restart because o…

2021-04-16 Thread GitBox


codecov-commenter commented on pull request #2843:
URL: https://github.com/apache/hudi/pull/2843#issuecomment-821748199


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/2843?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#2843](https://codecov.io/gh/apache/hudi/pull/2843?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (dc3e5c2) into 
[master](https://codecov.io/gh/apache/hudi/commit/b6d949b48a649acac27d5d9b91677bf2e25e9342?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (b6d949b) will **increase** coverage by `17.19%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2843/graphs/tree.svg?width=650=150=pr=VTTXabwbs2_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2843?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2843       +/-   ##
   =============================================
   + Coverage     52.60%   69.79%   +17.19%
   + Complexity     3709      373     -3336
   =============================================
     Files           485       54      -431
     Lines         23224     1993    -21231
     Branches       2465      235     -2230
   =============================================
   - Hits          12216     1391    -10825
   + Misses         9929      471     -9458
   + Partials       1079      131      -948
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.79% <ø> (ø)` | `373.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2843?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...hadoop/realtime/RealtimeCompactedRecordReader.java](https://codecov.io/gh/apache/hudi/pull/2843/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL1JlYWx0aW1lQ29tcGFjdGVkUmVjb3JkUmVhZGVyLmphdmE=)
 | | | |
   | 
[...ache/hudi/common/util/collection/DiskBasedMap.java](https://codecov.io/gh/apache/hudi/pull/2843/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvY29sbGVjdGlvbi9EaXNrQmFzZWRNYXAuamF2YQ==)
 | | | |
   | 
[...a/org/apache/hudi/common/util/ClusteringUtils.java](https://codecov.io/gh/apache/hudi/pull/2843/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvQ2x1c3RlcmluZ1V0aWxzLmphdmE=)
 | | | |
   | 
[...va/org/apache/hudi/metadata/BaseTableMetadata.java](https://codecov.io/gh/apache/hudi/pull/2843/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvQmFzZVRhYmxlTWV0YWRhdGEuamF2YQ==)
 | | | |
   | 
[.../apache/hudi/common/bootstrap/FileStatusUtils.java](https://codecov.io/gh/apache/hudi/pull/2843/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2Jvb3RzdHJhcC9GaWxlU3RhdHVzVXRpbHMuamF2YQ==)
 | | | |
   | 
[...mmon/table/log/block/HoodieDeleteBlockVersion.java](https://codecov.io/gh/apache/hudi/pull/2843/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9ibG9jay9Ib29kaWVEZWxldGVCbG9ja1ZlcnNpb24uamF2YQ==)
 | | | |
   | 

[jira] [Updated] (HUDI-1765) disabling auto commit does not work with spark ds

2021-04-16 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1765:
--
Description: 
commitAndPerformPostOperations in HoodieSparkSqlWriter commits all operations 
w/o checking for autoCommitProp. So, this prop is not being honored at spark ds 
layer. 

 

Added below test to TestCOWDataSource and it fails. 

[https://gist.github.com/nsivabalan/479320351bec95ba3e0c0dfa5abb2b06]

 

  was:commitAndPerformPostOperations in HoodieSparkSqlWriter commits all 
operations w/o checking for autoCommitProp. So, this prop is not being honored 
at spark ds layer. 


> disabling auto commit does not work with spark ds
> -
>
> Key: HUDI-1765
> URL: https://issues.apache.org/jira/browse/HUDI-1765
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: sev:critical
> Fix For: 0.9.0
>
>
> commitAndPerformPostOperations in HoodieSparkSqlWriter commits all operations 
> w/o checking for autoCommitProp. So, this prop is not being honored at spark 
> ds layer. 
>  
> Added below test to TestCOWDataSource and it fails. 
> [https://gist.github.com/nsivabalan/479320351bec95ba3e0c0dfa5abb2b06]
>  
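
For illustration, a hypothetical sketch of what honoring the flag at the spark ds layer 
could look like (`hoodie.auto.commit` is believed to be the relevant write-config key; 
everything else below is a placeholder, not the actual HoodieSparkSqlWriter code):

```scala
// Illustrative only: commit eagerly only when auto-commit is enabled; otherwise
// leave the inflight instant for the caller to commit explicitly.
def maybeCommit(options: Map[String, String], doCommit: () => Unit): Unit =
  if (options.getOrElse("hoodie.auto.commit", "true").toBoolean) {
    doCommit()
  }
```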



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1765) disabling auto commit does not work with spark ds

2021-04-16 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1765:
--
Description: commitAndPerformPostOperations in HoodieSparkSqlWriter commits 
all operations w/o checking for autoCommitProp. So, this prop is not being 
honored at spark ds layer.   (was: looks like when delete operation auto 
commits even if auto commit config is set to false. 

 

 )

> disabling auto commit does not work with spark ds
> -
>
> Key: HUDI-1765
> URL: https://issues.apache.org/jira/browse/HUDI-1765
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: sev:critical
> Fix For: 0.9.0
>
>
> commitAndPerformPostOperations in HoodieSparkSqlWriter commits all operations 
> w/o checking for autoCommitProp. So, this prop is not being honored at spark 
> ds layer. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1765) disabling auto commit does not work with spark ds

2021-04-16 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1765:
--
Summary: disabling auto commit does not work with spark ds  (was: disabling 
auto commit does not work for delete operation)

> disabling auto commit does not work with spark ds
> -
>
> Key: HUDI-1765
> URL: https://issues.apache.org/jira/browse/HUDI-1765
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: sev:critical
> Fix For: 0.9.0
>
>
> looks like when delete operation auto commits even if auto commit config is 
> set to false. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1765) disabling auto commit does not work for delete operation

2021-04-16 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1765:
--
Labels: sev:critical  (was: sev:normal)

> disabling auto commit does not work for delete operation
> 
>
> Key: HUDI-1765
> URL: https://issues.apache.org/jira/browse/HUDI-1765
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: sev:critical
> Fix For: 0.9.0
>
>
> looks like when delete operation auto commits even if auto commit config is 
> set to false. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] jtmzheng commented on issue #2566: [SUPPORT] Unable to read Hudi MOR data set in a test on 0.7

2021-04-16 Thread GitBox


jtmzheng commented on issue #2566:
URL: https://github.com/apache/hudi/issues/2566#issuecomment-821583643


   @nsivabalan I tried bumping to 0.8.0 in the Dockerfile provided above but 
I'm running into the same errors. Can you reproduce on your end? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter commented on pull request #2842: Added metric reporter Prometheus to HoodieBackedTableMetadataWriter

2021-04-16 Thread GitBox


codecov-commenter commented on pull request #2842:
URL: https://github.com/apache/hudi/pull/2842#issuecomment-821311104


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/2842?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#2842](https://codecov.io/gh/apache/hudi/pull/2842?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (cb8c4a9) into 
[master](https://codecov.io/gh/apache/hudi/commit/1d53d6e6c2d3ff967d333d87b308406c8cd6917a?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (1d53d6e) will **increase** coverage by `17.25%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2842/graphs/tree.svg?width=650=150=pr=VTTXabwbs2_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2842?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2842       +/-   ##
   =============================================
   + Coverage     52.58%   69.84%   +17.25%
   + Complexity     3709      374     -3335
   =============================================
     Files           485       54      -431
     Lines         23227     1993    -21234
     Branches       2466      235     -2231
   =============================================
   - Hits          12215     1392    -10823
   + Misses         9934      471     -9463
   + Partials       1078      130      -948
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.84% <ø> (ø)` | `374.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2842?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[.../common/table/log/block/HoodieLogBlockVersion.java](https://codecov.io/gh/apache/hudi/pull/2842/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9ibG9jay9Ib29kaWVMb2dCbG9ja1ZlcnNpb24uamF2YQ==)
 | | | |
   | 
[...in/java/org/apache/hudi/common/model/BaseFile.java](https://codecov.io/gh/apache/hudi/pull/2842/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0Jhc2VGaWxlLmphdmE=)
 | | | |
   | 
[.../org/apache/hudi/sink/utils/NonThrownExecutor.java](https://codecov.io/gh/apache/hudi/pull/2842/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL3V0aWxzL05vblRocm93bkV4ZWN1dG9yLmphdmE=)
 | | | |
   | 
[...g/apache/hudi/sink/partitioner/BucketAssigner.java](https://codecov.io/gh/apache/hudi/pull/2842/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL3BhcnRpdGlvbmVyL0J1Y2tldEFzc2lnbmVyLmphdmE=)
 | | | |
   | 
[...che/hudi/common/table/timeline/dto/InstantDTO.java](https://codecov.io/gh/apache/hudi/pull/2842/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL2R0by9JbnN0YW50RFRPLmphdmE=)
 | | | |
   | 
[.../apache/hudi/sink/compact/CompactionPlanEvent.java](https://codecov.io/gh/apache/hudi/pull/2842/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL2NvbXBhY3QvQ29tcGFjdGlvblBsYW5FdmVudC5qYXZh)
 | | | |
   | 

[jira] [Updated] (HUDI-765) Implement OrcReaderIterator

2021-04-16 Thread Teresa Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teresa Kang updated HUDI-765:
-
Status: In Progress  (was: Open)

> Implement OrcReaderIterator
> ---
>
> Key: HUDI-765
> URL: https://issues.apache.org/jira/browse/HUDI-765
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Assignee: Teresa Kang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-765) Implement OrcReaderIterator

2021-04-16 Thread Teresa Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teresa Kang updated HUDI-765:
-
Status: Patch Available  (was: In Progress)

> Implement OrcReaderIterator
> ---
>
> Key: HUDI-765
> URL: https://issues.apache.org/jira/browse/HUDI-765
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Assignee: Teresa Kang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-04-16 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r614929242



##
File path: 
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestDelete.scala
##
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi
+
+class TestDelete extends TestHoodieSqlBase {
+
+  test("Test Delete Table") {
+    withTempDir { tmp =>
+      Seq("cow", "mor").foreach { tableType =>
+        val tableName = generateTableName
+        // create table
+        spark.sql(
+          s"""
+             |create table $tableName (
+             |  id int,
+             |  name string,
+             |  price double,
+             |  ts long
+             |) using hudi
+             | location '${tmp.getCanonicalPath}/$tableName'
+             | options (
+             |  type = '$tableType',
+             |  primaryKey = 'id',
+             |  versionColumn = 'ts'
+             | )
+           """.stripMargin)
+        // insert data to table
+        spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000")
+        checkAnswer(s"select id, name, price, ts from $tableName")(
+          Seq(1, "a1", 10.0, 1000)
+        )
+
+        // delete table

Review comment:
   Fixed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on pull request #2784: [HUDI-1740] Fix insert-overwrite API archival

2021-04-16 Thread GitBox


lw309637554 commented on pull request #2784:
URL: https://github.com/apache/hudi/pull/2784#issuecomment-821230101


   > @lw309637554 I reviewed this PR but just want to get another set of eyes. 
do you mind taking a pass on this?
   okay 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on pull request #2722: [HUDI-1722]hive beeline/spark-sql query specified field on mor table occur NPE

2021-04-16 Thread GitBox


lw309637554 commented on pull request #2722:
URL: https://github.com/apache/hudi/pull/2722#issuecomment-821228266


   @xiarixiaoyao hello, I will review it this weekend. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1804) Continue to write when Flink write task restart because of container killing

2021-04-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1804:
-
Labels: pull-request-available  (was: )

> Continue to write when Flink write task restart because of container killing
> 
>
> Key: HUDI-1804
> URL: https://issues.apache.org/jira/browse/HUDI-1804
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The {{FlinkMergeHandle}} creates a marker file under the metadata path each 
> time it initializes; when a write task restarts after being killed, it tries to 
> create the already-existing file and reports an error.
> To solve this problem, skip the creation and use the original data file as 
> the base file to merge.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] danny0405 opened a new pull request #2843: [HUDI-1804] Continue to write when Flink write task restart because o…

2021-04-16 Thread GitBox


danny0405 opened a new pull request #2843:
URL: https://github.com/apache/hudi/pull/2843


   …f container killing
   
   The `FlinkMergeHandle` creates a marker file under the metadata path
   each time it initializes; when a write task restarts after being killed, it
   tries to create the already-existing file and reports an error.
   
   To solve this problem, skip the creation and use the original data file
   as the base file to merge.
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] sbernauer opened a new pull request #2842: Added metric reporter Prometheus to HoodieBackedTableMetadataWriter

2021-04-16 Thread GitBox


sbernauer opened a new pull request #2842:
URL: https://github.com/apache/hudi/pull/2842


   ## What is the purpose of the pull request
   
   Add support for metric reporter type Prometheus inside 
HoodieBackedTableMetadataWriter
   
   ## Brief change log
   
 - Added case statement for Prometheus metric reporter
  - I removed the existing TODO marker since it had no description and I 
personally can't see a TODO here. But I'm also fine to leave it here.
   
   ## Verify this pull request
   - Sadly I can't run the DeltaStreamer with metadata enabled because of the 
HBase dependency of Hudi and our Hadoop 3.3.0 containing a too-new Guava version 
(21 vs 27). I will open an issue for this. But I think the change is trivial.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] AkshayChan opened a new issue #2841: hoodie.cleaner.commits.retained max value?

2021-04-16 Thread GitBox


AkshayChan opened a new issue #2841:
URL: https://github.com/apache/hudi/issues/2841


   1. Is there a greatest value possible for the number of commits the hudi 
cleaner can retain?
   
   2. What about the greatest possible value for `hoodie.keep.min.commits` and 
`hoodie.keep.max.commits`?
   
   3. What is the difference between `hoodie.cleaner.commits.retained` and 
`hoodie.keep.min.commits`? If I understand correctly, the cleaner commits is 
the maximum number of commits we can query incrementally/point in time from, 
keep min commits is the least number of commits stored (for example in S3) and 
keep max commits is the maximum number of commits stored.
   
   4. Are any commits between the keep min commits and keep max commits 
thresholds archived as log files?
   
   5. Are any commits that exceed the max commit threshold deleted?
   
   6. Are there performance issues if I set these values very large? For 
example:
   
   `hoodie.cleaner.commits.retained: 2,
 hoodie.keep.min.commits: 20001,
 hoodie.keep.max.commits: 3`
   
   Thanks in advance


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-04-16 Thread GitBox


codecov-io edited a comment on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-792430670


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/2645?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#2645](https://codecov.io/gh/apache/hudi/pull/2645?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (0c0dbee) into 
[master](https://codecov.io/gh/apache/hudi/commit/18459d4045ec4a85081c227893b226a4d759f84b?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (18459d4) will **increase** coverage by `0.46%`.
   > The diff coverage is `54.75%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2645/graphs/tree.svg?width=650=150=pr=VTTXabwbs2_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2645?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2645      +/-   ##
   ============================================
   + Coverage     52.26%   52.73%   +0.46%
   - Complexity     3682     3822     +140
   ============================================
     Files           484      509      +25
     Lines         23094    24656    +1562
     Branches       2456     2774     +318
   ============================================
   + Hits          12070    13002     +932
   - Misses         9959    10380     +421
   - Partials       1065     1274     +209
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `40.29% <ø> (+3.35%)` | `215.00 <ø> (+20.00)` | |
   | hudiclient | `∅ <ø> (∅)` | `0.00 <ø> (ø)` | |
   | hudicommon | `50.57% <11.11%> (-0.20%)` | `1979.00 <2.00> (+3.00)` | 
:arrow_down: |
   | hudiflink | `56.51% <ø> (-0.07%)` | `516.00 <ø> (+2.00)` | :arrow_down: |
   | hudihadoopmr | `33.33% <ø> (-0.12%)` | `198.00 <ø> (+1.00)` | :arrow_down: 
|
   | hudisparkdatasource | `65.00% <56.21%> (-6.34%)` | `348.00 <109.00> 
(+111.00)` | :arrow_down: |
   | hudisync | `45.62% <0.00%> (+0.15%)` | `131.00 <1.00> (+3.00)` | |
   | huditimelineservice | `64.36% <ø> (ø)` | `62.00 <ø> (ø)` | |
   | hudiutilities | `69.79% <ø> (+0.06%)` | `373.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2645?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...rg/apache/hudi/common/table/HoodieTableConfig.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL0hvb2RpZVRhYmxlQ29uZmlnLmphdmE=)
 | `41.66% <0.00%> (-1.55%)` | `17.00 <0.00> (ø)` | |
   | 
[...pache/hudi/common/table/HoodieTableMetaClient.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL0hvb2RpZVRhYmxlTWV0YUNsaWVudC5qYXZh)
 | `63.35% <0.00%> (-3.31%)` | `43.00 <0.00> (ø)` | |
   | 
[.../apache/hudi/common/table/TableSchemaResolver.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL1RhYmxlU2NoZW1hUmVzb2x2ZXIuamF2YQ==)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...he/hudi/exception/HoodieDuplicateKeyException.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUR1cGxpY2F0ZUtleUV4Y2VwdGlvbi5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 

[GitHub] [hudi] codecov-io edited a comment on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-04-16 Thread GitBox


codecov-io edited a comment on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-792430670


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/2645?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#2645](https://codecov.io/gh/apache/hudi/pull/2645?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (0c0dbee) into 
[master](https://codecov.io/gh/apache/hudi/commit/18459d4045ec4a85081c227893b226a4d759f84b?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (18459d4) will **increase** coverage by `0.25%`.
   > The diff coverage is `54.75%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2645/graphs/tree.svg?width=650=150=pr=VTTXabwbs2_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2645?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2645      +/-   ##
   ============================================
   + Coverage     52.26%   52.51%   +0.25%
   - Complexity     3682     3760      +78
   ============================================
     Files           484      503      +19
     Lines         23094    24207    +1113
     Branches       2456     2750     +294
   ============================================
   + Hits          12070    12713     +643
   - Misses         9959    10239     +280
   - Partials       1065     1255     +190
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `40.29% <ø> (+3.35%)` | `215.00 <ø> (+20.00)` | |
   | hudiclient | `∅ <ø> (∅)` | `0.00 <ø> (ø)` | |
   | hudicommon | `50.57% <11.11%> (-0.20%)` | `1979.00 <2.00> (+3.00)` | 
:arrow_down: |
   | hudiflink | `56.51% <ø> (-0.07%)` | `516.00 <ø> (+2.00)` | :arrow_down: |
   | hudihadoopmr | `33.33% <ø> (-0.12%)` | `198.00 <ø> (+1.00)` | :arrow_down: 
|
   | hudisparkdatasource | `65.00% <56.21%> (-6.34%)` | `348.00 <109.00> 
(+111.00)` | :arrow_down: |
   | hudisync | `45.62% <0.00%> (+0.15%)` | `131.00 <1.00> (+3.00)` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.79% <ø> (+0.06%)` | `373.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2645?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...rg/apache/hudi/common/table/HoodieTableConfig.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL0hvb2RpZVRhYmxlQ29uZmlnLmphdmE=)
 | `41.66% <0.00%> (-1.55%)` | `17.00 <0.00> (ø)` | |
   | 
[...pache/hudi/common/table/HoodieTableMetaClient.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL0hvb2RpZVRhYmxlTWV0YUNsaWVudC5qYXZh)
 | `63.35% <0.00%> (-3.31%)` | `43.00 <0.00> (ø)` | |
   | 
[.../apache/hudi/common/table/TableSchemaResolver.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL1RhYmxlU2NoZW1hUmVzb2x2ZXIuamF2YQ==)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...he/hudi/exception/HoodieDuplicateKeyException.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUR1cGxpY2F0ZUtleUV4Y2VwdGlvbi5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 

[jira] [Created] (HUDI-1804) Continue to write when Flink write task restart because of container killing

2021-04-16 Thread Danny Chen (Jira)
Danny Chen created HUDI-1804:


 Summary: Continue to write when Flink write task restart because 
of container killing
 Key: HUDI-1804
 URL: https://issues.apache.org/jira/browse/HUDI-1804
 Project: Apache Hudi
  Issue Type: Bug
  Components: Flink Integration
Reporter: Danny Chen
Assignee: Danny Chen
 Fix For: 0.9.0


The {{FlinkMergeHandle}} creates a marker file under the metadata path each time 
it initializes; when a write task restarts after being killed, it tries to create the 
already-existing file and reports an error.

To solve this problem, skip the creation and use the original data file as the base 
file to merge.
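
A rough sketch of the proposed behaviour (the actual fix lives in the Java FlinkMergeHandle 
inside Hudi; the helper below is purely illustrative):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Only create the marker when it does not exist yet, so a restarted task reuses
// the original base file instead of failing on an already-existing marker.
def createMarkerIfAbsent(fs: FileSystem, marker: Path): Unit =
  if (!fs.exists(marker)) {
    fs.create(marker, false).close()
  }
```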



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614781917



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/test/scala/org/apache/spark/sql/hudi/TestDelete.scala
##
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi
+
+import org.apache.hadoop.fs.Path
+
+class TestDelete extends TestHoodieSqlBase {
+
+  test("Test Delete Table") {
+    withTempDir { tmp =>
+      val tablePath = new Path(tmp.toString, "deleteTable").toUri.toString
+      spark.sql("set spark.hoodie.shuffle.parallelism=4")
+      spark.sql(
+        s"""create table deleteTable (
+           |keyid int,
+           |name string,
+           |price double,
+           |col1 long,
+           |p string,
+           |p1 string,
+           |p2 string) using hudi
+           |partitioned by (p,p1,p2)
+           |options('hoodie.datasource.write.table.type'='MERGE_ON_READ',
+           |'hoodie.datasource.write.precombine.field'='col1',

Review comment:
   here in base sql implement, it called `versionColumn`, should align with 
this?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614780712



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/spark/sql/hudi/execution/MergeIntoHudiTable.scala
##
@@ -0,0 +1,332 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.execution
+
+import org.apache.hudi.DataSourceWriteOptions.{DEFAULT_PAYLOAD_OPT_VAL, 
PAYLOAD_CLASS_OPT_KEY, RECORDKEY_FIELD_OPT_KEY}
+import 
org.apache.hudi.common.model.{OverwriteNonDefaultsWithLatestAvroPayload, 
WriteOperationType}
+import org.apache.hudi.execution.HudiSQLUtils
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
+import org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate
+import org.apache.spark.sql.catalyst.merge._
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
BasePredicate, Expression, Literal, PredicateHelper, UnsafeProjection}
+import org.apache.spark.sql.functions.{col, lit, when}
+import org.apache.spark.sql.catalyst.merge.{HudiMergeClause, 
HudiMergeInsertClause}
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, 
LogicalPlan, Project}
+import org.apache.spark.sql.execution.command.RunnableCommand
+import org.apache.spark.sql.internal.StaticSQLConf
+import org.apache.spark.sql.types.{StringType, StructType}
+
+import scala.collection.mutable.ArrayBuffer
+
+/**
+  * merge command, execute merge operation for hudi
+  */
+case class MergeIntoHudiTable(
+target: LogicalPlan,
+source: LogicalPlan,
+joinCondition: Expression,
+matchedClauses: Seq[HudiMergeClause],
+noMatchedClause: Seq[HudiMergeInsertClause],
+finalSchema: StructType,
+trimmedSchema: StructType) extends RunnableCommand with Logging with 
PredicateHelper {
+
+  var tableMeta: Map[String, String] = null
+
+  lazy val isRecordKeyJoin = {
+val recordKeyFields = 
tableMeta.get(RECORDKEY_FIELD_OPT_KEY).get.split(",").map(_.trim).filter(!_.isEmpty)
+val intersect = joinCondition.references.intersect(target.outputSet).toSeq
+if (recordKeyFields.size == 1 && intersect.size ==1 && 
intersect(0).name.equalsIgnoreCase(recordKeyFields(0))) {
+  true
+} else {
+  false
+}
+  }
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+tableMeta = HudiSQLUtils.getHoodiePropsFromRelation(target, sparkSession)
+val enableHive = "hive" == 
sparkSession.sessionState.conf.getConf(StaticSQLConf.CATALOG_IMPLEMENTATION)

Review comment:
   Could you please put this logic into a method called `hiveCatalogImplementation`?
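
A possible shape for the suggested helper (sketch only; the method name comes from the review suggestion and the final implementation may differ):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.internal.StaticSQLConf

  // Returns true when the session catalog implementation is Hive, so the
  // check can be reused instead of repeating the string comparison.
  def isHiveCatalogImplementation(sparkSession: SparkSession): Boolean =
    "hive" == sparkSession.sessionState.conf.getConf(StaticSQLConf.CATALOG_IMPLEMENTATION)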




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614780712



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/spark/sql/hudi/execution/MergeIntoHudiTable.scala
##
@@ -0,0 +1,332 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.execution
+
+import org.apache.hudi.DataSourceWriteOptions.{DEFAULT_PAYLOAD_OPT_VAL, 
PAYLOAD_CLASS_OPT_KEY, RECORDKEY_FIELD_OPT_KEY}
+import 
org.apache.hudi.common.model.{OverwriteNonDefaultsWithLatestAvroPayload, 
WriteOperationType}
+import org.apache.hudi.execution.HudiSQLUtils
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
+import org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate
+import org.apache.spark.sql.catalyst.merge._
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
BasePredicate, Expression, Literal, PredicateHelper, UnsafeProjection}
+import org.apache.spark.sql.functions.{col, lit, when}
+import org.apache.spark.sql.catalyst.merge.{HudiMergeClause, 
HudiMergeInsertClause}
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, 
LogicalPlan, Project}
+import org.apache.spark.sql.execution.command.RunnableCommand
+import org.apache.spark.sql.internal.StaticSQLConf
+import org.apache.spark.sql.types.{StringType, StructType}
+
+import scala.collection.mutable.ArrayBuffer
+
+/**
+  * merge command, execute merge operation for hudi
+  */
+case class MergeIntoHudiTable(
+target: LogicalPlan,
+source: LogicalPlan,
+joinCondition: Expression,
+matchedClauses: Seq[HudiMergeClause],
+noMatchedClause: Seq[HudiMergeInsertClause],
+finalSchema: StructType,
+trimmedSchema: StructType) extends RunnableCommand with Logging with 
PredicateHelper {
+
+  var tableMeta: Map[String, String] = null
+
+  lazy val isRecordKeyJoin = {
+val recordKeyFields = 
tableMeta.get(RECORDKEY_FIELD_OPT_KEY).get.split(",").map(_.trim).filter(!_.isEmpty)
+val intersect = joinCondition.references.intersect(target.outputSet).toSeq
+if (recordKeyFields.size == 1 && intersect.size ==1 && 
intersect(0).name.equalsIgnoreCase(recordKeyFields(0))) {
+  true
+} else {
+  false
+}
+  }
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+tableMeta = HudiSQLUtils.getHoodiePropsFromRelation(target, sparkSession)
+val enableHive = "hive" == 
sparkSession.sessionState.conf.getConf(StaticSQLConf.CATALOG_IMPLEMENTATION)

Review comment:
   Could you please put this logic into a method called `hiveCatalogImplementation` so it can be reused?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614779858



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/spark/sql/hudi/execution/CreateHudiTableCommand.scala
##
@@ -0,0 +1,326 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.execution
+
+import java.util.Properties
+
+import org.apache.hadoop.fs.{FSDataInputStream, FSDataOutputStream, 
FileSystem, Path}
+import org.apache.hudi.DataSourceWriteOptions
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.common.model.{HoodieFileFormat, HoodieRecord}
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.config.HoodieWriteConfig.KEYGENERATOR_CLASS_PROP
+import org.apache.hudi.execution.HudiSQLUtils
+import org.apache.hudi.hadoop.HoodieParquetInputFormat
+import org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat
+import org.apache.hudi.hadoop.utils.HoodieInputFormatUtils
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.avro.SchemaConverters
+import org.apache.spark.sql.catalyst.analysis.{NoSuchDatabaseException, 
TableAlreadyExistsException}
+import org.apache.spark.sql.{Row, SparkSession}
+import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, 
CatalogTable, CatalogTableType}
+import org.apache.spark.sql.execution.command.RunnableCommand
+import org.apache.spark.sql.hive.SparkfunctionWrapper
+import org.apache.spark.sql.internal.StaticSQLConf
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+
+import scala.collection.JavaConverters._
+
+/**
+ * Command for create hoodie table
+ */
+case class CreateHudiTableCommand(table: CatalogTable, ignoreIfExists: 
Boolean) extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+assert(table.tableType != CatalogTableType.VIEW)
+assert(table.provider.isDefined)
+
+val tableName = table.identifier.unquotedString
+val sessionState = sparkSession.sessionState
+val tableIsExists = sessionState.catalog.tableExists(table.identifier)
+if (tableIsExists) {
+  if (ignoreIfExists) {
+// scalastyle:off
+return Seq.empty[Row]
+// scalastyle:on
+  } else {
+throw new IllegalArgumentException(s"Table 
${table.identifier.quotedString} already exists")
+  }
+}
+
+val enableHive = "hive" == 
sparkSession.sessionState.conf.getConf(StaticSQLConf.CATALOG_IMPLEMENTATION)
+val path = HudiSQLUtils.getTablePath(sparkSession, table)
+val conf = sparkSession.sessionState.newHadoopConf()
+val isTableExists = HudiSQLUtils.tableExists(path, conf)
+val (newSchema, tableOptions) = if (table.tableType == 
CatalogTableType.EXTERNAL && isTableExists) {
+  // if this is an external table & the table has already exists in the 
location,
+  // infer schema from the table meta
+  assert(table.schema.isEmpty, s"Should not specified table schema " +
+s"for an exists hoodie table: ${table.identifier.unquotedString}")
+  // get Schema from the external table
+  val metaClient = 
HoodieTableMetaClient.builder().setBasePath(path).setConf(conf).build()
+  val schemaResolver = new TableSchemaResolver(metaClient)
+  val avroSchema = schemaResolver.getTableAvroSchema(true)
+  val tableSchema = 
SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
+  (tableSchema, table.storage.properties ++ 
metaClient.getTableConfig.getProps.asScala)
+} else {
+  // Add the meta fields to the scheme if this is a managed table or an 
empty external table
+  val fullSchema: StructType = {
+val metaFields = HoodieRecord.HOODIE_META_COLUMNS.asScala
+val dataFields = table.schema.fields.filterNot(f => 
metaFields.contains(f.name))
+val fields = metaFields.map(StructField(_, StringType)) ++ dataFields
+StructType(fields)
+  }
+  (fullSchema, table.storage.properties)
+}
+
+// Append * to tablePath if create dataSourceV1 hudi table
+val newPath = if 

[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614779619



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/spark/sql/hudi/execution/CreateHudiTableCommand.scala
##
@@ -0,0 +1,326 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.execution
+
+import java.util.Properties
+
+import org.apache.hadoop.fs.{FSDataInputStream, FSDataOutputStream, 
FileSystem, Path}
+import org.apache.hudi.DataSourceWriteOptions
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.common.model.{HoodieFileFormat, HoodieRecord}
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.config.HoodieWriteConfig.KEYGENERATOR_CLASS_PROP
+import org.apache.hudi.execution.HudiSQLUtils
+import org.apache.hudi.hadoop.HoodieParquetInputFormat
+import org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat
+import org.apache.hudi.hadoop.utils.HoodieInputFormatUtils
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.avro.SchemaConverters
+import org.apache.spark.sql.catalyst.analysis.{NoSuchDatabaseException, 
TableAlreadyExistsException}
+import org.apache.spark.sql.{Row, SparkSession}
+import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, 
CatalogTable, CatalogTableType}
+import org.apache.spark.sql.execution.command.RunnableCommand
+import org.apache.spark.sql.hive.SparkfunctionWrapper
+import org.apache.spark.sql.internal.StaticSQLConf
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
+
+import scala.collection.JavaConverters._
+
+/**
+ * Command for create hoodie table
+ */
+case class CreateHudiTableCommand(table: CatalogTable, ignoreIfExists: 
Boolean) extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+assert(table.tableType != CatalogTableType.VIEW)
+assert(table.provider.isDefined)
+
+val tableName = table.identifier.unquotedString
+val sessionState = sparkSession.sessionState
+val tableIsExists = sessionState.catalog.tableExists(table.identifier)
+if (tableIsExists) {
+  if (ignoreIfExists) {
+// scalastyle:off
+return Seq.empty[Row]
+// scalastyle:on
+  } else {
+throw new IllegalArgumentException(s"Table 
${table.identifier.quotedString} already exists")
+  }
+}
+
+val enableHive = "hive" == 
sparkSession.sessionState.conf.getConf(StaticSQLConf.CATALOG_IMPLEMENTATION)
+val path = HudiSQLUtils.getTablePath(sparkSession, table)
+val conf = sparkSession.sessionState.newHadoopConf()
+val isTableExists = HudiSQLUtils.tableExists(path, conf)
+val (newSchema, tableOptions) = if (table.tableType == 
CatalogTableType.EXTERNAL && isTableExists) {
+  // if this is an external table & the table has already exists in the 
location,
+  // infer schema from the table meta
+  assert(table.schema.isEmpty, s"Should not specified table schema " +
+s"for an exists hoodie table: ${table.identifier.unquotedString}")
+  // get Schema from the external table
+  val metaClient = 
HoodieTableMetaClient.builder().setBasePath(path).setConf(conf).build()
+  val schemaResolver = new TableSchemaResolver(metaClient)
+  val avroSchema = schemaResolver.getTableAvroSchema(true)
+  val tableSchema = 
SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
+  (tableSchema, table.storage.properties ++ 
metaClient.getTableConfig.getProps.asScala)
+} else {
+  // Add the meta fields to the scheme if this is a managed table or an 
empty external table
+  val fullSchema: StructType = {
+val metaFields = HoodieRecord.HOODIE_META_COLUMNS.asScala
+val dataFields = table.schema.fields.filterNot(f => 
metaFields.contains(f.name))
+val fields = metaFields.map(StructField(_, StringType)) ++ dataFields
+StructType(fields)
+  }
+  (fullSchema, table.storage.properties)
+}
+
+// Append * to tablePath if create dataSourceV1 hudi table
+val newPath = if 

[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614779342



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/spark/sql/hudi/execution/CreateHudiTableAsSelectCommand.scala
##
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.execution
+
+import java.util.Properties
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hudi.DataSourceWriteOptions
+import org.apache.hudi.DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
+import org.apache.hudi.config.HoodieWriteConfig.BULKINSERT_PARALLELISM
+import org.apache.hudi.execution.HudiSQLUtils
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.execution.command.RunnableCommand
+import org.apache.spark.sql.internal.StaticSQLConf
+
+import scala.collection.JavaConverters._
+
+/**
+ * Command for ctas
+ */
+case class CreateHudiTableAsSelectCommand(
+table: CatalogTable,
+mode: SaveMode,
+query: LogicalPlan,
+outputColumnNames: Seq[String]) extends RunnableCommand {
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+assert(table.tableType != CatalogTableType.VIEW)
+assert(table.provider.isDefined)
+val sessionState = sparkSession.sessionState
+var fs: FileSystem = null
+val conf = sessionState.newHadoopConf()
+
+val db = 
table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase)
+val tableIdentWithDB = table.identifier.copy(database = Some(db))
+val tableName = tableIdentWithDB.unquotedString
+val enableHive = "hive" == 
sparkSession.sessionState.conf.getConf(StaticSQLConf.CATALOG_IMPLEMENTATION)
+val path = HudiSQLUtils.getTablePath(sparkSession, table)
+
+if (sessionState.catalog.tableExists(tableIdentWithDB)) {
+  assert(mode != SaveMode.Overwrite, s"Expect the table $tableName has 
been dropped when the save mode is OverWrite")
+  if (mode == SaveMode.ErrorIfExists) {
+throw new AnalysisException(s"Table $tableName already exists. You need 
to drop it first.")
+  }
+  if (mode == SaveMode.Ignore) {
+// scalastyle:off
+return Seq.empty
+// scalastyle:on
+  }
+  // append table
+  saveDataIntoHudiTable(sparkSession, table, path, 
table.storage.properties, enableHive)
+} else {
+  val properties = table.storage.properties
+  assert(table.schema.isEmpty)
+  sparkSession.sessionState.catalog.validateTableLocation(table)
+  // create table
+  if (!enableHive) {
+val newTable = if (!HudiSQLUtils.tableExists(path, conf)) {
+  table.copy(schema = query.schema)
+} else {
+  table
+}
+CreateHudiTableCommand(newTable, true).run(sparkSession)
+  }
+  saveDataIntoHudiTable(sparkSession, table, path, properties, enableHive, 
"bulk_insert")
+}
+
+// save necessary parameter in hoodie.properties
+val newProperties = new Properties()
+newProperties.putAll(table.storage.properties.asJava)
+// add table partition
+newProperties.put(PARTITIONPATH_FIELD_OPT_KEY, 
table.partitionColumnNames.mkString(","))
+val metaPath = new Path(path, HoodieTableMetaClient.METAFOLDER_NAME)
+val propertyPath = new Path(metaPath, 
HoodieTableConfig.HOODIE_PROPERTIES_FILE)
+
CreateHudiTableCommand.saveNeededPropertiesIntoHoodie(propertyPath.getFileSystem(conf),
 propertyPath, newProperties)
+Seq.empty[Row]
+  }
+
+  private def saveDataIntoHudiTable(
+  session: SparkSession,
+  table: CatalogTable,
+  tablePath: String,
+  tableOptions: Map[String, String],
+  enableHive: Boolean,
+  operationType: String = "upsert"): Unit = {
+val mode = if (operationType.equals("upsert")) {
+  "append"
+} else {
+  "overwrite"
+}
+val newTableOptions = if (enableHive) {
+  // add hive sync properties
+  val extraOptions = 

[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614776987



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/spark/sql/catalyst/analysis/HudiAnalysis.scala
##
@@ -0,0 +1,240 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.execution.HudiSQLUtils
+import org.apache.hudi.execution.HudiSQLUtils.HoodieV1Relation
+import org.apache.hudi.spark3.internal.HoodieDataSourceInternalTable
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, HiveTableRelation}
+import org.apache.spark.sql.{AnalysisException, SparkSession}
+import org.apache.spark.sql.catalyst.expressions.{Alias, Literal}
+import org.apache.spark.sql.catalyst.merge.HudiMergeIntoUtils
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.hudi.execution.InsertIntoHudiTable
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{StructField, StructType}
+
+import scala.collection.mutable
+
+/**
+ * Analysis rules for hudi. deal with * in merge clause and insert into clause
+ */
+class HudiAnalysis(session: SparkSession, conf: SQLConf) extends 
Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsDown {
+case dsv2 @ DataSourceV2Relation(d: HoodieDataSourceInternalTable, _, _, 
_, options) =>
+  HoodieV1Relation.fromV2Relation(dsv2, d, options)
+
+// ResolveReferences rule will resolve * in merge, but the resolve result 
is not what we expected
+// so if source is not resolved, we add a special placeholder to prevent the 
ResolveReferences rule from resolving * in mergeIntoTable
+case m @ MergeIntoTable(target, source, _, _, _) =>
+  if (source.resolved) {
+dealWithStarAction(m)
+  } else {
+placeHolderStarAction(m)
+  }
+// we should deal with Meta columns in hudi, it will be safe to deal 
insertIntoStatement here.
+case i @ InsertIntoStatement(table, _, query, overwrite, _)
+  if table.resolved && query.resolved && 
HudiSQLUtils.isHudiRelation(table) =>
+  table match {
+case relation: HiveTableRelation =>
+  val metadata = relation.tableMeta
+  preprocess(i, metadata.identifier.quotedString, 
metadata.partitionSchema, Some(metadata), query, overwrite)
+case dsv2 @ DataSourceV2Relation(d: HoodieDataSourceInternalTable, _, 
_, _, _) =>
+  preprocessV2(i, d, query)
+case l: LogicalRelation =>
+  val catalogTable = l.catalogTable
+  val tblName = 
catalogTable.map(_.identifier.quotedString).getOrElse("unknown")
+  preprocess(i, tblName, catalogTable.get.partitionSchema, 
catalogTable, query, overwrite)
+case _ => i
+  }
+  }
+
+  /**
+   * do align hoodie metacols, hoodie tablehave metaCols which are hidden by 
default, we try our best to fill those cols auto

Review comment:
   table has 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614778297



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/spark/sql/hudi/execution/CreateHudiTableAsSelectCommand.scala
##
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.execution
+
+import java.util.Properties
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hudi.DataSourceWriteOptions
+import org.apache.hudi.DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
+import org.apache.hudi.config.HoodieWriteConfig.BULKINSERT_PARALLELISM
+import org.apache.hudi.execution.HudiSQLUtils
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.execution.command.RunnableCommand
+import org.apache.spark.sql.internal.StaticSQLConf
+
+import scala.collection.JavaConverters._
+
+/**
+ * Command for ctas
+ */
+case class CreateHudiTableAsSelectCommand(
+table: CatalogTable,
+mode: SaveMode,
+query: LogicalPlan,
+outputColumnNames: Seq[String]) extends RunnableCommand {
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+assert(table.tableType != CatalogTableType.VIEW)
+assert(table.provider.isDefined)
+val sessionState = sparkSession.sessionState
+var fs: FileSystem = null
+val conf = sessionState.newHadoopConf()
+
+val db = 
table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase)
+val tableIdentWithDB = table.identifier.copy(database = Some(db))
+val tableName = tableIdentWithDB.unquotedString
+val enableHive = "hive" == 
sparkSession.sessionState.conf.getConf(StaticSQLConf.CATALOG_IMPLEMENTATION)
+val path = HudiSQLUtils.getTablePath(sparkSession, table)
+
+if (sessionState.catalog.tableExists(tableIdentWithDB)) {
+  assert(mode != SaveMode.Overwrite, s"Expect the table $tableName has 
been dropped when the save mode is OverWrite")
+  if (mode == SaveMode.ErrorIfExists) {
+throw new AnalysisException(s"Table $tableName already exists. You need 
to drop it first.")
+  }
+  if (mode == SaveMode.Ignore) {
+// scalastyle:off
+return Seq.empty
+// scalastyle:on
+  }
+  // append table
+  saveDataIntoHudiTable(sparkSession, table, path, 
table.storage.properties, enableHive)
+} else {
+  val properties = table.storage.properties
+  assert(table.schema.isEmpty)
+  sparkSession.sessionState.catalog.validateTableLocation(table)
+  // create table
+  if (!enableHive) {

Review comment:
   ditto




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614777699



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/spark/sql/catalyst/analysis/HudiOperationsCheck.scala
##
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.hudi.execution.HudiSQLUtils
+import org.apache.spark.sql.{AnalysisException, SparkSession}
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.expressions.{Expression, InSubquery, 
Literal, Not}
+import org.apache.spark.sql.catalyst.merge.HudiMergeIntoTable
+import org.apache.spark.sql.catalyst.plans.logical.{DeleteFromTable, 
LogicalPlan}
+import org.apache.spark.sql.execution.command._
+
+/**
+  * check rule which can intercept unSupportOperation for hudi
+  */
+class HudiOperationsCheck(spark: SparkSession) extends (LogicalPlan => Unit) {
+  val catalog = spark.sessionState.catalog
+
+  override def apply(plan: LogicalPlan): Unit = {
+plan foreach {
+  case AlterTableAddPartitionCommand(tableName, _, _) if 
(checkHudiTable(tableName)) =>
+throw new AnalysisException("AlterTableAddPartitionCommand are not 
currently supported in HUDI")
+
+  case a: AlterTableDropPartitionCommand if (checkHudiTable(a.tableName)) 
=>
+throw new AnalysisException("AlterTableDropPartitionCommand are not 
currently supported in HUDI")
+
+  case AlterTableChangeColumnCommand(tableName, _, _) if 
(checkHudiTable(tableName)) =>
+throw new AnalysisException("AlterTableChangeColumnCommand are not 
currently supported in HUDI")
+
+  case AlterTableRenamePartitionCommand(tableName, _, _) if 
(checkHudiTable(tableName)) =>
+throw new AnalysisException("AlterTableRenamePartitionCommand are not 
currently supported in HUDI")
+
+  case AlterTableRecoverPartitionsCommand(tableName, _) if 
(checkHudiTable(tableName)) =>
+throw new AnalysisException("AlterTableRecoverPartitionsCommand are 
not currently supported in HUDI")
+
+  case AlterTableSetLocationCommand(tableName, _, _) if 
(checkHudiTable(tableName)) =>
+throw new AnalysisException("AlterTableSetLocationCommand are not 
currently supported in HUDI")
+
+  case DeleteFromTable(target, Some(condition)) if 
(hasNullAwarePredicateWithNot(condition) && checkHudiTable(target)) =>
+throw new AnalysisException("Null-aware predicate sub-queries are 
not currently supported for DELETE")
+
+  case HudiMergeIntoTable(target, _, _, _, noMatchedClauses, _) =>
+noMatchedClauses.map { m =>
+  m.resolvedActions.foreach { action =>
+if (action.targetColNameParts.size > 1) {
+  throw new AnalysisException(s"cannot insert nested values: 
${action.targetColNameParts.mkString("<") }")

Review comment:
   Would you please describe why nested values cannot be inserted?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614772168



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/hudi/execution/HudiSQLUtils.scala
##
@@ -0,0 +1,553 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution
+
+import java.io.File
+import java.util
+import java.util.{Locale, Properties}
+
+import com.google.common.cache.{CacheBuilder, CacheLoader}
+import org.apache.avro.generic.GenericRecord
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hudi.DataSourceReadOptions.{QUERY_TYPE_OPT_KEY, 
QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, QUERY_TYPE_SNAPSHOT_OPT_VAL}
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.HoodieSparkSqlWriter.{TableInstantInfo, toProperties}
+import org.apache.hudi.{AvroConversionUtils, DataSourceReadOptions, 
DataSourceUtils, DataSourceWriteOptions, DefaultSource, 
HoodieBootstrapPartition, HoodieBootstrapRelation, HoodieSparkSqlWriter, 
HoodieSparkUtils, HoodieWriterUtils, MergeOnReadSnapshotRelation}
+import 
org.apache.hudi.common.model.{OverwriteNonDefaultsWithLatestAvroPayload, 
OverwriteWithLatestAvroPayload}
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.client.{HoodieWriteResult, SparkRDDWriteClient}
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model._
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline
+import org.apache.hudi.config.HoodieWriteConfig._
+import org.apache.hudi.exception.{HoodieException, HoodieIOException}
+import org.apache.hudi.hive.{HiveSyncConfig, HiveSyncTool}
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions
+import org.apache.hudi.payload.AWSDmsAvroPayload
+import org.apache.hudi.spark3.internal.HoodieDataSourceInternalTable
+import org.apache.log4j.LogManager
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType, 
HiveTableRelation}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.sources.BaseRelation
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ListBuffer
+
+
+
+/**
+  * hudi IUD utils
+  */
+object HudiSQLUtils {
+
+  val AWSDMSAVROPAYLOAD = classOf[AWSDmsAvroPayload].getName
+  val OVERWRITEWITHLATESTVROAYLOAD = 
classOf[OverwriteWithLatestAvroPayload].getName
+  val OVERWRITENONDEFAULTSWITHLATESTAVROPAYLOAD = 
classOf[OverwriteNonDefaultsWithLatestAvroPayload].getName
+  val MERGE_MARKER = "_hoodie_merge_marker"
+
+  private val log = LogManager.getLogger(getClass)
+
+  private val tableConfigCache = CacheBuilder
+.newBuilder()
+.maximumSize(1000)
+.build(new CacheLoader[String, Properties] {
+  override def load(k: String): Properties = {
+try {
+  
HoodieTableMetaClient.builder().setConf(SparkSession.active.sparkContext.hadoopConfiguration)
+.setBasePath(k).build().getTableConfig.getProperties
+} catch {
+  // we catch expected error here
+  case e: HoodieIOException =>
+log.error(e.getMessage)
+new Properties()
+  case t: Throwable =>
+throw t
+}
+  }
+})
+
+  def getPropertiesFromTableConfigCache(path: String): Properties = {
+if (path.isEmpty) {
+  throw new HoodieIOException("unexpected empty hoodie table basePath")
+}
+tableConfigCache.get(path)
+  }
+
+  private def matchHoodieRelation(relation: 

[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614776570



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/hudi/execution/HudiSQLUtils.scala
##
@@ -0,0 +1,553 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution
+
+import java.io.File
+import java.util
+import java.util.{Locale, Properties}
+
+import com.google.common.cache.{CacheBuilder, CacheLoader}
+import org.apache.avro.generic.GenericRecord
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hudi.DataSourceReadOptions.{QUERY_TYPE_OPT_KEY, 
QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, QUERY_TYPE_SNAPSHOT_OPT_VAL}
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.HoodieSparkSqlWriter.{TableInstantInfo, toProperties}
+import org.apache.hudi.{AvroConversionUtils, DataSourceReadOptions, 
DataSourceUtils, DataSourceWriteOptions, DefaultSource, 
HoodieBootstrapPartition, HoodieBootstrapRelation, HoodieSparkSqlWriter, 
HoodieSparkUtils, HoodieWriterUtils, MergeOnReadSnapshotRelation}
+import 
org.apache.hudi.common.model.{OverwriteNonDefaultsWithLatestAvroPayload, 
OverwriteWithLatestAvroPayload}
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.client.{HoodieWriteResult, SparkRDDWriteClient}
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model._
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline
+import org.apache.hudi.config.HoodieWriteConfig._
+import org.apache.hudi.exception.{HoodieException, HoodieIOException}
+import org.apache.hudi.hive.{HiveSyncConfig, HiveSyncTool}
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions
+import org.apache.hudi.payload.AWSDmsAvroPayload
+import org.apache.hudi.spark3.internal.HoodieDataSourceInternalTable
+import org.apache.log4j.LogManager
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType, 
HiveTableRelation}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.sources.BaseRelation
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ListBuffer
+
+
+
+/**
+  * hudi IUD utils
+  */
+object HudiSQLUtils {
+
+  val AWSDMSAVROPAYLOAD = classOf[AWSDmsAvroPayload].getName
+  val OVERWRITEWITHLATESTVROAYLOAD = 
classOf[OverwriteWithLatestAvroPayload].getName
+  val OVERWRITENONDEFAULTSWITHLATESTAVROPAYLOAD = 
classOf[OverwriteNonDefaultsWithLatestAvroPayload].getName
+  val MERGE_MARKER = "_hoodie_merge_marker"
+
+  private val log = LogManager.getLogger(getClass)
+
+  private val tableConfigCache = CacheBuilder
+.newBuilder()
+.maximumSize(1000)
+.build(new CacheLoader[String, Properties] {
+  override def load(k: String): Properties = {
+try {
+  
HoodieTableMetaClient.builder().setConf(SparkSession.active.sparkContext.hadoopConfiguration)
+.setBasePath(k).build().getTableConfig.getProperties
+} catch {
+  // we catch expected error here
+  case e: HoodieIOException =>
+log.error(e.getMessage)
+new Properties()
+  case t: Throwable =>
+throw t
+}
+  }
+})
+
+  def getPropertiesFromTableConfigCache(path: String): Properties = {
+if (path.isEmpty) {
+  throw new HoodieIOException("unexpected empty hoodie table basePath")
+}
+tableConfigCache.get(path)
+  }
+
+  private def matchHoodieRelation(relation: 

[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614773873



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/hudi/execution/HudiSQLUtils.scala
##
@@ -0,0 +1,553 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution
+
+import java.io.File
+import java.util
+import java.util.{Locale, Properties}
+
+import com.google.common.cache.{CacheBuilder, CacheLoader}
+import org.apache.avro.generic.GenericRecord
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hudi.DataSourceReadOptions.{QUERY_TYPE_OPT_KEY, 
QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, QUERY_TYPE_SNAPSHOT_OPT_VAL}
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.HoodieSparkSqlWriter.{TableInstantInfo, toProperties}
+import org.apache.hudi.{AvroConversionUtils, DataSourceReadOptions, 
DataSourceUtils, DataSourceWriteOptions, DefaultSource, 
HoodieBootstrapPartition, HoodieBootstrapRelation, HoodieSparkSqlWriter, 
HoodieSparkUtils, HoodieWriterUtils, MergeOnReadSnapshotRelation}
+import 
org.apache.hudi.common.model.{OverwriteNonDefaultsWithLatestAvroPayload, 
OverwriteWithLatestAvroPayload}
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.client.{HoodieWriteResult, SparkRDDWriteClient}
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model._
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline
+import org.apache.hudi.config.HoodieWriteConfig._
+import org.apache.hudi.exception.{HoodieException, HoodieIOException}
+import org.apache.hudi.hive.{HiveSyncConfig, HiveSyncTool}
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions
+import org.apache.hudi.payload.AWSDmsAvroPayload
+import org.apache.hudi.spark3.internal.HoodieDataSourceInternalTable
+import org.apache.log4j.LogManager
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType, 
HiveTableRelation}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.sources.BaseRelation
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ListBuffer
+
+
+
+/**
+  * hudi IUD utils
+  */
+object HudiSQLUtils {
+
+  val AWSDMSAVROPAYLOAD = classOf[AWSDmsAvroPayload].getName
+  val OVERWRITEWITHLATESTVROAYLOAD = 
classOf[OverwriteWithLatestAvroPayload].getName
+  val OVERWRITENONDEFAULTSWITHLATESTAVROPAYLOAD = 
classOf[OverwriteNonDefaultsWithLatestAvroPayload].getName
+  val MERGE_MARKER = "_hoodie_merge_marker"
+
+  private val log = LogManager.getLogger(getClass)
+
+  private val tableConfigCache = CacheBuilder
+.newBuilder()
+.maximumSize(1000)
+.build(new CacheLoader[String, Properties] {
+  override def load(k: String): Properties = {
+try {
+  
HoodieTableMetaClient.builder().setConf(SparkSession.active.sparkContext.hadoopConfiguration)
+.setBasePath(k).build().getTableConfig.getProperties
+} catch {
+  // we catch expected error here
+  case e: HoodieIOException =>
+log.error(e.getMessage)
+new Properties()
+  case t: Throwable =>
+throw t
+}
+  }
+})
+
+  def getPropertiesFromTableConfigCache(path: String): Properties = {
+if (path.isEmpty) {
+  throw new HoodieIOException("unexpected empty hoodie table basePath")
+}
+tableConfigCache.get(path)
+  }
+
+  private def matchHoodieRelation(relation: 

[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614772279



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/hudi/execution/HudiSQLUtils.scala
##
@@ -0,0 +1,553 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution
+
+import java.io.File
+import java.util
+import java.util.{Locale, Properties}
+
+import com.google.common.cache.{CacheBuilder, CacheLoader}
+import org.apache.avro.generic.GenericRecord
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hudi.DataSourceReadOptions.{QUERY_TYPE_OPT_KEY, 
QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, QUERY_TYPE_SNAPSHOT_OPT_VAL}
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.HoodieSparkSqlWriter.{TableInstantInfo, toProperties}
+import org.apache.hudi.{AvroConversionUtils, DataSourceReadOptions, 
DataSourceUtils, DataSourceWriteOptions, DefaultSource, 
HoodieBootstrapPartition, HoodieBootstrapRelation, HoodieSparkSqlWriter, 
HoodieSparkUtils, HoodieWriterUtils, MergeOnReadSnapshotRelation}
+import 
org.apache.hudi.common.model.{OverwriteNonDefaultsWithLatestAvroPayload, 
OverwriteWithLatestAvroPayload}
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.client.{HoodieWriteResult, SparkRDDWriteClient}
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model._
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline
+import org.apache.hudi.config.HoodieWriteConfig._
+import org.apache.hudi.exception.{HoodieException, HoodieIOException}
+import org.apache.hudi.hive.{HiveSyncConfig, HiveSyncTool}
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions
+import org.apache.hudi.payload.AWSDmsAvroPayload
+import org.apache.hudi.spark3.internal.HoodieDataSourceInternalTable
+import org.apache.log4j.LogManager
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType, 
HiveTableRelation}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.sources.BaseRelation
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ListBuffer
+
+
+
+/**
+  * hudi IUD utils
+  */
+object HudiSQLUtils {
+
+  val AWSDMSAVROPAYLOAD = classOf[AWSDmsAvroPayload].getName
+  val OVERWRITEWITHLATESTVROAYLOAD = 
classOf[OverwriteWithLatestAvroPayload].getName
+  val OVERWRITENONDEFAULTSWITHLATESTAVROPAYLOAD = 
classOf[OverwriteNonDefaultsWithLatestAvroPayload].getName
+  val MERGE_MARKER = "_hoodie_merge_marker"
+
+  private val log = LogManager.getLogger(getClass)
+
+  private val tableConfigCache = CacheBuilder
+.newBuilder()
+.maximumSize(1000)
+.build(new CacheLoader[String, Properties] {
+  override def load(k: String): Properties = {
+try {
+  
HoodieTableMetaClient.builder().setConf(SparkSession.active.sparkContext.hadoopConfiguration)
+.setBasePath(k).build().getTableConfig.getProperties
+} catch {
+  // we catch expected error here
+  case e: HoodieIOException =>
+log.error(e.getMessage)
+new Properties()
+  case t: Throwable =>
+throw t
+}
+  }
+})
+
+  def getPropertiesFromTableConfigCache(path: String): Properties = {
+if (path.isEmpty) {
+  throw new HoodieIOException("unexpected empty hoodie table basePath")
+}
+tableConfigCache.get(path)
+  }
+
+  private def matchHoodieRelation(relation: 

[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614771176



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/hudi/execution/HudiSQLUtils.scala
##
@@ -0,0 +1,553 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution
+
+import java.io.File
+import java.util
+import java.util.{Locale, Properties}
+
+import com.google.common.cache.{CacheBuilder, CacheLoader}
+import org.apache.avro.generic.GenericRecord
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hudi.DataSourceReadOptions.{QUERY_TYPE_OPT_KEY, 
QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, QUERY_TYPE_SNAPSHOT_OPT_VAL}
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.HoodieSparkSqlWriter.{TableInstantInfo, toProperties}
+import org.apache.hudi.{AvroConversionUtils, DataSourceReadOptions, 
DataSourceUtils, DataSourceWriteOptions, DefaultSource, 
HoodieBootstrapPartition, HoodieBootstrapRelation, HoodieSparkSqlWriter, 
HoodieSparkUtils, HoodieWriterUtils, MergeOnReadSnapshotRelation}
+import 
org.apache.hudi.common.model.{OverwriteNonDefaultsWithLatestAvroPayload, 
OverwriteWithLatestAvroPayload}
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.client.{HoodieWriteResult, SparkRDDWriteClient}
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model._
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline
+import org.apache.hudi.config.HoodieWriteConfig._
+import org.apache.hudi.exception.{HoodieException, HoodieIOException}
+import org.apache.hudi.hive.{HiveSyncConfig, HiveSyncTool}
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions
+import org.apache.hudi.payload.AWSDmsAvroPayload
+import org.apache.hudi.spark3.internal.HoodieDataSourceInternalTable
+import org.apache.log4j.LogManager
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType, 
HiveTableRelation}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.sources.BaseRelation
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ListBuffer
+
+
+
+/**
+  * hudi IUD utils
+  */
+object HudiSQLUtils {
+
+  val AWSDMSAVROPAYLOAD = classOf[AWSDmsAvroPayload].getName
+  val OVERWRITEWITHLATESTVROAYLOAD = 
classOf[OverwriteWithLatestAvroPayload].getName
+  val OVERWRITENONDEFAULTSWITHLATESTAVROPAYLOAD = 
classOf[OverwriteNonDefaultsWithLatestAvroPayload].getName
+  val MERGE_MARKER = "_hoodie_merge_marker"
+
+  private val log = LogManager.getLogger(getClass)
+
+  private val tableConfigCache = CacheBuilder
+.newBuilder()
+.maximumSize(1000)
+.build(new CacheLoader[String, Properties] {
+  override def load(k: String): Properties = {
+try {
+  
HoodieTableMetaClient.builder().setConf(SparkSession.active.sparkContext.hadoopConfiguration)
+.setBasePath(k).build().getTableConfig.getProperties
+} catch {
+  // we catch expected error here
+  case e: HoodieIOException =>
+log.error(e.getMessage)
+new Properties()
+  case t: Throwable =>
+throw t
+}
+  }
+})
+
+  def getPropertiesFromTableConfigCache(path: String): Properties = {
+if (path.isEmpty) {
+  throw new HoodieIOException("unexpected empty hoodie table basePath")
+}
+tableConfigCache.get(path)
+  }
+
+  private def matchHoodieRelation(relation: 

[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614774065



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/hudi/execution/HudiSQLUtils.scala
##
@@ -0,0 +1,553 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution
+
+import java.io.File
+import java.util
+import java.util.{Locale, Properties}
+
+import com.google.common.cache.{CacheBuilder, CacheLoader}
+import org.apache.avro.generic.GenericRecord
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hudi.DataSourceReadOptions.{QUERY_TYPE_OPT_KEY, 
QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, QUERY_TYPE_SNAPSHOT_OPT_VAL}
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.HoodieSparkSqlWriter.{TableInstantInfo, toProperties}
+import org.apache.hudi.{AvroConversionUtils, DataSourceReadOptions, 
DataSourceUtils, DataSourceWriteOptions, DefaultSource, 
HoodieBootstrapPartition, HoodieBootstrapRelation, HoodieSparkSqlWriter, 
HoodieSparkUtils, HoodieWriterUtils, MergeOnReadSnapshotRelation}
+import 
org.apache.hudi.common.model.{OverwriteNonDefaultsWithLatestAvroPayload, 
OverwriteWithLatestAvroPayload}
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.client.{HoodieWriteResult, SparkRDDWriteClient}
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model._
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline
+import org.apache.hudi.config.HoodieWriteConfig._
+import org.apache.hudi.exception.{HoodieException, HoodieIOException}
+import org.apache.hudi.hive.{HiveSyncConfig, HiveSyncTool}
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions
+import org.apache.hudi.payload.AWSDmsAvroPayload
+import org.apache.hudi.spark3.internal.HoodieDataSourceInternalTable
+import org.apache.log4j.LogManager
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType, 
HiveTableRelation}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.sources.BaseRelation
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ListBuffer
+
+
+
+/**
+  * hudi IUD utils
+  */
+object HudiSQLUtils {
+
+  val AWSDMSAVROPAYLOAD = classOf[AWSDmsAvroPayload].getName
+  val OVERWRITEWITHLATESTVROAYLOAD = 
classOf[OverwriteWithLatestAvroPayload].getName
+  val OVERWRITENONDEFAULTSWITHLATESTAVROPAYLOAD = 
classOf[OverwriteNonDefaultsWithLatestAvroPayload].getName
+  val MERGE_MARKER = "_hoodie_merge_marker"
+
+  private val log = LogManager.getLogger(getClass)
+
+  private val tableConfigCache = CacheBuilder
+.newBuilder()
+.maximumSize(1000)
+.build(new CacheLoader[String, Properties] {
+  override def load(k: String): Properties = {
+try {
+  
HoodieTableMetaClient.builder().setConf(SparkSession.active.sparkContext.hadoopConfiguration)
+.setBasePath(k).build().getTableConfig.getProperties
+} catch {
+  // we catch expected error here
+  case e: HoodieIOException =>
+log.error(e.getMessage)
+new Properties()
+  case t: Throwable =>
+throw t
+}
+  }
+})
+
+  def getPropertiesFromTableConfigCache(path: String): Properties = {
+if (path.isEmpty) {
+  throw new HoodieIOException("unexpected empty hoodie table basePath")
+}
+tableConfigCache.get(path)
+  }
+
+  private def matchHoodieRelation(relation: 

[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614768558



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkUpsertCommitActionExecutor.java
##
@@ -44,6 +44,7 @@ public 
SparkUpsertCommitActionExecutor(HoodieSparkEngineContext context,
   @Override
   public HoodieWriteMetadata> execute() {
 return SparkWriteHelper.newInstance().write(instantTime, inputRecordsRDD, 
context, table,
-config.shouldCombineBeforeUpsert(), 
config.getUpsertShuffleParallelism(), this, true);
+config.shouldCombineBeforeUpsert(), 
config.getUpsertShuffleParallelism(),
+this, 
Boolean.parseBoolean(config.getProps().getProperty("hoodie.tagging.before.insert",
 "true")));

Review comment:
   would you please move `hoodie.tagging.before.insert` and its default value to HoodieWriteConfig?
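
   For illustration, a minimal sketch of what this suggestion could look like (written in Scala for brevity here; the real HoodieWriteConfig is a Java class, and the names below are hypothetical, not Hudi's actual API):

```scala
// Hypothetical sketch only: keep the key and its default in one place and
// parse the boolean once, instead of calling getProperty at every call site.
import org.apache.hudi.common.config.TypedProperties

object TaggingBeforeInsert {
  val Key: String = "hoodie.tagging.before.insert"
  val Default: String = "true"

  def enabled(props: TypedProperties): Boolean =
    props.getProperty(Key, Default).toBoolean
}
```

   The executor call sites above could then read the flag through a single helper such as `TaggingBeforeInsert.enabled(config.getProps)` rather than parsing the raw property string inline.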




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614770493



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/hudi/execution/HudiSQLUtils.scala
##
@@ -0,0 +1,553 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution
+
+import java.io.File
+import java.util
+import java.util.{Locale, Properties}
+
+import com.google.common.cache.{CacheBuilder, CacheLoader}
+import org.apache.avro.generic.GenericRecord
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hudi.DataSourceReadOptions.{QUERY_TYPE_OPT_KEY, 
QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, QUERY_TYPE_SNAPSHOT_OPT_VAL}
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.HoodieSparkSqlWriter.{TableInstantInfo, toProperties}
+import org.apache.hudi.{AvroConversionUtils, DataSourceReadOptions, 
DataSourceUtils, DataSourceWriteOptions, DefaultSource, 
HoodieBootstrapPartition, HoodieBootstrapRelation, HoodieSparkSqlWriter, 
HoodieSparkUtils, HoodieWriterUtils, MergeOnReadSnapshotRelation}
+import 
org.apache.hudi.common.model.{OverwriteNonDefaultsWithLatestAvroPayload, 
OverwriteWithLatestAvroPayload}
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.client.{HoodieWriteResult, SparkRDDWriteClient}
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model._
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline
+import org.apache.hudi.config.HoodieWriteConfig._
+import org.apache.hudi.exception.{HoodieException, HoodieIOException}
+import org.apache.hudi.hive.{HiveSyncConfig, HiveSyncTool}
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions
+import org.apache.hudi.payload.AWSDmsAvroPayload
+import org.apache.hudi.spark3.internal.HoodieDataSourceInternalTable
+import org.apache.log4j.LogManager
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType, 
HiveTableRelation}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.sources.BaseRelation
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ListBuffer
+
+
+
+/**
+  * hudi IUD utils
+  */
+object HudiSQLUtils {
+
+  val AWSDMSAVROPAYLOAD = classOf[AWSDmsAvroPayload].getName
+  val OVERWRITEWITHLATESTVROAYLOAD = 
classOf[OverwriteWithLatestAvroPayload].getName
+  val OVERWRITENONDEFAULTSWITHLATESTAVROPAYLOAD = 
classOf[OverwriteNonDefaultsWithLatestAvroPayload].getName
+  val MERGE_MARKER = "_hoodie_merge_marker"
+
+  private val log = LogManager.getLogger(getClass)
+
+  private val tableConfigCache = CacheBuilder
+.newBuilder()
+.maximumSize(1000)
+.build(new CacheLoader[String, Properties] {
+  override def load(k: String): Properties = {
+try {
+  
HoodieTableMetaClient.builder().setConf(SparkSession.active.sparkContext.hadoopConfiguration)
+.setBasePath(k).build().getTableConfig.getProperties
+} catch {
+  // we catch expected error here
+  case e: HoodieIOException =>
+log.error(e.getMessage)

Review comment:
   log.warn?
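
   For what it's worth, a small sketch of the direction this comment points at (the `loadTableConfig` function is a placeholder, not a Hudi API): since the fallback to an empty `Properties` treats the failure as an expected condition, WARN seems a better fit than ERROR.

```scala
// Sketch only: log the expected failure at WARN and fall back to empty Properties.
// loadTableConfig stands in for the HoodieTableMetaClient lookup in the diff above.
import java.util.Properties
import org.apache.hudi.exception.HoodieIOException
import org.apache.log4j.LogManager

object TableConfigFallback {
  private val log = LogManager.getLogger(getClass)

  def loadOrEmpty(basePath: String, loadTableConfig: String => Properties): Properties =
    try {
      loadTableConfig(basePath)
    } catch {
      case e: HoodieIOException =>
        log.warn(s"Could not read hoodie table config under $basePath: ${e.getMessage}")
        new Properties()
    }
}
```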




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614771646



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/hudi/execution/HudiSQLUtils.scala
##
@@ -0,0 +1,553 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution
+
+import java.io.File
+import java.util
+import java.util.{Locale, Properties}
+
+import com.google.common.cache.{CacheBuilder, CacheLoader}
+import org.apache.avro.generic.GenericRecord
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hudi.DataSourceReadOptions.{QUERY_TYPE_OPT_KEY, 
QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, QUERY_TYPE_SNAPSHOT_OPT_VAL}
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.HoodieSparkSqlWriter.{TableInstantInfo, toProperties}
+import org.apache.hudi.{AvroConversionUtils, DataSourceReadOptions, 
DataSourceUtils, DataSourceWriteOptions, DefaultSource, 
HoodieBootstrapPartition, HoodieBootstrapRelation, HoodieSparkSqlWriter, 
HoodieSparkUtils, HoodieWriterUtils, MergeOnReadSnapshotRelation}
+import 
org.apache.hudi.common.model.{OverwriteNonDefaultsWithLatestAvroPayload, 
OverwriteWithLatestAvroPayload}
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.client.{HoodieWriteResult, SparkRDDWriteClient}
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model._
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline
+import org.apache.hudi.config.HoodieWriteConfig._
+import org.apache.hudi.exception.{HoodieException, HoodieIOException}
+import org.apache.hudi.hive.{HiveSyncConfig, HiveSyncTool}
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions
+import org.apache.hudi.payload.AWSDmsAvroPayload
+import org.apache.hudi.spark3.internal.HoodieDataSourceInternalTable
+import org.apache.log4j.LogManager
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType, 
HiveTableRelation}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.sources.BaseRelation
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ListBuffer
+
+
+
+/**
+  * hudi IUD utils
+  */
+object HudiSQLUtils {
+
+  val AWSDMSAVROPAYLOAD = classOf[AWSDmsAvroPayload].getName
+  val OVERWRITEWITHLATESTVROAYLOAD = 
classOf[OverwriteWithLatestAvroPayload].getName
+  val OVERWRITENONDEFAULTSWITHLATESTAVROPAYLOAD = 
classOf[OverwriteNonDefaultsWithLatestAvroPayload].getName
+  val MERGE_MARKER = "_hoodie_merge_marker"
+
+  private val log = LogManager.getLogger(getClass)
+
+  private val tableConfigCache = CacheBuilder
+.newBuilder()
+.maximumSize(1000)
+.build(new CacheLoader[String, Properties] {
+  override def load(k: String): Properties = {
+try {
+  
HoodieTableMetaClient.builder().setConf(SparkSession.active.sparkContext.hadoopConfiguration)
+.setBasePath(k).build().getTableConfig.getProperties
+} catch {
+  // we catch expected error here
+  case e: HoodieIOException =>
+log.error(e.getMessage)
+new Properties()
+  case t: Throwable =>
+throw t
+}
+  }
+})
+
+  def getPropertiesFromTableConfigCache(path: String): Properties = {
+if (path.isEmpty) {
+  throw new HoodieIOException("unexpected empty hoodie table basePath")
+}
+tableConfigCache.get(path)
+  }
+
+  private def matchHoodieRelation(relation: 

[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614768558



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkUpsertCommitActionExecutor.java
##
@@ -44,6 +44,7 @@ public 
SparkUpsertCommitActionExecutor(HoodieSparkEngineContext context,
   @Override
   public HoodieWriteMetadata> execute() {
 return SparkWriteHelper.newInstance().write(instantTime, inputRecordsRDD, 
context, table,
-config.shouldCombineBeforeUpsert(), 
config.getUpsertShuffleParallelism(), this, true);
+config.shouldCombineBeforeUpsert(), 
config.getUpsertShuffleParallelism(),
+this, 
Boolean.parseBoolean(config.getProps().getProperty("hoodie.tagging.before.insert",
 "true")));

Review comment:
   would you move `hoodie.tagging.before.insert` and its default value to HoodieWriteConfig?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] sbernauer opened a new pull request #2840: Fixed typo in documentation for configurations

2021-04-16 Thread GitBox


sbernauer opened a new pull request #2840:
URL: https://github.com/apache/hudi/pull/2840


   ## What is the purpose of the pull request
   
   Fixed typo in documentation for configurations
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614772540



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/hudi/execution/HudiSQLUtils.scala
##
@@ -0,0 +1,553 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution
+
+import java.io.File
+import java.util
+import java.util.{Locale, Properties}
+
+import com.google.common.cache.{CacheBuilder, CacheLoader}
+import org.apache.avro.generic.GenericRecord
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hudi.DataSourceReadOptions.{QUERY_TYPE_OPT_KEY, 
QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, QUERY_TYPE_SNAPSHOT_OPT_VAL}
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.HoodieSparkSqlWriter.{TableInstantInfo, toProperties}
+import org.apache.hudi.{AvroConversionUtils, DataSourceReadOptions, 
DataSourceUtils, DataSourceWriteOptions, DefaultSource, 
HoodieBootstrapPartition, HoodieBootstrapRelation, HoodieSparkSqlWriter, 
HoodieSparkUtils, HoodieWriterUtils, MergeOnReadSnapshotRelation}
+import 
org.apache.hudi.common.model.{OverwriteNonDefaultsWithLatestAvroPayload, 
OverwriteWithLatestAvroPayload}
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.client.{HoodieWriteResult, SparkRDDWriteClient}
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model._
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline
+import org.apache.hudi.config.HoodieWriteConfig._
+import org.apache.hudi.exception.{HoodieException, HoodieIOException}
+import org.apache.hudi.hive.{HiveSyncConfig, HiveSyncTool}
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions
+import org.apache.hudi.payload.AWSDmsAvroPayload
+import org.apache.hudi.spark3.internal.HoodieDataSourceInternalTable
+import org.apache.log4j.LogManager
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType, 
HiveTableRelation}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.sources.BaseRelation
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ListBuffer
+
+
+
+/**
+  * hudi IUD utils
+  */
+object HudiSQLUtils {
+
+  val AWSDMSAVROPAYLOAD = classOf[AWSDmsAvroPayload].getName
+  val OVERWRITEWITHLATESTVROAYLOAD = 
classOf[OverwriteWithLatestAvroPayload].getName
+  val OVERWRITENONDEFAULTSWITHLATESTAVROPAYLOAD = 
classOf[OverwriteNonDefaultsWithLatestAvroPayload].getName
+  val MERGE_MARKER = "_hoodie_merge_marker"
+
+  private val log = LogManager.getLogger(getClass)
+
+  private val tableConfigCache = CacheBuilder
+.newBuilder()
+.maximumSize(1000)
+.build(new CacheLoader[String, Properties] {
+  override def load(k: String): Properties = {
+try {
+  
HoodieTableMetaClient.builder().setConf(SparkSession.active.sparkContext.hadoopConfiguration)
+.setBasePath(k).build().getTableConfig.getProperties
+} catch {
+  // we catch expected error here
+  case e: HoodieIOException =>
+log.error(e.getMessage)
+new Properties()
+  case t: Throwable =>
+throw t
+}
+  }
+})
+
+  def getPropertiesFromTableConfigCache(path: String): Properties = {
+if (path.isEmpty) {
+  throw new HoodieIOException("unexpected empty hoodie table basePath")
+}
+tableConfigCache.get(path)
+  }
+
+  private def matchHoodieRelation(relation: 

[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614770205



##
File path: 
hudi-spark-datasource/hudi-spark3-extensions_2.12/src/main/scala/org/apache/hudi/execution/HudiSQLUtils.scala
##
@@ -0,0 +1,553 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution
+
+import java.io.File
+import java.util
+import java.util.{Locale, Properties}
+
+import com.google.common.cache.{CacheBuilder, CacheLoader}
+import org.apache.avro.generic.GenericRecord
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hudi.DataSourceReadOptions.{QUERY_TYPE_OPT_KEY, 
QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, QUERY_TYPE_SNAPSHOT_OPT_VAL}
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.HoodieSparkSqlWriter.{TableInstantInfo, toProperties}
+import org.apache.hudi.{AvroConversionUtils, DataSourceReadOptions, 
DataSourceUtils, DataSourceWriteOptions, DefaultSource, 
HoodieBootstrapPartition, HoodieBootstrapRelation, HoodieSparkSqlWriter, 
HoodieSparkUtils, HoodieWriterUtils, MergeOnReadSnapshotRelation}
+import 
org.apache.hudi.common.model.{OverwriteNonDefaultsWithLatestAvroPayload, 
OverwriteWithLatestAvroPayload}
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.client.{HoodieWriteResult, SparkRDDWriteClient}
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model._
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline
+import org.apache.hudi.config.HoodieWriteConfig._
+import org.apache.hudi.exception.{HoodieException, HoodieIOException}
+import org.apache.hudi.hive.{HiveSyncConfig, HiveSyncTool}
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions
+import org.apache.hudi.payload.AWSDmsAvroPayload
+import org.apache.hudi.spark3.internal.HoodieDataSourceInternalTable
+import org.apache.log4j.LogManager
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType, 
HiveTableRelation}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.sources.BaseRelation
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ListBuffer
+
+
+
+/**
+  * hudi IUD utils
+  */
+object HudiSQLUtils {
+
+  val AWSDMSAVROPAYLOAD = classOf[AWSDmsAvroPayload].getName

Review comment:
   I don't quite understand why AWSDMSAVROPAYLOAD is specified here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-16 Thread GitBox


leesf commented on a change in pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#discussion_r614768775



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/deltacommit/SparkUpsertDeltaCommitActionExecutor.java
##
@@ -44,6 +44,7 @@ public 
SparkUpsertDeltaCommitActionExecutor(HoodieSparkEngineContext context,
   @Override
   public HoodieWriteMetadata execute() {
 return SparkWriteHelper.newInstance().write(instantTime, inputRecordsRDD, 
context, table,
-config.shouldCombineBeforeUpsert(), 
config.getUpsertShuffleParallelism(),this, true);
+config.shouldCombineBeforeUpsert(), 
config.getUpsertShuffleParallelism(),
+this, 
Boolean.parseBoolean(config.getProps().getProperty("hoodie.tagging.before.insert",
 "true")));

Review comment:
   ditto




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] RocMarshal closed pull request #2822: [Hotfix][hudi-sync] Refactor method up to parent-class

2021-04-16 Thread GitBox


RocMarshal closed pull request #2822:
URL: https://github.com/apache/hudi/pull/2822


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] luopuya opened a new issue #2839: HIVE_SKIP_RO_SUFFIX config not working

2021-04-16 Thread GitBox


luopuya opened a new issue #2839:
URL: https://github.com/apache/hudi/issues/2839


   How to reproduce:
   
   ```Scala
   dataFrame.write
     .format("org.apache.hudi")
     .option(HIVE_SKIP_RO_SUFFIX, true)
   ```

   It seems that the option is not taking effect.
   
   buildSyncConfig (in HoodieSparkSqlWriter) does not set skipROSuffix from the parameters:
   
https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L398
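
   For reference, a hedged sketch of the kind of wiring that appears to be missing; the option key string and the `skipROSuffix` field name are assumptions based on this report, and the real `buildSyncConfig` fills in many more fields:

```scala
// Sketch only, not the actual fix: read the user-supplied option (defaulting to
// false) and carry it into the HiveSyncConfig that buildSyncConfig returns.
import org.apache.hudi.hive.HiveSyncConfig

object SkipRoSuffixSketch {
  // Assumed Spark datasource key behind HIVE_SKIP_RO_SUFFIX.
  val SkipRoSuffixKey = "hoodie.datasource.hive_sync.skip_ro_suffix"

  def applyTo(syncConfig: HiveSyncConfig, params: Map[String, String]): HiveSyncConfig = {
    syncConfig.skipROSuffix = params.getOrElse(SkipRoSuffixKey, "false").toBoolean
    syncConfig
  }
}
```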

   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] RocMarshal commented on pull request #2822: [Hotfix][hudi-sync] Refactor method up to parent-class

2021-04-16 Thread GitBox


RocMarshal commented on pull request #2822:
URL: https://github.com/apache/hudi/pull/2822#issuecomment-821076075


   @codecov-io rerun tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] RocMarshal removed a comment on pull request #2822: [Hotfix][hudi-sync] Refactor method up to parent-class

2021-04-16 Thread GitBox


RocMarshal removed a comment on pull request #2822:
URL: https://github.com/apache/hudi/pull/2822#issuecomment-821070142


   rerun tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] RocMarshal commented on pull request #2822: [Hotfix][hudi-sync] Refactor method up to parent-class

2021-04-16 Thread GitBox


RocMarshal commented on pull request #2822:
URL: https://github.com/apache/hudi/pull/2822#issuecomment-821072166


   rerun tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] RocMarshal commented on pull request #2822: [Hotfix][hudi-sync] Refactor method up to parent-class

2021-04-16 Thread GitBox


RocMarshal commented on pull request #2822:
URL: https://github.com/apache/hudi/pull/2822#issuecomment-821070142


   rerun tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] GaruGaru commented on issue #2609: [SUPPORT] Presto hudi query slow when compared to parquet

2021-04-16 Thread GitBox


GaruGaru commented on issue #2609:
URL: https://github.com/apache/hudi/issues/2609#issuecomment-821065195


   It could be a good idea to port the changes to trinodb (prestosql), but there may be some differences in the `hive-connector` implementation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] 410680876f1 opened a new issue #2838: [SUPPORT]User class threw exception: org.apache.hudi.exception.HoodieMetadataException: Error syncing to metadata table

2021-04-16 Thread GitBox


410680876f1 opened a new issue #2838:
URL: https://github.com/apache/hudi/issues/2838


   I tried to upgrade Hudi to version 0.8.0 with "hoodie.metadata.enable" set to true, but an error is thrown when Spark writes data to Hudi.
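
   A hypothetical sketch of the write path being described (the table name, key fields and path below are placeholders; only the metadata flag mirrors this report), followed by the error:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object MetadataEnabledWrite {
  // Hypothetical reproduction only; every option except hoodie.metadata.enable
  // is an assumed placeholder, not taken from the report.
  def write(df: DataFrame): Unit =
    df.write.format("org.apache.hudi")
      .option("hoodie.metadata.enable", "true")
      .option("hoodie.table.name", "example_table")              // assumed
      .option("hoodie.datasource.write.recordkey.field", "id")   // assumed
      .option("hoodie.datasource.write.precombine.field", "ts")  // assumed
      .mode(SaveMode.Append)
      .save("/tmp/hudi/example_table")                           // assumed
}
```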
   
   2021/04/16 14:49:03,837 
[WARN][Class->HoodieBackedTableMetadataWriter][Method->bootstrapIfNeeded]: 
Metadata Table will need to be re-bootstrapped as no instants were found
   2021/04/16 14:49:04,047 [ERROR][Class->ApplicationMaster][Method->logError]: 
User class threw exception: org.apache.hudi.exception.HoodieMetadataException: 
Error syncing to metadata table.
   org.apache.hudi.exception.HoodieMetadataException: Error syncing to metadata 
table.
at 
org.apache.hudi.client.SparkRDDWriteClient.syncTableMetadata(SparkRDDWriteClient.java:447)
at 
org.apache.hudi.client.AbstractHoodieWriteClient.preWrite(AbstractHoodieWriteClient.java:400)
at 
org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:153)
at 
org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:214)
at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:186)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:249)
at com.leqee.sparktool.hoodie.HoodieData.saveData(HoodieData.scala:184)
at com.leqee.sparktool.hoodie.HoodieData.upsert(HoodieData.scala:96)
at com.leqee.sparktool.hoodie.Hoodie$class.upsert(Hoodie.scala:160)
at com.leqee.sparktool.hoodie.HoodieData.upsert(HoodieData.scala:13)
at 
com.leqee.datasync.synctool.sync.SyncTask.setLastSync(SyncTask.scala:235)
at 
com.leqee.datasync.synctool.sync.SyncTask.batchSyncing(SyncTask.scala:159)
at 
com.leqee.datasync.synctool.sync.DateTimeSync.syncAnalyze(DateTimeSync.scala:49)
at com.leqee.datasync.synctool.sync.SyncTask.sync(SyncTask.scala:66)
at 
com.leqee.datasync.synctool.sync.SyncProxy.com$leqee$datasync$synctool$sync$SyncProxy$$runSync(SyncProxy.scala:49)
at 
com.leqee.datasync.synctool.sync.SyncProxy$$anonfun$workFunc$1.apply(SyncProxy.scala:55)
at 
com.leqee.datasync.synctool.sync.SyncProxy$$anonfun$workFunc$1.apply(SyncProxy.scala:55)
at 
com.leqee.datasync.synctool.sync.SyncProxy$$anonfun$runGroup$1.apply(SyncProxy.scala:81)
at 
com.leqee.datasync.synctool.sync.SyncProxy$$anonfun$runGroup$1.apply(SyncProxy.scala:80)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
com.leqee.datasync.synctool.sync.SyncProxy.runGroup(SyncProxy.scala:80)
at com.leqee.datasync.synctool.SyncApp$.main(SyncApp.scala:61)
at com.leqee.datasync.synctool.SyncApp.main(SyncApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 

[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-04-16 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r614680980



##
File path: 
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala
##
@@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi
+
+import scala.collection.JavaConverters._
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.CatalogTableType
+import org.apache.spark.sql.types.{DoubleType, IntegerType, LongType, 
StringType, StructField}
+
+class TestCreateTable extends TestHoodieSqlBase {
+
+  test("Test Create Managed Hoodie Table") {
+val tableName = generateTableName
+// Create a managed table
+spark.sql(
+  s"""
+ | create table $tableName (
+ |  id int,
+ |  name string,
+ |  price double,
+ |  ts long
+ | ) using hudi
+ | options (
+ |   primaryKey = 'id',
+ |   versionColumn = 'ts'

Review comment:
   There is a check for the versionColumn when creating a table, in `CreateHoodieTableCommand#validateTable`, so the `versionColumn` must be a field defined in the table columns. I will also add tests that use other column names as the `versionColumn`, for example as sketched below.
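
   A sketch of such a test case in the style of TestCreateTable above (assuming the same fixture, i.e. `spark` and `generateTableName` are in scope), reusing another existing column as the versionColumn:

```scala
// Sketch of an additional case: any column that exists in the schema should be
// accepted as versionColumn; a name not in the schema should fail validateTable.
val tableName = generateTableName
spark.sql(
  s"""
     | create table $tableName (
     |  id int,
     |  name string,
     |  price double,
     |  ts long
     | ) using hudi
     | options (
     |   primaryKey = 'id',
     |   versionColumn = 'price'
     | )
   """.stripMargin)
```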




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-04-16 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r614680980



##
File path: 
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala
##
@@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi
+
+import scala.collection.JavaConverters._
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.CatalogTableType
+import org.apache.spark.sql.types.{DoubleType, IntegerType, LongType, 
StringType, StructField}
+
+class TestCreateTable extends TestHoodieSqlBase {
+
+  test("Test Create Managed Hoodie Table") {
+val tableName = generateTableName
+// Create a managed table
+spark.sql(
+  s"""
+ | create table $tableName (
+ |  id int,
+ |  name string,
+ |  price double,
+ |  ts long
+ | ) using hudi
+ | options (
+ |   primaryKey = 'id',
+ |   versionColumn = 'ts'

Review comment:
   There is a check for the versionColumn when creating a table, in `CreateHoodieTableCommand#validateTable`, so the `versionColumn` must be a field defined in the table columns. I think we should not define a default value for the versionColumn.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tooptoop4 commented on issue #2609: [SUPPORT] Presto hudi query slow when compared to parquet

2021-04-16 Thread GitBox


tooptoop4 commented on issue #2609:
URL: https://github.com/apache/hudi/issues/2609#issuecomment-821029965


   You can raise the PR @njalan.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xglv1985 commented on a change in pull request #2836: [HUDI-1803] Support BAIDU AFS storage format in hudi

2021-04-16 Thread GitBox


xglv1985 commented on a change in pull request #2836:
URL: https://github.com/apache/hudi/pull/2836#discussion_r614664439



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/fs/StorageSchemes.java
##
@@ -28,6 +28,8 @@
   FILE("file", false),
   // Hadoop File System
   HDFS("hdfs", true),
+  // Baidu Advanced File System
+  AFS("afs", true),

Review comment:
   > @xglv1985 Would you please also add an entry here 
http://hudi.apache.org/docs/cloud.html and a page like this 
http://hudi.apache.org/docs/ibm_cos_hoodie.html to describe how to use AFS?
   
   Hello, Baidu AFS is currently only open to its staff. As for how to use AFS, it is highly similar to HDFS: AFS does some optimization at the base level but does not change the main user API, so users cannot perceive the difference. If the documentation is mandatory, could I update the doc once AFS is open to all users?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-04-16 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r614669121



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/CreateHoodieTableAsSelectCommand.scala
##
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command
+
+import org.apache.spark.sql.{Row, SaveMode, SparkSession}
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.command.DataWritingCommand
+
+/**
+  * Command for create table as query statement.
+  */
+case class CreateHoodieTableAsSelectCommand(
+   table: CatalogTable,
+   mode: SaveMode,
+   query: LogicalPlan) extends DataWritingCommand {
+
+  override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = {
+assert(table.tableType != CatalogTableType.VIEW)
+assert(table.provider.isDefined)
+
+val sessionState = sparkSession.sessionState
+val db = 
table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase)
+val tableIdentWithDB = table.identifier.copy(database = Some(db))
+val tableName = tableIdentWithDB.unquotedString
+
+if (sessionState.catalog.tableExists(tableIdentWithDB)) {
+  assert(mode != SaveMode.Overwrite,
+s"Expect the table $tableName has been dropped when the save mode is 
Overwrite")
+
+  if (mode == SaveMode.ErrorIfExists) {
+throw new RuntimeException(s"Table $tableName already exists. You need 
to drop it first.")
+  }
+  if (mode == SaveMode.Ignore) {
+// Since the table already exists and the save mode is Ignore, we will 
just return.
+// scalastyle:off
+return Seq.empty
+// scalastyle:on

Review comment:
   The scala style check will fail on the `return` statement, so we should turn it off here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1803) Support BAIDU AFS storage format in hudi

2021-04-16 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-1803:
---
Fix Version/s: 0.9.0

> Support BAIDU AFS storage format in hudi
> 
>
> Key: HUDI-1803
> URL: https://issues.apache.org/jira/browse/HUDI-1803
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Xu Guang Lv
>Assignee: Xu Guang Lv
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The storage format of Baidu Advanced File System (AFS) can be supported by 
> Hudi, but currently I have to modify the related code each time I check out 
> the Hudi source. Hopefully Hudi will officially support it, for convenience.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1803) Support BAIDU AFS storage format in hudi

2021-04-16 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-1803.
--
Resolution: Done

1d53d6e6c2d3ff967d333d87b308406c8cd6917a

> Support BAIDU AFS storage format in hudi
> 
>
> Key: HUDI-1803
> URL: https://issues.apache.org/jira/browse/HUDI-1803
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Xu Guang Lv
>Assignee: Xu Guang Lv
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The storage format of Baidu Advanced File System (AFS) can be supported by 
> Hudi, but currently I have to modify the related code each time I check out 
> the Hudi source. Hopefully Hudi will officially support it, for convenience.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] yanghua commented on a change in pull request #2836: [HUDI-1803] Support BAIDU AFS storage format in hudi

2021-04-16 Thread GitBox


yanghua commented on a change in pull request #2836:
URL: https://github.com/apache/hudi/pull/2836#discussion_r614667438



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/fs/StorageSchemes.java
##
@@ -28,6 +28,8 @@
   FILE("file", false),
   // Hadoop File System
   HDFS("hdfs", true),
+  // Baidu Advanced File System
+  AFS("afs", true),

Review comment:
   OK, sounds good.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua merged pull request #2836: [HUDI-1803] Support BAIDU AFS storage format in hudi

2021-04-16 Thread GitBox


yanghua merged pull request #2836:
URL: https://github.com/apache/hudi/pull/2836


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated (62b8a34 -> 1d53d6e)

2021-04-16 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 62b8a34  [HUDI-1792] flink-client query error when processing files 
larger than 128mb (#2814)
 add 1d53d6e  [HUDI-1803] Support BAIDU AFS storage format in hudi (#2836)

No new revisions were added by this update.

Summary of changes:
 hudi-common/src/main/java/org/apache/hudi/common/fs/StorageSchemes.java | 2 ++
 .../src/test/java/org/apache/hudi/common/fs/TestStorageSchemes.java | 1 +
 2 files changed, 3 insertions(+)


[GitHub] [hudi] xglv1985 commented on a change in pull request #2836: [HUDI-1803] Support BAIDU AFS storage format in hudi

2021-04-16 Thread GitBox


xglv1985 commented on a change in pull request #2836:
URL: https://github.com/apache/hudi/pull/2836#discussion_r614664439



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/fs/StorageSchemes.java
##
@@ -28,6 +28,8 @@
   FILE("file", false),
   // Hadoop File System
   HDFS("hdfs", true),
+  // Baidu Advanced File System
+  AFS("afs", true),

Review comment:
   > @xglv1985 Would you please also add an entry here 
http://hudi.apache.org/docs/cloud.html and a page like this 
http://hudi.apache.org/docs/ibm_cos_hoodie.html to describe how to use AFS?
   
   Hello, Baidu AFS is currently just open to its staff. As for how to use AFS, it is highly similar to HDFS: AFS does some optimization at the base level but does not change the main user API, so users cannot perceive the difference. If the documentation is mandatory, could I update the doc once AFS is open to all users?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-04-16 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r614665177



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala
##
@@ -0,0 +1,318 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.analysis
+
+import org.apache.hudi.SparkSqlAdapterSupport
+
+import scala.collection.JavaConverters._
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.spark.SPARK_VERSION
+import org.apache.spark.sql.{AnalysisException, SparkSession}
+import org.apache.spark.sql.catalyst.analysis.UnresolvedStar
+import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
+import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, Literal, 
NamedExpression}
+import org.apache.spark.sql.catalyst.plans.Inner
+import org.apache.spark.sql.catalyst.plans.logical.{Assignment, DeleteAction, 
DeleteFromTable, InsertAction, LogicalPlan, MergeIntoTable, Project, 
UpdateAction, UpdateTable}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.command.CreateDataSourceTableCommand
+import org.apache.spark.sql.execution.datasources.{CreateTable, 
LogicalRelation}
+import org.apache.spark.sql.hudi.HoodieSqlUtils._
+import org.apache.spark.sql.hudi.command.{CreateHoodieTableAsSelectCommand, 
CreateHoodieTableCommand, DeleteHoodieTableCommand, 
InsertIntoHoodieTableCommand, MergeIntoHoodieTableCommand, 
UpdateHoodieTableCommand}
+import org.apache.spark.sql.types.StringType
+
+object HoodieAnalysis {
+  def customResolutionRules(): Seq[SparkSession => Rule[LogicalPlan]] =
+Seq(
+  session => HoodieResolveReferences(session),
+  session => HoodieAnalysis(session)
+)
+
+  def customPostHocResolutionRules(): Seq[SparkSession => Rule[LogicalPlan]] =
+Seq(
+  session => HoodiePostAnalysisRule(session)
+)
+}
+
+/**
+  * Rule for convert the logical plan to command.
+  * @param sparkSession
+  */
+case class HoodieAnalysis(sparkSession: SparkSession) extends Rule[LogicalPlan]
+  with SparkSqlAdapterSupport {
+
+  override def apply(plan: LogicalPlan): LogicalPlan = {
+plan match {
+  // Convert to MergeIntoHoodieTableCommand
+  case m @ MergeIntoTable(target, _, _, _, _)
+if m.resolved && isHoodieTable(target, sparkSession) =>
+  MergeIntoHoodieTableCommand(m)
+
+  // Convert to UpdateHoodieTableCommand
+  case u @ UpdateTable(table, _, _)
+if u.resolved && isHoodieTable(table, sparkSession) =>
+  UpdateHoodieTableCommand(u)
+
+  // Convert to DeleteHoodieTableCommand
+  case d @ DeleteFromTable(table, _)
+if d.resolved && isHoodieTable(table, sparkSession) =>
+  DeleteHoodieTableCommand(d)
+
+  // Convert to InsertIntoHoodieTableCommand
+  case l if sparkSqlAdapter.isInsertInto(l) =>
+val (table, partition, query, overwrite, _) = 
sparkSqlAdapter.getInsertIntoChildren(l).get
+table match {
+  case relation: LogicalRelation if isHoodieTable(relation, 
sparkSession) =>
+new InsertIntoHoodieTableCommand(relation, query, partition, 
overwrite)
+  case _ =>
+l
+}
+  // Convert to CreateHoodieTableAsSelectCommand
+  case CreateTable(table, mode, Some(query))
+if query.resolved && isHoodieTable(table) =>
+  CreateHoodieTableAsSelectCommand(table, mode, query)
+  case _=> plan
+}
+  }
+}
+
+/**
+  * Rule for resolve hoodie's extended syntax or rewrite some logical plan.
+  * @param sparkSession
+  */
+case class HoodieResolveReferences(sparkSession: SparkSession) extends 
Rule[LogicalPlan]
+  with SparkSqlAdapterSupport {
+  private lazy val analyzer = sparkSession.sessionState.analyzer
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+plan match {
+  // Resolve merge into
+  case MergeIntoTable(target, source, mergeCondition, matchedActions, 
notMatchedActions)
+if isHoodieTable(target, sparkSession) && target.resolved && 
source.resolved =>
+
+def 

[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-04-16 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r614664550



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/SparkSqlAdapterSupport.scala
##
@@ -0,0 +1,34 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.spark.SPARK_VERSION
+import org.apache.spark.sql.hudi.SparkSqlAdapter
+
+trait SparkSqlAdapterSupport {
+
+  lazy val sparkSqlAdapter: SparkSqlAdapter = {
+val adapterClass = if (SPARK_VERSION.startsWith("2.")) {
+  "org.apache.spark.sql.adapter.Spark2SqlAdapter"

Review comment:
   Yes, because the `hudi-spark` project includes only `hudi-spark2` or 
`hudi-spark3`, we cannot include the Spark 2 dependency and the Spark 3 
dependency in the project at the same time. So we cannot reference both 
`Spark2SqlAdapter` and `Spark3SqlAdapter` at compile time. Here we use the 
class name string instead of the class itself to avoid a compile error.
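   
   For illustration, a minimal sketch of that pattern (assumptions: both adapters 
implement a common trait and have no-arg constructors; this is a simplified 
stand-in, not the exact PR code):
   
   ```scala
   import org.apache.spark.SPARK_VERSION

   // Stand-in for the shared adapter trait that both implementations extend in the PR.
   trait SqlAdapter

   // Pick the implementation class name from the running Spark version and
   // instantiate it reflectively, so neither adapter is a compile-time dependency.
   def loadAdapter(): SqlAdapter = {
     val adapterClass =
       if (SPARK_VERSION.startsWith("2.")) "org.apache.spark.sql.adapter.Spark2SqlAdapter"
       else "org.apache.spark.sql.adapter.Spark3SqlAdapter"
     Class.forName(adapterClass)
       .getDeclaredConstructor()
       .newInstance()
       .asInstanceOf[SqlAdapter]
   }
   ```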




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xglv1985 commented on a change in pull request #2836: [HUDI-1803] Support BAIDU AFS storage format in hudi

2021-04-16 Thread GitBox


xglv1985 commented on a change in pull request #2836:
URL: https://github.com/apache/hudi/pull/2836#discussion_r614664439



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/fs/StorageSchemes.java
##
@@ -28,6 +28,8 @@
   FILE("file", false),
   // Hadoop File System
   HDFS("hdfs", true),
+  // Baidu Advanced File System
+  AFS("afs", true),

Review comment:
   > @xglv1985 Would you please also add an entry here 
http://hudi.apache.org/docs/cloud.html and a page like this 
http://hudi.apache.org/docs/ibm_cos_hoodie.html to describe how to use AFS?
   
   Hello, Baidu AFS is currently open only to its staff. As for the way to use 
AFS, it is highly similar to HDFS. AFS does some optimizations at the base 
level but does not change the main user API, so users cannot perceive the 
difference. If the documentation is mandatory, could I update the doc once 
AFS is open to all users?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-04-16 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r614664550



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/SparkSqlAdapterSupport.scala
##
@@ -0,0 +1,34 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.spark.SPARK_VERSION
+import org.apache.spark.sql.hudi.SparkSqlAdapter
+
+trait SparkSqlAdapterSupport {
+
+  lazy val sparkSqlAdapter: SparkSqlAdapter = {
+val adapterClass = if (SPARK_VERSION.startsWith("2.")) {
+  "org.apache.spark.sql.adapter.Spark2SqlAdapter"

Review comment:
   Yes, because the `hudi-spark` project includes only `hudi-spark2` or 
`hudi-spark3`, we cannot include the Spark 2 dependency and the Spark 3 
dependency in the project at the same time. So we cannot reference both 
`Spark2SqlAdapter` and `Spark3SqlAdapter` at compile time. Here we use the 
class name string instead of the class itself to avoid a compile error.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xglv1985 commented on a change in pull request #2836: [HUDI-1803] Support BAIDU AFS storage format in hudi

2021-04-16 Thread GitBox


xglv1985 commented on a change in pull request #2836:
URL: https://github.com/apache/hudi/pull/2836#discussion_r614664439



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/fs/StorageSchemes.java
##
@@ -28,6 +28,8 @@
   FILE("file", false),
   // Hadoop File System
   HDFS("hdfs", true),
+  // Baidu Advanced File System
+  AFS("afs", true),

Review comment:
   > @xglv1985 Would you please also add an entry here 
http://hudi.apache.org/docs/cloud.html and a page like this 
http://hudi.apache.org/docs/ibm_cos_hoodie.html to describe how to use AFS?
   
   Hello, Baidu AFS is currently open only to its staff. As for the way to use 
AFS, it is highly similar to HDFS. AFS does some optimizations at the base 
level but does not change the main user API, so users cannot perceive the 
difference. If the documentation is mandatory, could I update the doc once 
AFS is open to all users?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-04-16 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r614658039



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/HoodiePayloadProps.java
##
@@ -40,4 +40,12 @@
*/
   public static final String PAYLOAD_EVENT_TIME_FIELD_PROP = 
"hoodie.payload.event.time.field";
   public static String DEFAULT_PAYLOAD_EVENT_TIME_FIELD_VAL = "ts";
+
+  public static final String PAYLOAD_DELETE_CONDITION = 
"hoodie.payload.delete.condition";

Review comment:
   ok




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io commented on pull request #2836: [HUDI-1803] Support BAIDU AFS storage format in hudi

2021-04-16 Thread GitBox


codecov-io commented on pull request #2836:
URL: https://github.com/apache/hudi/pull/2836#issuecomment-821007767


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/2836?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#2836](https://codecov.io/gh/apache/hudi/pull/2836?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (f1319af) into 
[master](https://codecov.io/gh/apache/hudi/commit/b6d949b48a649acac27d5d9b91677bf2e25e9342?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (b6d949b) will **decrease** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2836/graphs/tree.svg?width=650=150=pr=VTTXabwbs2_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2836?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#2836  +/-   ##
   
   - Coverage 52.60%   52.59%   -0.01% 
   + Complexity 3709 3708   -1 
   
 Files   485  485  
 Lines 2322423227   +3 
 Branches   2465 2466   +1 
   
 Hits  1221612216  
   - Misses 9929 9933   +4 
   + Partials   1079 1078   -1 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `40.29% <ø> (ø)` | `215.00 <ø> (ø)` | |
   | hudiclient | `∅ <ø> (∅)` | `0.00 <ø> (ø)` | |
   | hudicommon | `50.68% <100.00%> (-0.01%)` | `1976.00 <0.00> (-1.00)` | |
   | hudiflink | `56.51% <ø> (-0.04%)` | `516.00 <ø> (ø)` | |
   | hudihadoopmr | `33.33% <ø> (ø)` | `198.00 <ø> (ø)` | |
   | hudisparkdatasource | `72.06% <ø> (ø)` | `237.00 <ø> (ø)` | |
   | hudisync | `45.70% <ø> (ø)` | `131.00 <ø> (ø)` | |
   | huditimelineservice | `64.36% <ø> (ø)` | `62.00 <ø> (ø)` | |
   | hudiutilities | `69.79% <ø> (ø)` | `373.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2836?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...java/org/apache/hudi/common/fs/StorageSchemes.java](https://codecov.io/gh/apache/hudi/pull/2836/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL1N0b3JhZ2VTY2hlbWVzLmphdmE=)
 | `100.00% <100.00%> (ø)` | `10.00 <0.00> (ø)` | |
   | 
[...ache/hudi/common/fs/inline/InMemoryFileSystem.java](https://codecov.io/gh/apache/hudi/pull/2836/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9Jbk1lbW9yeUZpbGVTeXN0ZW0uamF2YQ==)
 | `79.31% <0.00%> (-10.35%)` | `15.00% <0.00%> (-1.00%)` | |
   | 
[.../hudi/table/format/cow/CopyOnWriteInputFormat.java](https://codecov.io/gh/apache/hudi/pull/2836/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9mb3JtYXQvY293L0NvcHlPbldyaXRlSW5wdXRGb3JtYXQuamF2YQ==)
 | `55.33% <0.00%> (-0.75%)` | `20.00% <0.00%> (ø%)` | |
   | 
[...e/hudi/common/table/log/HoodieLogFormatWriter.java](https://codecov.io/gh/apache/hudi/pull/2836/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRXcml0ZXIuamF2YQ==)
 | `79.68% <0.00%> (+1.56%)` | `26.00% <0.00%> (ø%)` | |
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-04-16 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r614649396



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/DefaultHoodieRecordPayload.java
##
@@ -97,4 +86,20 @@ public DefaultHoodieRecordPayload(Option 
record) {
 }
 return metadata.isEmpty() ? Option.empty() : Option.of(metadata);
   }
+
+  protected boolean noNeedUpdatePersistedRecord(IndexedRecord currentValue,

Review comment:
   Yeah, `needUpdatePersistedRecord` looks more readable.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #2836: [HUDI-1803] Support BAIDU AFS storage format in hudi

2021-04-16 Thread GitBox


yanghua commented on a change in pull request #2836:
URL: https://github.com/apache/hudi/pull/2836#discussion_r614643725



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/fs/StorageSchemes.java
##
@@ -28,6 +28,8 @@
   FILE("file", false),
   // Hadoop File System
   HDFS("hdfs", true),
+  // Baidu Advanced File System
+  AFS("afs", true),

Review comment:
   @xglv1985  Would you please also add an entry here 
http://hudi.apache.org/docs/cloud.html and a page like this 
http://hudi.apache.org/docs/ibm_cos_hoodie.html to describe how to use AFS?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1803) Support BAIDU AFS storage format in hudi

2021-04-16 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-1803:
---
Summary: Support BAIDU AFS storage format in hudi  (was: Hopefully Hudi 
will officially support BAIDU AFS storage format)

> Support BAIDU AFS storage format in hudi
> 
>
> Key: HUDI-1803
> URL: https://issues.apache.org/jira/browse/HUDI-1803
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Xu Guang Lv
>Assignee: Xu Guang Lv
>Priority: Minor
>  Labels: pull-request-available
>
> The storage format of the Baidu Advanced File System (AFS) can naturally be 
> supported by Hudi each time after I check out the Hudi source code and modify 
> the related code. Hopefully Hudi will officially support it, for convenience.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] yanghua commented on a change in pull request #2836: [MINOR] support BAIDU afs. jira id: HUDI-1803

2021-04-16 Thread GitBox


yanghua commented on a change in pull request #2836:
URL: https://github.com/apache/hudi/pull/2836#discussion_r614639114



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/fs/StorageSchemes.java
##
@@ -28,6 +28,8 @@
   FILE("file", false),
   // Hadoop File System
   HDFS("hdfs", true),
+  // Baidu Advanced File System
+  AFS("afs", true),

Review comment:
   Yes. Baidu: the biggest search engine in China.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zherenyu831 commented on pull request #2784: [HUDI-1740] Fix insert-overwrite API archival

2021-04-16 Thread GitBox


zherenyu831 commented on pull request #2784:
URL: https://github.com/apache/hudi/pull/2784#issuecomment-820994344


   There is one solution: always enable the cleaner, which will help you refresh 
the timeline.
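   
   A hedged illustration of that suggestion with the standard cleaner options (the 
table name, path, and retention value are placeholders, not from this PR):
   
   ```scala
   import org.apache.spark.sql.{SaveMode, SparkSession}

   val spark = SparkSession.builder().appName("hudi-cleaner-enabled").getOrCreate()
   val df = spark.createDataFrame(Seq(("k1", 1L, "v1"))).toDF("id", "ts", "value")

   // Keep automatic cleaning on so every commit also runs the cleaner,
   // which reloads the active timeline after insert-overwrite commits.
   df.write.format("hudi")
     .option("hoodie.table.name", "my_table")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .option("hoodie.clean.automatic", "true")          // default is true; do not turn it off
     .option("hoodie.cleaner.commits.retained", "10")   // commits to retain before cleaning
     .mode(SaveMode.Append)
     .save("/path/to/hudi/my_table")
   ```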


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] liijiankang commented on issue #2837: [SUPPORT]How to measure the performance of upsert

2021-04-16 Thread GitBox


liijiankang commented on issue #2837:
URL: https://github.com/apache/hudi/issues/2837#issuecomment-820987025


   @zherenyu831 thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] liijiankang closed issue #2837: [SUPPORT]How to measure the performance of upsert

2021-04-16 Thread GitBox


liijiankang closed issue #2837:
URL: https://github.com/apache/hudi/issues/2837


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #2836: [MINOR] support BAIDU afs. jira id: HUDI-1803

2021-04-16 Thread GitBox


vinothchandar commented on a change in pull request #2836:
URL: https://github.com/apache/hudi/pull/2836#discussion_r614623290



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/fs/StorageSchemes.java
##
@@ -28,6 +28,8 @@
   FILE("file", false),
   // Hadoop File System
   HDFS("hdfs", true),
+  // Baidu Advanced File System
+  AFS("afs", true),

Review comment:
   supports append? nice!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zherenyu831 commented on issue #2837: [SUPPORT]How to measure the performance of upsert

2021-04-16 Thread GitBox


zherenyu831 commented on issue #2837:
URL: https://github.com/apache/hudi/issues/2837#issuecomment-820956020


   It depends on the number of cores, the configuration, the insert/update ratio, 
and the storage type. IMO, it is not slow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] liijiankang opened a new issue #2837: [SUPPORT]How to measure the performance of upsert

2021-04-16 Thread GitBox


liijiankang opened a new issue #2837:
URL: https://github.com/apache/hudi/issues/2837


   **Describe the problem you faced**
   
   I am a novice and would appreciate your help. We use Structured Streaming to 
consume the data from Kafka and then write it to a Hudi COW table. I want to 
know whether the performance of this job is high or low.
   
   
![1](https://user-images.githubusercontent.com/42951757/114981730-0124d580-9ec1-11eb-837a-bb82e29294e2.png)
   
![2](https://user-images.githubusercontent.com/42951757/114981748-097d1080-9ec1-11eb-9526-37e5ffc664b9.png)
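   
   For reference, a minimal sketch of the pipeline described above (the broker, topic, 
schema, paths, and all Hudi options are assumed for illustration; they are not taken 
from the poster's job):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.from_json
   import org.apache.spark.sql.types.{LongType, StringType, StructType}

   val spark = SparkSession.builder().appName("kafka-to-hudi-cow").getOrCreate()
   import spark.implicits._

   // Assumed event schema: an "id" record key and a "ts" precombine field.
   val schema = new StructType()
     .add("id", StringType).add("ts", LongType).add("payload", StringType)

   val events = spark.readStream
     .format("kafka")
     .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical brokers
     .option("subscribe", "events")                       // hypothetical topic
     .load()
     .select(from_json($"value".cast("string"), schema).as("e"))
     .select("e.*")

   // Write each micro-batch into a Copy-on-Write Hudi table.
   events.writeStream
     .format("hudi")
     .option("hoodie.table.name", "events_cow")
     .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .option("checkpointLocation", "/tmp/checkpoints/events_cow")
     .outputMode("append")
     .start("/tmp/hudi/events_cow")
     .awaitTermination()
   ```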
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1803) Hopefully Hudi will officially support BAIDU AFS storage format

2021-04-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1803:
-
Labels: pull-request-available  (was: )

> Hopefully Hudi will officially support BAIDU AFS storage format
> ---
>
> Key: HUDI-1803
> URL: https://issues.apache.org/jira/browse/HUDI-1803
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Xu Guang Lv
>Assignee: Xu Guang Lv
>Priority: Minor
>  Labels: pull-request-available
>
> The storage format of the Baidu Advanced File System (AFS) can naturally be 
> supported by Hudi each time after I check out the Hudi source code and modify 
> the related code. Hopefully Hudi will officially support it, for convenience.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

