[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1449: [WIP][HUDI-698]Add unit test for CleansCommand

2020-04-07 Thread GitBox
yanghua commented on a change in pull request #1449: [WIP][HUDI-698]Add unit 
test for CleansCommand
URL: https://github.com/apache/incubator-hudi/pull/1449#discussion_r405274332
 
 

 ##
 File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestCleansCommand.java
 ##
 @@ -56,8 +58,16 @@ public void init() throws IOException {
 
 String tableName = "test_table";
 tablePath = basePath + File.separator + tableName;
-propsFilePath = 
this.getClass().getClassLoader().getResource("clean.properties").getPath();
-
+propsFilePath = 
TestCleansCommand.class.getClassLoader().getResource("clean.properties").getPath();
+if (propsFilePath == null) {
+  System.out.println("---");
 
 Review comment:
   Please make sure we do not use STDOUT in the test case.
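   As an illustration, here is a minimal, self-contained sketch (assuming JUnit 4; the class name and assertion message are made up, not part of the PR) that surfaces a missing resource through an assertion instead of STDOUT:

   ```java
   import static org.junit.Assert.assertNotNull;

   import java.net.URL;
   import org.junit.Test;

   public class TestCleanPropertiesResource {

     @Test
     public void cleanPropertiesIsOnTestClasspath() {
       // Fail the test with a clear message rather than printing to STDOUT.
       URL propsUrl = getClass().getClassLoader().getResource("clean.properties");
       assertNotNull("clean.properties must be on the test classpath", propsUrl);
       assertNotNull(propsUrl.getPath());
     }
   }
   ```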


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1452: [HUDI-740]Fix can not specify the sparkMaster and code clean for SparkUtil

2020-04-07 Thread GitBox
yanghua commented on a change in pull request #1452: [HUDI-740]Fix can not 
specify the sparkMaster and code clean for SparkUtil
URL: https://github.com/apache/incubator-hudi/pull/1452#discussion_r405272257
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/utils/SparkUtil.java
 ##
 @@ -61,9 +62,14 @@ public static SparkLauncher initLauncher(String 
propertiesFile) throws URISyntax
   }
 
   public static JavaSparkContext initJavaSparkConf(String name) {
+return initJavaSparkConf(name, Option.empty(), Option.empty());
+  }
+
+  public static JavaSparkContext initJavaSparkConf(String name, Option 
master,
+   Option 
executorMemory) {
 SparkConf sparkConf = new SparkConf().setAppName(name);
 
-String defMasterFromEnv = sparkConf.getenv("SPARK_MASTER");
+String defMasterFromEnv = master.orElse(sparkConf.getenv("SPARK_MASTER"));
 
 Review comment:
   +1 to introduce `HoodieCliSparkConfig`
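   For context, a hedged sketch of what a `HoodieCliSparkConfig` constants holder and the new `initJavaSparkConf` overload could look like. Only the class name comes from this thread; every key, default and helper below is an assumption, not the PR's code:

   ```java
   import org.apache.hudi.common.util.Option;

   import org.apache.spark.SparkConf;
   import org.apache.spark.api.java.JavaSparkContext;

   /** Illustrative constants holder; the real HoodieCliSparkConfig may differ. */
   final class HoodieCliSparkConfig {
     static final String CLI_SPARK_MASTER_ENV = "SPARK_MASTER";
     static final String CLI_EXECUTOR_MEMORY = "spark.executor.memory";

     private HoodieCliSparkConfig() {
     }
   }

   final class SparkUtilSketch {

     static JavaSparkContext initJavaSparkConf(String name, Option<String> master, Option<String> executorMemory) {
       SparkConf sparkConf = new SparkConf().setAppName(name);
       // Prefer an explicitly supplied master; fall back to the SPARK_MASTER environment variable.
       String resolvedMaster = master.orElse(sparkConf.getenv(HoodieCliSparkConfig.CLI_SPARK_MASTER_ENV));
       // Defaulting to local mode here is an assumption, not taken from the PR.
       sparkConf.setMaster(resolvedMaster != null ? resolvedMaster : "local[*]");
       if (executorMemory.isPresent()) {
         sparkConf.set(HoodieCliSparkConfig.CLI_EXECUTOR_MEMORY, executorMemory.get());
       }
       return new JavaSparkContext(sparkConf);
     }
   }
   ```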


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1495: [HUDI-770] Organize upsert/insert API implementation under a single package

2020-04-07 Thread GitBox
codecov-io edited a comment on issue #1495: [HUDI-770] Organize upsert/insert 
API implementation under a single package
URL: https://github.com/apache/incubator-hudi/pull/1495#issuecomment-610761048
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1495?src=pr=h1) 
Report
   > Merging 
[#1495](https://codecov.io/gh/apache/incubator-hudi/pull/1495?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/b5d093a21bbb19f164fbc549277188f2151232a8=desc)
 will **decrease** coverage by `0.61%`.
   > The diff coverage is `89.09%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1495/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1495?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1495      +/-   ##
   
   - Coverage     71.53%   70.92%    -0.62% 
   - Complexity      261      290       +29 
   
     Files           336      348       +12 
     Lines         15744    16291      +547 
     Branches       1610     1660       +50 
   
   + Hits          11263    11554      +291 
   - Misses         3760     3998      +238 
   - Partials        721      739       +18 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1495?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...c/main/java/org/apache/hudi/table/HoodieTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllVGFibGUuamF2YQ==)
 | `79.64% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...g/apache/hudi/table/action/BaseActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL0Jhc2VBY3Rpb25FeGVjdXRvci5qYXZh)
 | `100.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...action/commit/CopyOnWriteCommitActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9Db3B5T25Xcml0ZUNvbW1pdEFjdGlvbkV4ZWN1dG9yLmphdmE=)
 | `64.44% <64.44%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[.../apache/hudi/client/AbstractHoodieWriteClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0Fic3RyYWN0SG9vZGllV3JpdGVDbGllbnQuamF2YQ==)
 | `73.77% <66.66%> (-0.39%)` | `0.00 <0.00> (ø)` | |
   | 
[...ction/commit/AbstractBaseCommitActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9BYnN0cmFjdEJhc2VDb21taXRBY3Rpb25FeGVjdXRvci5qYXZh)
 | `84.68% <84.68%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...action/commit/MergeOnReadCommitActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9NZXJnZU9uUmVhZENvbW1pdEFjdGlvbkV4ZWN1dG9yLmphdmE=)
 | `88.88% <88.88%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...le/action/commit/MergeOnReadUpsertPartitioner.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9NZXJnZU9uUmVhZFVwc2VydFBhcnRpdGlvbmVyLmphdmE=)
 | `92.15% <92.15%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...he/hudi/table/action/commit/UpsertPartitioner.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9VcHNlcnRQYXJ0aXRpb25lci5qYXZh)
 | `94.96% <94.96%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...java/org/apache/hudi/client/HoodieWriteClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVdyaXRlQ2xpZW50LmphdmE=)
 | `66.81% <100.00%> (-2.42%)` | `0.00 <0.00> (ø)` | |
   | 
[.../org/apache/hudi/table/HoodieCopyOnWriteTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29weU9uV3JpdGVUYWJsZS5qYXZh)
 | `31.41% <100.00%> (-57.87%)` | `0.00 <0.00> (ø)` | |
   | ... and [31 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1495?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = 

[GitHub] [incubator-hudi] codecov-io commented on issue #1495: [HUDI-770] Organize upsert/insert API implementation under a single package

2020-04-07 Thread GitBox
codecov-io commented on issue #1495: [HUDI-770] Organize upsert/insert API 
implementation under a single package
URL: https://github.com/apache/incubator-hudi/pull/1495#issuecomment-610761048
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1495?src=pr=h1) 
Report
   > Merging 
[#1495](https://codecov.io/gh/apache/incubator-hudi/pull/1495?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/b5d093a21bbb19f164fbc549277188f2151232a8=desc)
 will **decrease** coverage by `0.61%`.
   > The diff coverage is `89.09%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1495/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1495?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1495      +/-   ##
   
   - Coverage     71.53%   70.92%    -0.62% 
   - Complexity      261      290       +29 
   
     Files           336      348       +12 
     Lines         15744    16291      +547 
     Branches       1610     1660       +50 
   
   + Hits          11263    11554      +291 
   - Misses         3760     3998      +238 
   - Partials        721      739       +18 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1495?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...c/main/java/org/apache/hudi/table/HoodieTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllVGFibGUuamF2YQ==)
 | `79.64% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...g/apache/hudi/table/action/BaseActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL0Jhc2VBY3Rpb25FeGVjdXRvci5qYXZh)
 | `100.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...action/commit/CopyOnWriteCommitActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9Db3B5T25Xcml0ZUNvbW1pdEFjdGlvbkV4ZWN1dG9yLmphdmE=)
 | `64.44% <64.44%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[.../apache/hudi/client/AbstractHoodieWriteClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0Fic3RyYWN0SG9vZGllV3JpdGVDbGllbnQuamF2YQ==)
 | `73.77% <66.66%> (-0.39%)` | `0.00 <0.00> (ø)` | |
   | 
[...ction/commit/AbstractBaseCommitActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9BYnN0cmFjdEJhc2VDb21taXRBY3Rpb25FeGVjdXRvci5qYXZh)
 | `84.68% <84.68%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...action/commit/MergeOnReadCommitActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9NZXJnZU9uUmVhZENvbW1pdEFjdGlvbkV4ZWN1dG9yLmphdmE=)
 | `88.88% <88.88%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...le/action/commit/MergeOnReadUpsertPartitioner.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9NZXJnZU9uUmVhZFVwc2VydFBhcnRpdGlvbmVyLmphdmE=)
 | `92.15% <92.15%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...he/hudi/table/action/commit/UpsertPartitioner.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9VcHNlcnRQYXJ0aXRpb25lci5qYXZh)
 | `94.96% <94.96%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...java/org/apache/hudi/client/HoodieWriteClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVdyaXRlQ2xpZW50LmphdmE=)
 | `66.81% <100.00%> (-2.42%)` | `0.00 <0.00> (ø)` | |
   | 
[.../org/apache/hudi/table/HoodieCopyOnWriteTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29weU9uV3JpdGVUYWJsZS5qYXZh)
 | `31.41% <100.00%> (-57.87%)` | `0.00 <0.00> (ø)` | |
   | ... and [31 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1495/diff?src=pr=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1495?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing 

[jira] [Assigned] (HUDI-684) Introduce abstraction for writing and reading and compacting from FileGroups

2020-04-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-684:
---

Assignee: (was: Vinoth Chandar)

> Introduce abstraction for writing and reading and compacting from FileGroups 
> -
>
> Key: HUDI-684
> URL: https://issues.apache.org/jira/browse/HUDI-684
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>
> We may have different combinations of base and log data 
>  
> parquet , avro (today)
> parquet, parquet 
> hfile, hfile (indexing, RFC-08)
>  
> reading/writing/compaction machinery should be solved 
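Not part of the ticket, but a minimal sketch of the kind of abstraction the description hints at: one interface hiding the (base, log) format combination from readers, writers and the compactor. All names are hypothetical.

```java
import java.io.IOException;
import java.util.Iterator;

/**
 * Hypothetical sketch for HUDI-684: a single surface behind which
 * (parquet, avro), (parquet, parquet) or (hfile, hfile) file groups
 * could be written, read and compacted. Shapes are illustrative only.
 */
interface FileGroupIO<R> {

  /** Append records to the file group's log (or base) file. */
  void write(Iterator<R> records) throws IOException;

  /** Read the merged view of the base file plus its log files. */
  Iterator<R> readMerged() throws IOException;

  /** Rewrite base + logs into a new compacted base file. */
  void compact() throws IOException;
}
```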



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-767) Support transformation when export to Hudi

2020-04-07 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-767:

Status: Open  (was: New)

> Support transformation when export to Hudi
> --
>
> Key: HUDI-767
> URL: https://issues.apache.org/jira/browse/HUDI-767
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Raymond Xu
>Priority: Major
> Fix For: 0.6.0
>
>
> Main logic described in 
> https://github.com/apache/incubator-hudi/issues/1480#issuecomment-608529410
> In HoodieSnapshotExporter, we could extend the feature to include 
> transformation when --output-format hudi, using a custom Transformer
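A hedged sketch of the transformation half of this idea. How it would be hooked into HoodieSnapshotExporter (the custom Transformer mentioned above) is left to the actual change; the class, method and column names below are purely illustrative.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

/**
 * Illustrative transformation that could run before the exporter writes the
 * output table when --output-format is hudi. Sketch only, not the real hook.
 */
public final class DropTempColumns {

  private DropTempColumns() {
  }

  /** Drop an intermediate column so it does not end up in the exported table. */
  public static Dataset<Row> apply(Dataset<Row> input, String columnToDrop) {
    return input.drop(columnToDrop);
  }
}
```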



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-767) Support transformation when export to Hudi

2020-04-07 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-767:
---

Assignee: Raymond Xu

> Support transformation when export to Hudi
> --
>
> Key: HUDI-767
> URL: https://issues.apache.org/jira/browse/HUDI-767
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.6.0
>
>
> Main logic described in 
> https://github.com/apache/incubator-hudi/issues/1480#issuecomment-608529410
> In HoodieSnapshotExporter, we could extend the feature to include 
> transformation when --output-format hudi, using a custom Transformer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] xushiyan commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution

2020-04-07 Thread GitBox
xushiyan commented on issue #1480: [SUPPORT] Backwards Incompatible Schema 
Evolution
URL: https://github.com/apache/incubator-hudi/issues/1480#issuecomment-610745060
 
 
   @bvaradar Yes, I marked 767 for 0.6.0. I'll put 768 on the waiting list for 
the moment.  


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-767) Support transformation when export to Hudi

2020-04-07 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-767:

Fix Version/s: 0.6.0

> Support transformation when export to Hudi
> --
>
> Key: HUDI-767
> URL: https://issues.apache.org/jira/browse/HUDI-767
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Raymond Xu
>Priority: Major
> Fix For: 0.6.0
>
>
> Main logic described in 
> https://github.com/apache/incubator-hudi/issues/1480#issuecomment-608529410
> In HoodieSnapshotExporter, we could extend the feature to include 
> transformation when --output-format hudi, using a custom Transformer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-425) Implement support for bootstrapping in HoodieDeltaStreamer

2020-04-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-425:

Labels: help-wanted  (was: )

> Implement support for bootstrapping in HoodieDeltaStreamer
> --
>
> Key: HUDI-425
> URL: https://issues.apache.org/jira/browse/HUDI-425
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: help-wanted
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-558) Introduce ability to compress bloom filters while storing in parquet

2020-04-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-558:
---

Assignee: (was: Balaji Varadarajan)

> Introduce ability to compress bloom filters while storing in parquet
> 
>
> Key: HUDI-558
> URL: https://issues.apache.org/jira/browse/HUDI-558
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Based on performance study 
> [https://docs.google.com/spreadsheets/d/1KCmmdgaFTWBmpOk9trePdQ2m6wPVj2G328fTcRnQP1M/edit?usp=sharing]
>  we found that there is benefit in compressing bloom filters when storing in 
> parquet. As this is an experimental feature, we will need to disable this 
> feature by default.
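For illustration, a self-contained sketch of the kind of gzip round-trip such a feature might apply to the serialized bloom filter string kept in the parquet footer. It uses only the JDK and is not the code from the linked pull request.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Gzip + Base64 round-trip for a serialized bloom filter string (sketch only). */
public final class BloomFilterCodec {

  private BloomFilterCodec() {
  }

  public static String compress(String serializedBloomFilter) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
      gzip.write(serializedBloomFilter.getBytes(StandardCharsets.UTF_8));
    }
    // Base64 keeps the footer entry a plain string.
    return Base64.getEncoder().encodeToString(bytes.toByteArray());
  }

  public static String decompress(String footerValue) throws IOException {
    byte[] compressed = Base64.getDecoder().decode(footerValue);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      byte[] buffer = new byte[4096];
      int read;
      while ((read = gzip.read(buffer)) != -1) {
        out.write(buffer, 0, read);
      }
    }
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }
}
```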



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-289) Implement a test suite to support long running test for Hudi writing and querying end-end

2020-04-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-289:

Priority: Blocker  (was: Major)

> Implement a test suite to support long running test for Hudi writing and 
> querying end-end
> -
>
> Key: HUDI-289
> URL: https://issues.apache.org/jira/browse/HUDI-289
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Usability
>Reporter: Vinoth Chandar
>Assignee: vinoyang
>Priority: Blocker
> Fix For: 0.6.0
>
>
> We would need an equivalent of an end-to-end test which runs some workload for 
> at least a few hours, triggers various actions like commit, deltacommit, 
> rollback and compaction, and ensures correctness of the code before every release.
> P.S: Learn from all the CSS issues managing compaction..
> The feature branch is here: 
> [https://github.com/apache/incubator-hudi/tree/hudi_test_suite_refactor]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-132) Automate doc update/deploy process

2020-04-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-132.
-
Resolution: Duplicate

> Automate doc update/deploy process
> --
>
> Key: HUDI-132
> URL: https://issues.apache.org/jira/browse/HUDI-132
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Release  Administrative
>Reporter: Vinoth Chandar
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>
> The current docs (i.e. the content powering hudi.apache.org) build, test, deploy 
> process is described at 
> [https://github.com/apache/incubator-hudi/tree/asf-site] 
> It's a two-step process (1. change .md/template files, 2. generate the site 
> and upload) for making any changes. It would be nice to have automation on 
> GitHub Actions to automate the deploy of docs on the `asf-site` branch, such 
> that devs can just edit the docs and the merge will build and deploy the site 
> automatically.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-651) Incremental Query on Hive via Spark SQL does not return expected results

2020-04-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-651:

Priority: Blocker  (was: Major)

> Incremental Query on Hive via Spark SQL does not return expected results
> 
>
> Key: HUDI-651
> URL: https://issues.apache.org/jira/browse/HUDI-651
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Bhavani Sudha
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Using the docker demo, I added two delta commits to a MOR table and was 
> hoping to incrementally consume them like Hive QL, but something is amiss.
> {code}
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.start.timestamp","20200302210147")
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.mode","INCREMENTAL")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> +---+
> |_hoodie_commit_time|
> +---+
> |20200302210010 |
> |20200302210147 |
> +---+
> scala> sc.setLogLevel("INFO")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44 stored as 
> values in memory (estimated size 292.3 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44_piece0 stored 
> as bytes in memory (estimated size 25.4 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO storage.BlockManagerInfo: Added broadcast_44_piece0 in 
> memory on adhoc-1:45623 (size: 25.4 KB, free: 366.2 MB)
> 20/03/02 21:15:37 INFO spark.SparkContext: Created broadcast 44 from 
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
> file:/etc/hadoop/hive-site.xml], FileSystem: 
> [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root 
> (auth:SIMPLE)]]]
> 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of 
> type MERGE_ON_READ(version=1) from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO mapred.FileInputFormat: Total input paths to process : 
> 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 
> groups
> 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants 
> [[20200302210010__clean__COMPLETED], 
> [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], 
> [20200302210147__deltacommit__COMPLETED]]
> 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for 
> partition :2018/08/31, #FileGroups=1
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: 
> NumFiles=1, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Total paths to 
> process after hoodie filter 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
> file:/etc/hadoop/hive-site.xml], FileSystem: 
> 

[jira] [Updated] (HUDI-407) Implement a join-based index

2020-04-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-407:

Priority: Blocker  (was: Major)

> Implement a join-based index
> 
>
> Key: HUDI-407
> URL: https://issues.apache.org/jira/browse/HUDI-407
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, newbie, Performance
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-04-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-686:

Priority: Blocker  (was: Major)

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: lamber-ken
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-the-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-408) [Umbrella] Refactor/Code clean up hoodie write client

2020-04-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-408:

Priority: Blocker  (was: Critical)

> [Umbrella] Refactor/Code clean up hoodie write client 
> --
>
> Key: HUDI-408
> URL: https://issues.apache.org/jira/browse/HUDI-408
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Nishith Agarwal
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-558) Introduce ability to compress bloom filters while storing in parquet

2020-04-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-558:

Priority: Blocker  (was: Major)

> Introduce ability to compress bloom filters while storing in parquet
> 
>
> Key: HUDI-558
> URL: https://issues.apache.org/jira/browse/HUDI-558
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Based on performance study 
> [https://docs.google.com/spreadsheets/d/1KCmmdgaFTWBmpOk9trePdQ2m6wPVj2G328fTcRnQP1M/edit?usp=sharing]
>  we found that there is benefit in compressing bloom filters when storing in 
> parquet. As this is an experimental feature, we will need to disable this 
> feature by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-04-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-686:
---

Assignee: lamber-ken  (was: Vinoth Chandar)

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-the-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-288) Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-04-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-288.
-
Resolution: Fixed

> Add support for ingesting multiple kafka streams in a single DeltaStreamer 
> deployment
> -
>
> Key: HUDI-288
> URL: https://issues.apache.org/jira/browse/HUDI-288
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Vinoth Chandar
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> https://lists.apache.org/thread.html/3a69934657c48b1c0d85cba223d69cb18e18cd8aaa4817c9fd72cef6@
>  has all the context



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-242) Support Efficient bootstrap of large parquet datasets to Hudi

2020-04-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-242:

Priority: Blocker  (was: Major)

> Support Efficient bootstrap of large parquet datasets to Hudi
> -
>
> Key: HUDI-242
> URL: https://issues.apache.org/jira/browse/HUDI-242
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
>  Support Efficient bootstrap of large parquet tables



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-408) [Umbrella] Refactor/Code clean up hoodie write client

2020-04-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-408:

Priority: Critical  (was: Major)

> [Umbrella] Refactor/Code clean up hoodie write client 
> --
>
> Key: HUDI-408
> URL: https://issues.apache.org/jira/browse/HUDI-408
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Nishith Agarwal
>Assignee: Vinoth Chandar
>Priority: Critical
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-770) Organize ingest API implementation under a single package

2020-04-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-770:

Labels: pull-request-available  (was: )

> Organize ingest API implementation under a single package
> -
>
> Key: HUDI-770
> URL: https://issues.apache.org/jira/browse/HUDI-770
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Organize write logic as part of upsert/insert API variations into new Action 
> interfaces.
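A hedged skeleton of the executor-per-action layout the ticket describes. The real BaseActionExecutor and commit executors in the accompanying PR will differ; every name below is illustrative.

```java
/**
 * Skeleton only: one executor class per write action, all sharing a base
 * class that carries the instant time. Not the code from the PR.
 */
abstract class BaseActionExecutorSketch<R> {

  protected final String instantTime;

  protected BaseActionExecutorSketch(String instantTime) {
    this.instantTime = instantTime;
  }

  /** Each write action (upsert, insert, bulk insert, ...) implements execute(). */
  public abstract R execute();
}

final class UpsertActionExecutorSketch extends BaseActionExecutorSketch<String> {

  UpsertActionExecutorSketch(String instantTime) {
    super(instantTime);
  }

  @Override
  public String execute() {
    // A real implementation would tag, partition and write the records, then commit.
    return "upserted@" + instantTime;
  }
}
```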



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bvaradar opened a new pull request #1495: [HUDI-770] Organize upsert/insert API implementation under a single package

2020-04-07 Thread GitBox
bvaradar opened a new pull request #1495: [HUDI-770] Organize upsert/insert API 
implementation under a single package
URL: https://github.com/apache/incubator-hudi/pull/1495
 
 
   [HUDI-770] Organize upsert/insert API implementation under a single package


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-770) Organize ingest API implementation under a single package

2020-04-07 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-770:
---

 Summary: Organize ingest API implementation under a single package
 Key: HUDI-770
 URL: https://issues.apache.org/jira/browse/HUDI-770
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Writer Core
Reporter: Balaji Varadarajan
 Fix For: 0.6.0


Organize write logic as part of upsert/insert API variations into new Action 
interfaces.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-770) Organize ingest API implementation under a single package

2020-04-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-770:
---

Assignee: Balaji Varadarajan

> Organize ingest API implementation under a single package
> -
>
> Key: HUDI-770
> URL: https://issues.apache.org/jira/browse/HUDI-770
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> Organize write logic as part of upsert/insert API variations into new Action 
> interfaces.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #241

2020-04-07 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.41 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[GitHub] [incubator-hudi] garyli1019 commented on issue #1486: [HUDI-759] Integrate checkpoint privoder with delta streamer

2020-04-07 Thread GitBox
garyli1019 commented on issue #1486: [HUDI-759] Integrate checkpoint privoder 
with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#issuecomment-610701882
 
 
   Add https://github.com/apache/incubator-hudi/pull/1493 into this PR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 closed pull request #1493: [MINOR] remove Hive dependency from delta streamer

2020-04-07 Thread GitBox
garyli1019 closed pull request #1493: [MINOR] remove Hive dependency from delta 
streamer
URL: https://github.com/apache/incubator-hudi/pull/1493
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 commented on issue #1493: [MINOR] remove Hive dependency from delta streamer

2020-04-07 Thread GitBox
garyli1019 commented on issue #1493: [MINOR] remove Hive dependency from delta 
streamer
URL: https://github.com/apache/incubator-hudi/pull/1493#issuecomment-610700940
 
 
   combine with https://github.com/apache/incubator-hudi/pull/1486


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1488: [SUPPORT] Hudi table has only five rows when record key is binary

2020-04-07 Thread GitBox
lamber-ken edited a comment on issue #1488: [SUPPORT] Hudi table has only five 
rows when record key is binary
URL: https://github.com/apache/incubator-hudi/issues/1488#issuecomment-610681461
 
 
   hi @jvaesteves 
   
   > the partition name is /18228, is this the expected behaviour?
   
   It's not the expected behaviour; the partition field is also required to be a 
string type. Here is a working demo:
   
   ```
   ${SPARK_HOME}/bin/spark-shell \
   --driver-memory 6G \
   --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   import org.apache.spark.sql.functions._
   
   val tableName = "hudi_mor_table"
   val basePath = "file:///tmp/hudi_mor_tablen"
   
   val hudiOptions = Map[String,String](
 "hoodie.insert.shuffle.parallelism" -> "10",
 "hoodie.upsert.shuffle.parallelism" -> "10",
 "hoodie.delete.shuffle.parallelism" -> "10",
 "hoodie.bulkinsert.shuffle.parallelism" -> "10",
 "hoodie.datasource.write.recordkey.field" -> "key",
 "hoodie.datasource.write.partitionpath.field" -> "dt", 
 "hoodie.table.name" -> tableName,
 "hoodie.datasource.write.precombine.field" -> "timestamp"
   )
   
   val inputDF = spark.range(1, 5).
  withColumn("key", $"id").
  withColumn("data", lit("data")).
  withColumn("timestamp", current_timestamp()).
  withColumn("dt", date_format($"timestamp", "-MM-dd"))
   
   inputDF.write.format("org.apache.hudi").
 options(hudiOptions).
 mode("Overwrite").
 save(basePath)
   ```
   
   
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken closed issue #1375: [SUPPORT] HoodieDeltaStreamer offset not handled correctly

2020-04-07 Thread GitBox
lamber-ken closed issue #1375: [SUPPORT] HoodieDeltaStreamer offset not handled 
correctly
URL: https://github.com/apache/incubator-hudi/issues/1375
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1488: [SUPPORT] Hudi table has only five rows when record key is binary

2020-04-07 Thread GitBox
lamber-ken commented on issue #1488: [SUPPORT] Hudi table has only five rows 
when record key is binary
URL: https://github.com/apache/incubator-hudi/issues/1488#issuecomment-610681802
 
 
   
![image](https://user-images.githubusercontent.com/20113411/78730984-eeec8f00-7970-11ea-83bd-fae208a13d62.png)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1488: [SUPPORT] Hudi table has only five rows when record key is binary

2020-04-07 Thread GitBox
lamber-ken commented on issue #1488: [SUPPORT] Hudi table has only five rows 
when record key is binary
URL: https://github.com/apache/incubator-hudi/issues/1488#issuecomment-610681461
 
 
   hi @jvaesteves 
   
   > the partition name is /18228, is this the expected behaviour?
   
   It's not the expected behaviour; the partition field is also required to be a 
string type. Here is a working demo:
   
   ```
   ${SPARK_HOME}/bin/spark-shell \
   --driver-memory 6G \
   --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   import org.apache.spark.sql.functions._
   
   val tableName = "hudi_mor_table"
   val basePath = "file:///tmp/hudi_mor_tablen"
   
   val hudiOptions = Map[String,String](
 "hoodie.insert.shuffle.parallelism" -> "10",
 "hoodie.upsert.shuffle.parallelism" -> "10",
 "hoodie.delete.shuffle.parallelism" -> "10",
 "hoodie.bulkinsert.shuffle.parallelism" -> "10",
 "hoodie.datasource.write.recordkey.field" -> "key",
 "hoodie.datasource.write.partitionpath.field" -> "dt", 
 "hoodie.table.name" -> tableName,
 "hoodie.datasource.write.precombine.field" -> "timestamp"
   )
   
   val inputDF = spark.range(1, 5).
  withColumn("key", $"id").
  withColumn("data", lit("data")).
  withColumn("timestamp", current_timestamp()).
  withColumn("dt", date_format($"timestamp", "-MM-dd"))
   
   inputDF.write.format("org.apache.hudi").
 options(hudiOptions).
 mode("Overwrite").
 save(basePath)
   ```
   
   
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-769) Write blog about HoodieMultiTableDeltaStreamer in cwiki

2020-04-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-769:

Description: (was: Relevant Section : 
[https://hudi.apache.org/docs/writing_data.html#deltastreamer]

Add high-level description about this tool )

> Write blog about HoodieMultiTableDeltaStreamer in cwiki
> ---
>
> Key: HUDI-769
> URL: https://issues.apache.org/jira/browse/HUDI-769
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs, docs-chinese
>Reporter: Balaji Varadarajan
>Assignee: Pratyaksh Sharma
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-769) Write blog about HoodieMultiTableDeltaStreamer in cwiki

2020-04-07 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-769:
---

 Summary: Write blog about HoodieMultiTableDeltaStreamer in cwiki
 Key: HUDI-769
 URL: https://issues.apache.org/jira/browse/HUDI-769
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Docs, docs-chinese
Reporter: Balaji Varadarajan
Assignee: Pratyaksh Sharma
 Fix For: 0.6.0


Relevant Section : 
[https://hudi.apache.org/docs/writing_data.html#deltastreamer]

Add high-level description about this tool 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-766) Update Apache Hudi website with usage info about HoodieMultiTableDeltaStreamer

2020-04-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-766:

Status: Open  (was: New)

> Update Apache Hudi website with usage info about HoodieMultiTableDeltaStreamer
> --
>
> Key: HUDI-766
> URL: https://issues.apache.org/jira/browse/HUDI-766
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs, docs-chinese
>Reporter: Balaji Varadarajan
>Assignee: Pratyaksh Sharma
>Priority: Major
> Fix For: 0.6.0
>
>
> Relevant Section : 
> [https://hudi.apache.org/docs/writing_data.html#deltastreamer]
> Add high-level description about this tool 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[incubator-hudi] branch master updated: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment (#1150)

2020-04-07 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new d610252  [HUDI-288]: Add support for ingesting multiple kafka streams 
in a single DeltaStreamer deployment (#1150)
d610252 is described below

commit d610252d6b54dcbb1897bb8881d59c3838a46c18
Author: Pratyaksh Sharma 
AuthorDate: Wed Apr 8 04:40:26 2020 +0530

[HUDI-288]: Add support for ingesting multiple kafka streams in a single 
DeltaStreamer deployment (#1150)

* [HUDI-288]: Add support for ingesting multiple kafka streams in a single 
DeltaStreamer deployment
---
 .../apache/hudi/table/HoodieCommitArchiveLog.java  |  10 +-
 .../java/org/apache/hudi/client/TestMultiFS.java   |   4 +-
 .../hudi/common/HoodieTestDataGenerator.java   | 151 ++--
 .../org/apache/hudi/hive/TestHiveSyncTool.java |   6 +
 .../test/java/org/apache/hudi/hive/TestUtil.java   |   9 +-
 .../org/apache/hudi/hive/util/HiveTestService.java |  25 ++
 .../org/apache/hudi/integ/ITTestHoodieDemo.java|   8 +-
 .../scala/org/apache/hudi/DataSourceOptions.scala  |   1 +
 .../hudi/utilities/deltastreamer/DeltaSync.java|   9 +-
 .../deltastreamer/HoodieDeltaStreamer.java |  32 +-
 .../HoodieMultiTableDeltaStreamer.java | 393 +
 .../deltastreamer/TableExecutionContext.java   |  85 +
 .../hudi/utilities/TestHoodieDeltaStreamer.java|  64 +++-
 .../TestHoodieMultiTableDeltaStreamer.java | 166 +
 .../apache/hudi/utilities/UtilitiesTestBase.java   |  19 +-
 .../utilities/sources/AbstractBaseTestSource.java  |  19 +-
 .../sources/AbstractDFSSourceTestBase.java |   2 +-
 .../hudi/utilities/sources/TestKafkaSource.java|   9 +-
 .../utilities/sources/TestParquetDFSSource.java|   2 +-
 .../invalid_hive_sync_uber_config.properties   |  23 ++
 .../short_trip_uber_config.properties  |  24 ++
 .../source_short_trip_uber.avsc|  44 +++
 .../delta-streamer-config/source_uber.avsc |  44 +++
 .../target_short_trip_uber.avsc|  44 +++
 .../delta-streamer-config/target_uber.avsc |  44 +++
 .../delta-streamer-config/uber_config.properties   |  25 ++
 26 files changed, 1184 insertions(+), 78 deletions(-)

diff --git 
a/hudi-client/src/main/java/org/apache/hudi/table/HoodieCommitArchiveLog.java 
b/hudi-client/src/main/java/org/apache/hudi/table/HoodieCommitArchiveLog.java
index 635e96b..73dd799 100644
--- 
a/hudi-client/src/main/java/org/apache/hudi/table/HoodieCommitArchiveLog.java
+++ 
b/hudi-client/src/main/java/org/apache/hudi/table/HoodieCommitArchiveLog.java
@@ -292,7 +292,8 @@ public class HoodieCommitArchiveLog {
 archivedMetaWrapper.setActionType(ActionType.clean.name());
 break;
   }
-  case HoodieTimeline.COMMIT_ACTION: {
+  case HoodieTimeline.COMMIT_ACTION:
+  case HoodieTimeline.DELTA_COMMIT_ACTION: {
 HoodieCommitMetadata commitMetadata = HoodieCommitMetadata
 .fromBytes(commitTimeline.getInstantDetails(hoodieInstant).get(), 
HoodieCommitMetadata.class);
 
archivedMetaWrapper.setHoodieCommitMetadata(convertCommitMetadata(commitMetadata));
@@ -311,13 +312,6 @@ public class HoodieCommitArchiveLog {
 archivedMetaWrapper.setActionType(ActionType.savepoint.name());
 break;
   }
-  case HoodieTimeline.DELTA_COMMIT_ACTION: {
-HoodieCommitMetadata commitMetadata = HoodieCommitMetadata
-.fromBytes(commitTimeline.getInstantDetails(hoodieInstant).get(), 
HoodieCommitMetadata.class);
-
archivedMetaWrapper.setHoodieCommitMetadata(convertCommitMetadata(commitMetadata));
-archivedMetaWrapper.setActionType(ActionType.commit.name());
-break;
-  }
   case HoodieTimeline.COMPACTION_ACTION: {
 HoodieCompactionPlan plan = 
CompactionUtils.getCompactionPlan(metaClient, hoodieInstant.getTimestamp());
 archivedMetaWrapper.setHoodieCompactionPlan(plan);
diff --git a/hudi-client/src/test/java/org/apache/hudi/client/TestMultiFS.java 
b/hudi-client/src/test/java/org/apache/hudi/client/TestMultiFS.java
index 0e606e4..c6ec523 100644
--- a/hudi-client/src/test/java/org/apache/hudi/client/TestMultiFS.java
+++ b/hudi-client/src/test/java/org/apache/hudi/client/TestMultiFS.java
@@ -68,7 +68,7 @@ public class TestMultiFS extends HoodieClientTestHarness {
 cleanupTestDataGenerator();
   }
 
-  private HoodieWriteClient getHoodieWriteClient(HoodieWriteConfig config) 
throws Exception {
+  private HoodieWriteClient getHoodieWriteClient(HoodieWriteConfig config) {
 return new HoodieWriteClient(jsc, config);
   }
 
@@ -89,7 +89,7 @@ public class TestMultiFS extends HoodieClientTestHarness {
 HoodieWriteConfig localConfig = getHoodieWriteConfig(tablePath);
 
 try (HoodieWriteClient 

[GitHub] [incubator-hudi] bvaradar merged pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-04-07 Thread GitBox
bvaradar merged pull request #1150: [HUDI-288]: Add support for ingesting 
multiple kafka streams in a single DeltaStreamer deployment
URL: https://github.com/apache/incubator-hudi/pull/1150
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-04-07 Thread GitBox
codecov-io edited a comment on issue #1150: [HUDI-288]: Add support for 
ingesting multiple kafka streams in a single DeltaStreamer deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#issuecomment-605617268
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1150?src=pr=h1) 
Report
   > Merging 
[#1150](https://codecov.io/gh/apache/incubator-hudi/pull/1150?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/eaf6cc2d90bf27c0d9414a4ea18dbd1b61f58e50=desc)
 will **increase** coverage by `0.09%`.
   > The diff coverage is `78.06%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1150/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1150?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1150      +/-   ##
   
   + Coverage     71.54%   71.64%    +0.09% 
   - Complexity      261      290       +29 
   
     Files           336      338        +2 
     Lines         15744    15931      +187 
     Branches       1610     1625       +15 
   
   + Hits          11264    11413      +149 
   - Misses         3759     3785       +26 
   - Partials        721      733       +12 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1150?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[.../org/apache/hudi/table/HoodieCommitArchiveLog.java](https://codecov.io/gh/apache/incubator-hudi/pull/1150/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29tbWl0QXJjaGl2ZUxvZy5qYXZh)
 | `77.48% <ø> (+2.48%)` | `0.00 <0.00> (ø)` | |
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/incubator-hudi/pull/1150/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `71.85% <ø> (-0.15%)` | `38.00 <0.00> (ø)` | |
   | 
[...utilities/deltastreamer/TableExecutionContext.java](https://codecov.io/gh/apache/incubator-hudi/pull/1150/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvVGFibGVFeGVjdXRpb25Db250ZXh0LmphdmE=)
 | `65.00% <65.00%> (ø)` | `9.00 <9.00> (?)` | |
   | 
[...s/deltastreamer/HoodieMultiTableDeltaStreamer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1150/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvSG9vZGllTXVsdGlUYWJsZURlbHRhU3RyZWFtZXIuamF2YQ==)
 | `78.88% <78.88%> (ø)` | `18.00 <18.00> (?)` | |
   | 
[...i/utilities/deltastreamer/HoodieDeltaStreamer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1150/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvSG9vZGllRGVsdGFTdHJlYW1lci5qYXZh)
 | `78.77% <85.71%> (+0.16%)` | `10.00 <2.00> (+2.00)` | |
   | 
[...main/scala/org/apache/hudi/DataSourceOptions.scala](https://codecov.io/gh/apache/incubator-hudi/pull/1150/diff?src=pr=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvRGF0YVNvdXJjZU9wdGlvbnMuc2NhbGE=)
 | `93.25% <100.00%> (+0.07%)` | `0.00 <0.00> (ø)` | |
   | 
[...n/java/org/apache/hudi/common/model/HoodieKey.java](https://codecov.io/gh/apache/incubator-hudi/pull/1150/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUtleS5qYXZh)
 | `88.88% <0.00%> (ø)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1150?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1150?src=pr=footer).
 Last update 
[eaf6cc2...1eece0d](https://codecov.io/gh/apache/incubator-hudi/pull/1150?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-04-07 Thread GitBox
satishkotha commented on a change in pull request #1396: [HUDI-687] Stop 
incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r405144082
 
 

 ##
 File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java
 ##
 @@ -118,6 +119,34 @@
 return returns.toArray(new FileStatus[returns.size()]);
   }
 
+  /**
+   * Filter any specific instants that we do not want to process.
+   * example timeline:
+   *
+   * t0 -> create bucket1.parquet
+   * t1 -> create and append updates bucket1.log
+   * t2 -> request compaction
+   * t3 -> create bucket2.parquet
+   *
+   * if compaction at t2 takes a long time, incremental readers on RO tables 
can move to t3 and would skip updates in t1
+   *
+   * To workaround this problem, we want to stop returning data belonging to 
commits > t2.
+   * After compaction is complete, incremental reader would see updates in t2, 
t3, so on.
+   */
+  protected HoodieDefaultTimeline filterInstantsTimeline(HoodieDefaultTimeline 
timeline) {
+Option pendingCompactionInstant = 
timeline.filterPendingCompactionTimeline().firstInstant();
+if (pendingCompactionInstant.isPresent()) {
 
 Review comment:
   I can introduce a jobConf variable. I also agree with you that RT is the
right approach. I'm just suggesting that we remove the incremental-read examples
from the various documents, or throw an explicit error when someone tries
incremental reads on RO views. For example, the docker demo shows
[incremental reads on RO views](https://hudi.apache.org/docs/docker_demo.html#step-9-run-hive-queries-including-incremental-queries).
So people are likely to use this and end up with this difficult-to-debug problem.
   
   Also, you are right about the last statement. I already have tests showing
that the compaction timestamp is used for all updated records, not the update
timestamp. Please see line 256 (the last line in
TestMergeOnReadTable#testIncrementalReadsWithCompaction), which does not include
updateTime. We validate that records include compactionCommitTime instead.
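   
   To make the filtering intent concrete, here is a minimal sketch of the idea
described above. The helper name `instantsSafeForIncrementalRead` is hypothetical
(it is not the method in this PR); it only illustrates "expose nothing at or
after the first pending compaction request", relying on Hudi instant times being
lexicographically ordered strings.
   
   ```java
   import java.util.List;
   import java.util.stream.Collectors;
   
   import org.apache.hudi.common.table.timeline.HoodieDefaultTimeline;
   import org.apache.hudi.common.table.timeline.HoodieInstant;
   import org.apache.hudi.common.util.Option;
   
   public class IncrementalReadTimelineSketch {
   
     // Keep only instants strictly earlier than the first pending compaction request,
     // so an incremental reader on the RO view cannot slide past updates that the
     // compaction has not yet folded into the base files.
     public static List<HoodieInstant> instantsSafeForIncrementalRead(HoodieDefaultTimeline timeline) {
       Option<HoodieInstant> pendingCompaction =
           timeline.filterPendingCompactionTimeline().firstInstant();
       if (!pendingCompaction.isPresent()) {
         // No compaction in flight: every instant is safe to expose.
         return timeline.getInstants().collect(Collectors.toList());
       }
       String compactionTs = pendingCompaction.get().getTimestamp();
       return timeline.getInstants()
           .filter(instant -> instant.getTimestamp().compareTo(compactionTs) < 0)
           .collect(Collectors.toList());
     }
   }
   ```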


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

2020-04-07 Thread GitBox
lamber-ken edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during 
upsert 53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610635678
 
 
   > Can you give this a shot on a cluster?
   
   btw, what @vinothchandar means is: run your snippet on a YARN cluster if
possible, so we can see how it behaves when running in a cluster.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

2020-04-07 Thread GitBox
lamber-ken commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 
53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610630535
 
 
   
   
![image](https://user-images.githubusercontent.com/20113411/78721369-fc4a4f00-7959-11ea-8fa2-340717c3a233.png)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

2020-04-07 Thread GitBox
lamber-ken commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 
53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610626104
 
 
   hi @tverdokhlebd, thanks for the detailed spark log. From your description and
dataset, the key information is:
   - it runs on a local machine
   - the size of each record is large
   
   I noticed that the OOM happened after parquet read the old records into memory,
which means we need to control how many old records are buffered, so try adding
this option:
   ```
   .option("hoodie.write.buffer.limit.bytes", "131072")  // 128 KB
   ```
   
   
![image](https://user-images.githubusercontent.com/20113411/78720105-a96f9800-7957-11ea-86b5-984978d169a2.png)
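   
   For reference, a minimal sketch (Java DataFrameWriter API; table name, path and
parallelism values are illustrative) of where this option would sit in the writer
chain posted earlier in this issue. Note that 131072 bytes is 128 KB.
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   
   public class UpsertWithWriteBuffer {
   
     // Same writer chain as the reporter's snippet, with the suggested buffer cap added.
     public static void upsert(Dataset<Row> df, String outputPath) {
       df.write()
           .format("org.apache.hudi")
           .option("hoodie.table.name", "ext_ml_data")
           .option("hoodie.datasource.write.operation", "upsert")
           .option("hoodie.upsert.shuffle.parallelism", "8")
           // Cap Hudi's in-memory write buffer (131072 bytes = 128 KB) while old
           // records are read back from parquet and merged with incoming records.
           .option("hoodie.write.buffer.limit.bytes", "131072")
           .mode(SaveMode.Append)
           .save(outputPath);
     }
   }
   ```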
   
   
   
   
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1402: [WIP][HUDI-407] Adding Simple Index

2020-04-07 Thread GitBox
vinothchandar commented on a change in pull request #1402: [WIP][HUDI-407] 
Adding Simple Index
URL: https://github.com/apache/incubator-hudi/pull/1402#discussion_r405076523
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieSimpleIndex.java
 ##
 @@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bloom;
+
+import org.apache.hudi.WriteStatus;
+import org.apache.hudi.common.model.HoodieDataFile;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ParquetUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.Optional;
+import org.apache.spark.api.java.function.PairFunction;
+import org.apache.spark.storage.StorageLevel;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+
+import scala.Tuple2;
+
+import static java.util.stream.Collectors.toList;
+
+/**
+ * A simple index which reads interested fields from parquet and joins with 
incoming records to find the tagged location
+ *
+ * @param 
+ */
+public class HoodieSimpleIndex extends 
HoodieBloomIndex {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieSimpleIndex.class);
+
+  public HoodieSimpleIndex(HoodieWriteConfig config) {
+super(config);
+  }
+
+  /**
+   * Returns an RDD mapping each HoodieKey with a partitionPath/fileID which 
contains it. Option.Empty if the key is not
+   * found.
+   *
+   * @param hoodieKeys  keys to lookup
+   * @param jsc spark context
+   * @param hoodieTable hoodie table object
+   */
+  @Override
+  public JavaPairRDD>> 
fetchRecordLocation(JavaRDD hoodieKeys,
+   
   JavaSparkContext jsc, HoodieTable hoodieTable) {
+JavaPairRDD partitionRecordKeyPairRDD =
+hoodieKeys.mapToPair(key -> new Tuple2<>(key.getPartitionPath(), 
key.getRecordKey()));
+
+// Lookup indexes for all the partition/recordkey pair
+JavaPairRDD recordKeyLocationRDD =
+lookupIndex(partitionRecordKeyPairRDD, jsc, hoodieTable);
+
+JavaPairRDD keyHoodieKeyPairRDD = 
hoodieKeys.mapToPair(key -> new Tuple2<>(key, null));
+
+return 
keyHoodieKeyPairRDD.leftOuterJoin(recordKeyLocationRDD).mapToPair(keyLoc -> {
+  Option> partitionPathFileidPair;
+  if (keyLoc._2._2.isPresent()) {
+partitionPathFileidPair = 
Option.of(Pair.of(keyLoc._1().getPartitionPath(), 
keyLoc._2._2.get().getFileId()));
+  } else {
+partitionPathFileidPair = Option.empty();
+  }
+  return new Tuple2<>(keyLoc._1, partitionPathFileidPair);
+});
+  }
+
+  @Override
+  public JavaRDD> tagLocation(JavaRDD> 
recordRDD, JavaSparkContext jsc,
+  HoodieTable hoodieTable) {
+
+// Step 0: cache the input record RDD
+if (config.getBloomIndexUseCaching()) {
+  recordRDD.persist(config.getBloomIndexInputStorageLevel());
+}
+
+// Step 1: Extract out thinner JavaPairRDD of (partitionPath, recordKey)
+JavaPairRDD partitionRecordKeyPairRDD =
+recordRDD.mapToPair(record -> new Tuple2<>(record.getPartitionPath(), 
record.getRecordKey()));
+
+// Lookup indexes for all the partition/recordkey pair
+JavaPairRDD keyFilenamePairRDD =
+lookupIndex(partitionRecordKeyPairRDD, jsc, hoodieTable);
+
+// Cache the result, for subsequent stages.
+if (config.getBloomIndexUseCaching()) {
+  

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding Simple Index

2020-04-07 Thread GitBox
nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding 
Simple Index
URL: https://github.com/apache/incubator-hudi/pull/1402#discussion_r405068106
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/TestHoodieClientOnCopyOnWriteStorage.java
 ##
 @@ -79,9 +82,22 @@
 import static org.mockito.Mockito.when;
 
 @SuppressWarnings("unchecked")
+@RunWith(Parameterized.class)
 public class TestHoodieClientOnCopyOnWriteStorage extends TestHoodieClientBase 
{
 
   private static final Logger LOG = 
LogManager.getLogger(TestHoodieClientOnCopyOnWriteStorage.class);
+  private final IndexType indexType;
+
+  @Parameterized.Parameters(name = "{index}: Test with IndexType={0}")
+  public static Collection data() {
+Object[][] data =
+new Object[][] 
{{IndexType.BLOOM},{IndexType.GLOBAL_BLOOM},{IndexType.SIMPLE}};
 
 Review comment:
   approx. 2 mins per index type.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding Simple Index

2020-04-07 Thread GitBox
nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding 
Simple Index
URL: https://github.com/apache/incubator-hudi/pull/1402#discussion_r405067795
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieSimpleIndex.java
 ##
 @@ -0,0 +1,244 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bloom;
+
+import org.apache.hudi.WriteStatus;
+import org.apache.hudi.common.model.HoodieDataFile;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ParquetUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+
+import com.clearspring.analytics.util.Lists;
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.Optional;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.api.java.function.PairFunction;
+import org.apache.spark.storage.StorageLevel;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import scala.Tuple2;
+
+import static java.util.stream.Collectors.toList;
+
+/**
+ * A simple index which reads interested fields from parquet and joins with 
incoming records to find the tagged location
+ *
+ * @param 
+ */
+public class HoodieSimpleIndex extends 
HoodieBloomIndex {
 
 Review comment:
   There are not a lot of common code blocks between the bloom index and the simple
index except for one method (getTaggedRecord). So I didn't find a need to create an
AbstractFileLevelIndex. Open to discussion on how we can add this and whether it is
required in this patch.
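   
   For discussion's sake, a rough, hypothetical sketch of what such a base class
might hold if we ever extract it (illustrative only, not code in this patch; it
carries just the one shared helper):
   
   ```java
   import org.apache.hudi.common.model.HoodieRecord;
   import org.apache.hudi.common.model.HoodieRecordLocation;
   import org.apache.hudi.common.model.HoodieRecordPayload;
   import org.apache.hudi.common.util.Option;
   import org.apache.hudi.config.HoodieWriteConfig;
   import org.apache.hudi.index.HoodieIndex;
   
   public abstract class AbstractFileLevelIndex<T extends HoodieRecordPayload> extends HoodieIndex<T> {
   
     protected AbstractFileLevelIndex(HoodieWriteConfig config) {
       super(config);
     }
   
     // The one helper shared by the bloom and simple indexes: attach the looked-up
     // location (if any) to the incoming record.
     protected static <R extends HoodieRecordPayload> HoodieRecord<R> getTaggedRecord(
         HoodieRecord<R> record, Option<HoodieRecordLocation> location) {
       if (location.isPresent()) {
         // Copy before tagging so a record that shows up under multiple files is not
         // mutated twice.
         HoodieRecord<R> tagged = new HoodieRecord<>(record);
         tagged.unseal();
         tagged.setCurrentLocation(location.get());
         tagged.seal();
         return tagged;
       }
       return record;
     }
   }
   ```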


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] tverdokhlebd edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

2020-04-07 Thread GitBox
tverdokhlebd edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during 
upsert 53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610580236
 
 
   > Can you give this a shot on a cluster?
   
   Do you mean access to a cluster? Those steps also reproduce on my
local machine.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] tverdokhlebd edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

2020-04-07 Thread GitBox
tverdokhlebd edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during 
upsert 53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610564674
 
 
   Code:
   
   sparkSession
 .read
 .jdbc(
   url = jdbcConfig.url,
   table = table,
   columnName = "partition",
   lowerBound = 0,
   upperBound = jdbcConfig.partitionsCount.toInt,
   numPartitions = jdbcConfig.partitionsCount.toInt,
   connectionProperties = new Properties() {
 put("driver", jdbcConfig.driver)
 put("user", jdbcConfig.user)
 put("password", jdbcConfig.password)
   }
 )
 .withColumn("year", substring(col(jdbcConfig.dateColumnName), 0, 4))
 .withColumn("month", substring(col(jdbcConfig.dateColumnName), 6, 2))
 .withColumn("day", substring(col(jdbcConfig.dateColumnName), 9, 2))
 .write
 .option(HoodieWriteConfig.TABLE_NAME, hudiConfig.tableName)
 .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, 
hudiConfig.recordKey)
 .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
hudiConfig.precombineKey)
 .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
hudiConfig.partitionPathKey)
 .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
classOf[ComplexKeyGenerator].getName)
 .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
 .option("hoodie.datasource.write.operation", writeOperation)
 .option("hoodie.bulkinsert.shuffle.parallelism", 
hudiConfig.bulkInsertParallelism)
 .option("hoodie.insert.shuffle.parallelism", hudiConfig.parallelism)
 .option("hoodie.upsert.shuffle.parallelism", hudiConfig.parallelism)
 .option("hoodie.cleaner.policy", 
HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name())
 .option("hoodie.cleaner.fileversions.retained", "1")
 .option("hoodie.metrics.graphite.host", hudiConfig.graphiteHost)
 .option("hoodie.metrics.graphite.port", hudiConfig.graphitePort)
 .option("hoodie.metrics.graphite.metric.prefix", 
hudiConfig.graphiteMetricPrefix)
 .format("org.apache.hudi")
 .mode(SaveMode.Append)
 .save(outputPath)
   
   This code is executed from Jenkins with the following parameters:
   
   docker run --rm -v ${PWD}:${PWD} -v /mnt/ml_data:/mnt/ml_data 
bde2020/spark-master:2.4.5-hadoop2.7 \
   bash ./spark/bin/spark-submit \
   --master "local[2]" \
   --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.2-incubating,org.apache.hadoop:hadoop-aws:2.7.3,org.apache.spark:spark-avro_2.11:2.4.4
 \
   --conf spark.local.dir=/mnt/ml_data \
   --conf spark.ui.enabled=false \
   --conf spark.driver.memory=4g \
   --conf spark.driver.memoryOverhead=1024 \
   --conf spark.driver.maxResultSize=2g \
   --conf spark.kryoserializer.buffer.max=512m \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.rdd.compress=true \
   --conf spark.shuffle.service.enabled=true \
   --conf spark.sql.hive.convertMetastoreParquet=false \
   --conf spark.hadoop.fs.defaultFS=s3a://ir-mtu-ml-bucket/ml_hudi \
   --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
   --conf spark.hadoop.fs.s3a.access.key=${AWS_ACCESS_KEY_ID} \
   --conf spark.hadoop.fs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY} \
   --conf spark.executorEnv.period.startDate=${date} \
   --conf spark.executorEnv.period.numDays=${numDays} \
   --conf spark.executorEnv.jdbc.url=${VERTICA_URL} \
   --conf spark.executorEnv.jdbc.user=${VERTICA_USER} \
   --conf spark.executorEnv.jdbc.password=${VERTICA_PWD} \
   --conf spark.executorEnv.jdbc.driver=${VERTICA_DRIVER}\
   --conf spark.executorEnv.jdbc.schemaName=mtu_owner \
   --conf spark.executorEnv.jdbc.tableName=ext_ml_data \
   --conf spark.executorEnv.jdbc.dateColumnName=hit_date \
   --conf spark.executorEnv.jdbc.partitionColumnName=hit_timestamp \
   --conf spark.executorEnv.jdbc.partitionsCount=8 \
   --conf spark.executorEnv.hudi.outputPath=s3a://ir-mtu-ml-bucket/ml_hudi \
   --conf spark.executorEnv.hudi.tableName=ext_ml_data \
   --conf spark.executorEnv.hudi.recordKey=tds_cid \
   --conf spark.executorEnv.hudi.precombineKey=hit_timestamp \
   --conf spark.executorEnv.hudi.parallelism=8 \
   --conf spark.executorEnv.hudi.bulkInsertParallelism=8 \
   --class mtu.spark.analytics.ExtMLDataToS3 \
   ${PWD}/ml-vertica-to-s3-hudi.jar
   
   I am trying to move 53 million records (the table contains 48 columns) from the
Vertica database to S3 storage.
   The "bulk_insert" operation completes successfully and takes ~40-50 minutes.
   The "upsert" operation on the same records throws OOM exceptions.
   
   On Hudi 0.5.1 "upsert" operations were hanging. I found issue
https://github.com/apache/incubator-hudi/issues/1328 and updated Hudi to 0.5.2.
The hanging problem, it seems to me, was resolved.
   
   After the "bulk_insert" operation, the total size of the data on S3 storage is
3.7GB.
   The data exported from the database to a CSV file for ~30M records is ~8.6GB.


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding Simple Index

2020-04-07 Thread GitBox
nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding 
Simple Index
URL: https://github.com/apache/incubator-hudi/pull/1402#discussion_r405053500
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieSimpleIndex.java
 ##
 @@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bloom;
+
+import org.apache.hudi.WriteStatus;
+import org.apache.hudi.common.model.HoodieDataFile;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ParquetUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.Optional;
+import org.apache.spark.api.java.function.PairFunction;
+import org.apache.spark.storage.StorageLevel;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+
+import scala.Tuple2;
+
+import static java.util.stream.Collectors.toList;
+
+/**
+ * A simple index which reads interested fields from parquet and joins with 
incoming records to find the tagged location
+ *
+ * @param 
+ */
+public class HoodieSimpleIndex extends 
HoodieBloomIndex {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieSimpleIndex.class);
+
+  public HoodieSimpleIndex(HoodieWriteConfig config) {
+super(config);
+  }
+
+  /**
+   * Returns an RDD mapping each HoodieKey with a partitionPath/fileID which 
contains it. Option.Empty if the key is not
+   * found.
+   *
+   * @param hoodieKeys  keys to lookup
+   * @param jsc spark context
+   * @param hoodieTable hoodie table object
+   */
+  @Override
+  public JavaPairRDD>> 
fetchRecordLocation(JavaRDD hoodieKeys,
+   
   JavaSparkContext jsc, HoodieTable hoodieTable) {
+JavaPairRDD partitionRecordKeyPairRDD =
+hoodieKeys.mapToPair(key -> new Tuple2<>(key.getPartitionPath(), 
key.getRecordKey()));
+
+// Lookup indexes for all the partition/recordkey pair
+JavaPairRDD recordKeyLocationRDD =
+lookupIndex(partitionRecordKeyPairRDD, jsc, hoodieTable);
+
+JavaPairRDD keyHoodieKeyPairRDD = 
hoodieKeys.mapToPair(key -> new Tuple2<>(key, null));
+
+return 
keyHoodieKeyPairRDD.leftOuterJoin(recordKeyLocationRDD).mapToPair(keyLoc -> {
+  Option> partitionPathFileidPair;
+  if (keyLoc._2._2.isPresent()) {
+partitionPathFileidPair = 
Option.of(Pair.of(keyLoc._1().getPartitionPath(), 
keyLoc._2._2.get().getFileId()));
+  } else {
+partitionPathFileidPair = Option.empty();
+  }
+  return new Tuple2<>(keyLoc._1, partitionPathFileidPair);
+});
+  }
+
+  @Override
+  public JavaRDD> tagLocation(JavaRDD> 
recordRDD, JavaSparkContext jsc,
+  HoodieTable hoodieTable) {
+
+// Step 0: cache the input record RDD
+if (config.getBloomIndexUseCaching()) {
+  recordRDD.persist(config.getBloomIndexInputStorageLevel());
+}
+
+// Step 1: Extract out thinner JavaPairRDD of (partitionPath, recordKey)
+JavaPairRDD partitionRecordKeyPairRDD =
+recordRDD.mapToPair(record -> new Tuple2<>(record.getPartitionPath(), 
record.getRecordKey()));
+
+// Lookup indexes for all the partition/recordkey pair
+JavaPairRDD keyFilenamePairRDD =
+lookupIndex(partitionRecordKeyPairRDD, jsc, hoodieTable);
+
+// Cache the result, for subsequent stages.
+if (config.getBloomIndexUseCaching()) {
+  

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding Simple Index

2020-04-07 Thread GitBox
nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding 
Simple Index
URL: https://github.com/apache/incubator-hudi/pull/1402#discussion_r405053129
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieSimpleIndex.java
 ##
 @@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bloom;
+
+import org.apache.hudi.WriteStatus;
+import org.apache.hudi.common.model.HoodieDataFile;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ParquetUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.Optional;
+import org.apache.spark.api.java.function.PairFunction;
+import org.apache.spark.storage.StorageLevel;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+
+import scala.Tuple2;
+
+import static java.util.stream.Collectors.toList;
+
+/**
+ * A simple index which reads interested fields from parquet and joins with 
incoming records to find the tagged location
+ *
+ * @param 
+ */
+public class HoodieSimpleIndex extends 
HoodieBloomIndex {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieSimpleIndex.class);
+
+  public HoodieSimpleIndex(HoodieWriteConfig config) {
+super(config);
+  }
+
+  /**
+   * Returns an RDD mapping each HoodieKey with a partitionPath/fileID which 
contains it. Option.Empty if the key is not
+   * found.
+   *
+   * @param hoodieKeys  keys to lookup
+   * @param jsc spark context
+   * @param hoodieTable hoodie table object
+   */
+  @Override
+  public JavaPairRDD>> 
fetchRecordLocation(JavaRDD hoodieKeys,
+   
   JavaSparkContext jsc, HoodieTable hoodieTable) {
+JavaPairRDD partitionRecordKeyPairRDD =
+hoodieKeys.mapToPair(key -> new Tuple2<>(key.getPartitionPath(), 
key.getRecordKey()));
+
+// Lookup indexes for all the partition/recordkey pair
+JavaPairRDD recordKeyLocationRDD =
+lookupIndex(partitionRecordKeyPairRDD, jsc, hoodieTable);
+
+JavaPairRDD keyHoodieKeyPairRDD = 
hoodieKeys.mapToPair(key -> new Tuple2<>(key, null));
+
+return 
keyHoodieKeyPairRDD.leftOuterJoin(recordKeyLocationRDD).mapToPair(keyLoc -> {
+  Option> partitionPathFileidPair;
+  if (keyLoc._2._2.isPresent()) {
+partitionPathFileidPair = 
Option.of(Pair.of(keyLoc._1().getPartitionPath(), 
keyLoc._2._2.get().getFileId()));
+  } else {
+partitionPathFileidPair = Option.empty();
+  }
+  return new Tuple2<>(keyLoc._1, partitionPathFileidPair);
+});
+  }
+
+  @Override
+  public JavaRDD> tagLocation(JavaRDD> 
recordRDD, JavaSparkContext jsc,
+  HoodieTable hoodieTable) {
+
+// Step 0: cache the input record RDD
+if (config.getBloomIndexUseCaching()) {
+  recordRDD.persist(config.getBloomIndexInputStorageLevel());
+}
+
+// Step 1: Extract out thinner JavaPairRDD of (partitionPath, recordKey)
+JavaPairRDD partitionRecordKeyPairRDD =
+recordRDD.mapToPair(record -> new Tuple2<>(record.getPartitionPath(), 
record.getRecordKey()));
+
+// Lookup indexes for all the partition/recordkey pair
+JavaPairRDD keyFilenamePairRDD =
+lookupIndex(partitionRecordKeyPairRDD, jsc, hoodieTable);
+
+// Cache the result, for subsequent stages.
+if (config.getBloomIndexUseCaching()) {
+  

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-04-07 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r405053101
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
 ##
 @@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.deltastreamer;
+
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.Config;
+import org.apache.hudi.utilities.schema.SchemaRegistryProvider;
+
+import com.beust.jcommander.JCommander;
+import com.google.common.base.Strings;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Wrapper over HoodieDeltaStreamer.java class.
+ * Helps with ingesting incremental data into hoodie datasets for multiple 
tables.
+ * Currently supports only COPY_ON_WRITE storage type.
+ */
+public class HoodieMultiTableDeltaStreamer {
+
+  private static Logger logger = 
LogManager.getLogger(HoodieMultiTableDeltaStreamer.class);
+
+  private List tableExecutionObjects;
+  private transient JavaSparkContext jssc;
+  private Set successTables;
+  private Set failedTables;
+
+  public HoodieMultiTableDeltaStreamer(String[] args, JavaSparkContext jssc) 
throws IOException {
+this.tableExecutionObjects = new ArrayList<>();
+this.successTables = new HashSet<>();
+this.failedTables = new HashSet<>();
+this.jssc = jssc;
+String commonPropsFile = getCommonPropsFileName(args);
+String configFolder = getConfigFolder(args);
+FileSystem fs = FSUtils.getFs(commonPropsFile, jssc.hadoopConfiguration());
+configFolder = configFolder.charAt(configFolder.length() - 1) == '/' ? 
configFolder.substring(0, configFolder.length() - 1) : configFolder;
+checkIfPropsFileAndConfigFolderExist(commonPropsFile, configFolder, fs);
+TypedProperties properties = UtilHelpers.readConfig(fs, new 
Path(commonPropsFile), new ArrayList<>()).getConfig();
+//get the tables to be ingested and their corresponding config files from 
this properties instance
+populateTableExecutionObjectList(properties, configFolder, fs, args);
+  }
+
+  private void checkIfPropsFileAndConfigFolderExist(String commonPropsFile, 
String configFolder, FileSystem fs) throws IOException {
+if (!fs.exists(new Path(commonPropsFile))) {
+  throw new IllegalArgumentException("Please provide valid common config 
file path!");
+}
+
+if (!fs.exists(new Path(configFolder))) {
+  fs.mkdirs(new Path(configFolder));
+}
+  }
+
+  private void checkIfTableConfigFileExists(String configFolder, FileSystem 
fs, String configFilePath) throws IOException {
+if (!fs.exists(new Path(configFilePath)) || !fs.isFile(new 
Path(configFilePath))) {
+  throw new IllegalArgumentException("Please provide valid table config 
file path!");
+}
+
+Path path = new Path(configFilePath);
+Path filePathInConfigFolder = new Path(configFolder, path.getName());
+if (!fs.exists(filePathInConfigFolder)) {
+  FileUtil.copy(fs, path, fs, filePathInConfigFolder, false, fs.getConf());
+}
+  }
+
+  //commonProps are passed as parameter which contain table to config file 
mapping
+  private void populateTableExecutionObjectList(TypedProperties properties, 
String configFolder, FileSystem fs, String[] args) throws IOException {
+List tablesToBeIngested = getTablesToBeIngested(properties);
+TableExecutionObject executionObject;
+for (String table : tablesToBeIngested) {
+  String[] 

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding Simple Index

2020-04-07 Thread GitBox
nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding 
Simple Index
URL: https://github.com/apache/incubator-hudi/pull/1402#discussion_r405052805
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/index/HoodieIndex.java
 ##
 @@ -77,15 +80,15 @@ protected HoodieIndex(HoodieWriteConfig config) {
* present).
*/
   public abstract JavaRDD> 
tagLocation(JavaRDD> recordRDD, JavaSparkContext jsc,
-  HoodieTable hoodieTable) throws HoodieIndexException;
+   HoodieTable 
hoodieTable) throws HoodieIndexException;
 
 Review comment:
   I am not sure. My previous patches didn't have any such refactoring changes.
I just do the regular IntelliJ refactoring.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] tverdokhlebd edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

2020-04-07 Thread GitBox
tverdokhlebd edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during 
upsert 53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610564674
 
 
   Code:
   
   sparkSession
 .read
 .jdbc(
   url = jdbcConfig.url,
   table = table,
   columnName = "partition",
   lowerBound = 0,
   upperBound = jdbcConfig.partitionsCount.toInt,
   numPartitions = jdbcConfig.partitionsCount.toInt,
   connectionProperties = new Properties() {
 put("driver", jdbcConfig.driver)
 put("user", jdbcConfig.user)
 put("password", jdbcConfig.password)
   }
 )
 .withColumn("year", substring(col(jdbcConfig.dateColumnName), 0, 4))
 .withColumn("month", substring(col(jdbcConfig.dateColumnName), 6, 2))
 .withColumn("day", substring(col(jdbcConfig.dateColumnName), 9, 2))
 .write
 .option(HoodieWriteConfig.TABLE_NAME, hudiConfig.tableName)
 .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, 
hudiConfig.recordKey)
 .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
hudiConfig.precombineKey)
 .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
hudiConfig.partitionPathKey)
 .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
classOf[ComplexKeyGenerator].getName)
 .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
 .option("hoodie.datasource.write.operation", writeOperation)
 .option("hoodie.bulkinsert.shuffle.parallelism", 
hudiConfig.bulkInsertParallelism)
 .option("hoodie.insert.shuffle.parallelism", hudiConfig.parallelism)
 .option("hoodie.upsert.shuffle.parallelism", hudiConfig.parallelism)
 .option("hoodie.cleaner.policy", 
HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name())
 .option("hoodie.cleaner.fileversions.retained", "1")
 .option("hoodie.metrics.graphite.host", hudiConfig.graphiteHost)
 .option("hoodie.metrics.graphite.port", hudiConfig.graphitePort)
 .option("hoodie.metrics.graphite.metric.prefix", 
hudiConfig.graphiteMetricPrefix)
 .format("org.apache.hudi")
 .mode(SaveMode.Append)
 .save(outputPath)
   
   This code is executing on Jenkins, with next parameters:
   
   docker run --rm -v ${PWD}:${PWD} -v /mnt/ml_data:/mnt/ml_data 
bde2020/spark-master:2.4.5-hadoop2.7 \
   bash ./spark/bin/spark-submit \
   --master "local[2]" \
   --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.2-incubating,org.apache.hadoop:hadoop-aws:2.7.3,org.apache.spark:spark-avro_2.11:2.4.4
 \
   --conf spark.local.dir=/mnt/ml_data \
   --conf spark.ui.enabled=false \
   --conf spark.driver.memory=4g \
   --conf spark.driver.memoryOverhead=1024 \
   --conf spark.driver.maxResultSize=2g \
   --conf spark.kryoserializer.buffer.max=512m \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.rdd.compress=true \
   --conf spark.shuffle.service.enabled=true \
   --conf spark.sql.hive.convertMetastoreParquet=false \
   --conf spark.hadoop.fs.defaultFS=s3a://ir-mtu-ml-bucket/ml_hudi \
   --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
   --conf spark.hadoop.fs.s3a.access.key=${AWS_ACCESS_KEY_ID} \
   --conf spark.hadoop.fs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY} \
   --conf spark.executorEnv.period.startDate=${date} \
   --conf spark.executorEnv.period.numDays=${numDays} \
   --conf spark.executorEnv.jdbc.url=${VERTICA_URL} \
   --conf spark.executorEnv.jdbc.user=${VERTICA_USER} \
   --conf spark.executorEnv.jdbc.password=${VERTICA_PWD} \
   --conf spark.executorEnv.jdbc.driver=${VERTICA_DRIVER}\
   --conf spark.executorEnv.jdbc.schemaName=mtu_owner \
   --conf spark.executorEnv.jdbc.tableName=ext_ml_data \
   --conf spark.executorEnv.jdbc.dateColumnName=hit_date \
   --conf spark.executorEnv.jdbc.partitionColumnName=hit_timestamp \
   --conf spark.executorEnv.jdbc.partitionsCount=8 \
   --conf spark.executorEnv.hudi.outputPath=s3a://ir-mtu-ml-bucket/ml_hudi \
   --conf spark.executorEnv.hudi.tableName=ext_ml_data \
   --conf spark.executorEnv.hudi.recordKey=tds_cid \
   --conf spark.executorEnv.hudi.precombineKey=hit_timestamp \
   --conf spark.executorEnv.hudi.parallelism=8 \
   --conf spark.executorEnv.hudi.bulkInsertParallelism=8 \
   --class mtu.spark.analytics.ExtMLDataToS3 \
   ${PWD}/ml-vertica-to-s3-hudi.jar
   
   I try to move 53 million records (The table contains 48 columns) from the 
Vertica database to s3 storage.
   Operation "bulk_insert" successfully completes and takes about 40-50 minutes.
   Operation "upsert" on the same records throws exceptions with OOM.
   
   On Hudi 0.5.1 "upsert" operations were hanging. I found the issue 
https://github.com/apache/incubator-hudi/issues/1328 and updated Hudi to 0.5.2. 
The problem with hanging, it seems to me, was resolved.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to 

[GitHub] [incubator-hudi] tverdokhlebd edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

2020-04-07 Thread GitBox
tverdokhlebd edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during 
upsert 53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610564674
 
 
   Code:
   
   sparkSession
 .read
 .jdbc(
   url = jdbcConfig.url,
   table = table,
   columnName = "partition",
   lowerBound = 0,
   upperBound = jdbcConfig.partitionsCount.toInt,
   numPartitions = jdbcConfig.partitionsCount.toInt,
   connectionProperties = new Properties() {
 put("driver", jdbcConfig.driver)
 put("user", jdbcConfig.user)
 put("password", jdbcConfig.password)
   }
 )
 .withColumn("year", substring(col(jdbcConfig.dateColumnName), 0, 4))
 .withColumn("month", substring(col(jdbcConfig.dateColumnName), 6, 2))
 .withColumn("day", substring(col(jdbcConfig.dateColumnName), 9, 2))
 .write
 .option(HoodieWriteConfig.TABLE_NAME, hudiConfig.tableName)
 .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, 
hudiConfig.recordKey)
 .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
hudiConfig.precombineKey)
 .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
hudiConfig.partitionPathKey)
 .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
classOf[ComplexKeyGenerator].getName)
 .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
 .option("hoodie.datasource.write.operation", writeOperation)
 .option("hoodie.bulkinsert.shuffle.parallelism", 
hudiConfig.bulkInsertParallelism)
 .option("hoodie.insert.shuffle.parallelism", hudiConfig.parallelism)
 .option("hoodie.upsert.shuffle.parallelism", hudiConfig.parallelism)
 .option("hoodie.cleaner.policy", 
HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name())
 .option("hoodie.cleaner.fileversions.retained", "1")
 .option("hoodie.metrics.graphite.host", hudiConfig.graphiteHost)
 .option("hoodie.metrics.graphite.port", hudiConfig.graphitePort)
 .option("hoodie.metrics.graphite.metric.prefix", 
hudiConfig.graphiteMetricPrefix)
 .format("org.apache.hudi")
 .mode(SaveMode.Append)
 .save(outputPath)
   
   This code is executing on Jenkins, with next parameters:
   
   docker run --rm -v ${PWD}:${PWD} -v /mnt/ml_data:/mnt/ml_data 
bde2020/spark-master:2.4.5-hadoop2.7 \
   bash ./spark/bin/spark-submit \
   --master "local[2]" \
   --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.2-incubating,org.apache.hadoop:hadoop-aws:2.7.3,org.apache.spark:spark-avro_2.11:2.4.4
 \
   --conf spark.local.dir=/mnt/ml_data \
   --conf spark.ui.enabled=false \
   --conf spark.driver.memory=4g \
   --conf spark.driver.memoryOverhead=1024 \
   --conf spark.driver.maxResultSize=2g \
   --conf spark.kryoserializer.buffer.max=512m \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.rdd.compress=true \
   --conf spark.shuffle.service.enabled=true \
   --conf spark.sql.hive.convertMetastoreParquet=false \
   --conf spark.hadoop.fs.defaultFS=s3a://ir-mtu-ml-bucket/ml_hudi \
   --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
   --conf spark.hadoop.fs.s3a.access.key=${AWS_ACCESS_KEY_ID} \
   --conf spark.hadoop.fs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY} \
   --conf spark.executorEnv.period.startDate=${date} \
   --conf spark.executorEnv.period.numDays=${numDays} \
   --conf spark.executorEnv.jdbc.url=${VERTICA_URL} \
   --conf spark.executorEnv.jdbc.user=${VERTICA_USER} \
   --conf spark.executorEnv.jdbc.password=${VERTICA_PWD} \
   --conf spark.executorEnv.jdbc.driver=${VERTICA_DRIVER}\
   --conf spark.executorEnv.jdbc.schemaName=mtu_owner \
   --conf spark.executorEnv.jdbc.tableName=ext_ml_data \
   --conf spark.executorEnv.jdbc.dateColumnName=hit_date \
   --conf spark.executorEnv.jdbc.partitionColumnName=hit_timestamp \
   --conf spark.executorEnv.jdbc.partitionsCount=8 \
   --conf spark.executorEnv.hudi.outputPath=s3a://ir-mtu-ml-bucket/ml_hudi \
   --conf spark.executorEnv.hudi.tableName=ext_ml_data \
   --conf spark.executorEnv.hudi.recordKey=tds_cid \
   --conf spark.executorEnv.hudi.precombineKey=hit_timestamp \
   --conf spark.executorEnv.hudi.parallelism=8 \
   --conf spark.executorEnv.hudi.bulkInsertParallelism=8 \
   --class mtu.spark.analytics.ExtMLDataToS3 \
   ${PWD}/ml-vertica-to-s3-hudi.jar
   
   I try to move 53 million records (The table contains 48 columns) from the 
Vertica database to s3 storage.
   Operation "bulk_insert" successfully completes and take about 40-50 minutes.
   Operation "upsert" on the same records throws exceptions with OOM.
   
   On Hudi 0.5.1 "upsert" operations were hanging. I found the issue 
https://github.com/apache/incubator-hudi/issues/1328 and updated Hudi to 0.5.2. 
The problem, it seems to me, was resolved.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the

[GitHub] [incubator-hudi] tverdokhlebd edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

2020-04-07 Thread GitBox
tverdokhlebd edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during 
upsert 53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610564674
 
 
   Code:
   
   sparkSession
 .read
 .jdbc(
   url = jdbcConfig.url,
   table = table,
   columnName = "partition",
   lowerBound = 0,
   upperBound = jdbcConfig.partitionsCount.toInt,
   numPartitions = jdbcConfig.partitionsCount.toInt,
   connectionProperties = new Properties() {
 put("driver", jdbcConfig.driver)
 put("user", jdbcConfig.user)
 put("password", jdbcConfig.password)
   }
 )
 .withColumn("year", substring(col(jdbcConfig.dateColumnName), 0, 4))
 .withColumn("month", substring(col(jdbcConfig.dateColumnName), 6, 2))
 .withColumn("day", substring(col(jdbcConfig.dateColumnName), 9, 2))
 .write
 .option(HoodieWriteConfig.TABLE_NAME, hudiConfig.tableName)
 .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, 
hudiConfig.recordKey)
 .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
hudiConfig.precombineKey)
 .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
hudiConfig.partitionPathKey)
 .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
classOf[ComplexKeyGenerator].getName)
 .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
 .option("hoodie.datasource.write.operation", writeOperation)
 .option("hoodie.bulkinsert.shuffle.parallelism", 
hudiConfig.bulkInsertParallelism)
 .option("hoodie.insert.shuffle.parallelism", hudiConfig.parallelism)
 .option("hoodie.upsert.shuffle.parallelism", hudiConfig.parallelism)
 .option("hoodie.cleaner.policy", 
HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name())
 .option("hoodie.cleaner.fileversions.retained", "1")
 .option("hoodie.metrics.graphite.host", hudiConfig.graphiteHost)
 .option("hoodie.metrics.graphite.port", hudiConfig.graphitePort)
 .option("hoodie.metrics.graphite.metric.prefix", 
hudiConfig.graphiteMetricPrefix)
 .format("org.apache.hudi")
 .mode(SaveMode.Append)
 .save(outputPath)
   
   This code is executing on Jenkins, with next parameters:
   
   docker run --rm -v ${PWD}:${PWD} -v /mnt/ml_data:/mnt/ml_data 
bde2020/spark-master:2.4.5-hadoop2.7 \
   bash ./spark/bin/spark-submit \
   --master "local[2]" \
   --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.2-incubating,org.apache.hadoop:hadoop-aws:2.7.3,org.apache.spark:spark-avro_2.11:2.4.4
 \
   --conf spark.local.dir=/mnt/ml_data \
   --conf spark.ui.enabled=false \
   --conf spark.driver.memory=4g \
   --conf spark.driver.memoryOverhead=1024 \
   --conf spark.driver.maxResultSize=2g \
   --conf spark.kryoserializer.buffer.max=512m \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.rdd.compress=true \
   --conf spark.shuffle.service.enabled=true \
   --conf spark.sql.hive.convertMetastoreParquet=false \
   --conf spark.hadoop.fs.defaultFS=s3a://ir-mtu-ml-bucket/ml_hudi \
   --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
   --conf spark.hadoop.fs.s3a.access.key=${AWS_ACCESS_KEY_ID} \
   --conf spark.hadoop.fs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY} \
   --conf spark.executorEnv.period.startDate=${date} \
   --conf spark.executorEnv.period.numDays=${numDays} \
   --conf spark.executorEnv.jdbc.url=${VERTICA_URL} \
   --conf spark.executorEnv.jdbc.user=${VERTICA_USER} \
   --conf spark.executorEnv.jdbc.password=${VERTICA_PWD} \
   --conf spark.executorEnv.jdbc.driver=${VERTICA_DRIVER}\
   --conf spark.executorEnv.jdbc.schemaName=mtu_owner \
   --conf spark.executorEnv.jdbc.tableName=ext_ml_data \
   --conf spark.executorEnv.jdbc.dateColumnName=hit_date \
   --conf spark.executorEnv.jdbc.partitionColumnName=hit_timestamp \
   --conf spark.executorEnv.jdbc.partitionsCount=8 \
   --conf spark.executorEnv.hudi.outputPath=s3a://ir-mtu-ml-bucket/ml_hudi \
   --conf spark.executorEnv.hudi.tableName=ext_ml_data \
   --conf spark.executorEnv.hudi.recordKey=tds_cid \
   --conf spark.executorEnv.hudi.precombineKey=hit_timestamp \
   --conf spark.executorEnv.hudi.parallelism=8 \
   --conf spark.executorEnv.hudi.bulkInsertParallelism=8 \
   --class mtu.spark.analytics.ExtMLDataToS3 \
   ${PWD}/ml-vertica-to-s3-hudi.jar
   
   I am trying to move 53 million records (the table contains 48 columns) from the 
Vertica database to S3 storage.
   The "bulk_insert" operation completes successfully and takes about 40-50 minutes.
   The "upsert" operation on the same records throws OOM exceptions.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] tverdokhlebd commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

2020-04-07 Thread GitBox
tverdokhlebd commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 
53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610564674
 
 
   Code:
   ```
   sparkSession
 .read
 .jdbc(
   url = jdbcConfig.url,
   table = table,
   columnName = "partition",
   lowerBound = 0,
   upperBound = jdbcConfig.partitionsCount.toInt,
   numPartitions = jdbcConfig.partitionsCount.toInt,
   connectionProperties = new Properties() {
 put("driver", jdbcConfig.driver)
 put("user", jdbcConfig.user)
 put("password", jdbcConfig.password)
   }
 )
 .withColumn("year", substring(col(jdbcConfig.dateColumnName), 0, 4))
 .withColumn("month", substring(col(jdbcConfig.dateColumnName), 6, 2))
 .withColumn("day", substring(col(jdbcConfig.dateColumnName), 9, 2))
 .write
 .option(HoodieWriteConfig.TABLE_NAME, hudiConfig.tableName)
 .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, 
hudiConfig.recordKey)
 .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
hudiConfig.precombineKey)
 .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
hudiConfig.partitionPathKey)
 .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
classOf[ComplexKeyGenerator].getName)
 .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
 .option("hoodie.datasource.write.operation", writeOperation)
 .option("hoodie.bulkinsert.shuffle.parallelism", 
hudiConfig.bulkInsertParallelism)
 .option("hoodie.insert.shuffle.parallelism", hudiConfig.parallelism)
 .option("hoodie.upsert.shuffle.parallelism", hudiConfig.parallelism)
 .option("hoodie.cleaner.policy", 
HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name())
 .option("hoodie.cleaner.fileversions.retained", "1")
 .option("hoodie.metrics.graphite.host", hudiConfig.graphiteHost)
 .option("hoodie.metrics.graphite.port", hudiConfig.graphitePort)
 .option("hoodie.metrics.graphite.metric.prefix", 
hudiConfig.graphiteMetricPrefix)
 .format("org.apache.hudi")
 .mode(SaveMode.Append)
 .save(outputPath)
   ```
   This code is executed on Jenkins with the following parameters:
   ```
   docker run --rm -v ${PWD}:${PWD} -v /mnt/ml_data:/mnt/ml_data 
bde2020/spark-master:2.4.5-hadoop2.7 \
   bash ./spark/bin/spark-submit \
   --master "local[2]" \
   --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.2-incubating,org.apache.hadoop:hadoop-aws:2.7.3,org.apache.spark:spark-avro_2.11:2.4.4
 \
   --conf spark.local.dir=/mnt/ml_data \
   --conf spark.ui.enabled=false \
   --conf spark.driver.memory=4g \
   --conf spark.driver.memoryOverhead=1024 \
   --conf spark.driver.maxResultSize=2g \
   --conf spark.kryoserializer.buffer.max=512m \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.rdd.compress=true \
   --conf spark.shuffle.service.enabled=true \
   --conf spark.sql.hive.convertMetastoreParquet=false \
   --conf spark.hadoop.fs.defaultFS=s3a://ir-mtu-ml-bucket/ml_hudi \
   --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
   --conf spark.hadoop.fs.s3a.access.key=${AWS_ACCESS_KEY_ID} \
   --conf spark.hadoop.fs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY} \
   --conf spark.executorEnv.period.startDate=${date} \
   --conf spark.executorEnv.period.numDays=${numDays} \
   --conf spark.executorEnv.jdbc.url=${VERTICA_URL} \
   --conf spark.executorEnv.jdbc.user=${VERTICA_USER} \
   --conf spark.executorEnv.jdbc.password=${VERTICA_PWD} \
   --conf spark.executorEnv.jdbc.driver=${VERTICA_DRIVER}\
   --conf spark.executorEnv.jdbc.schemaName=mtu_owner \
   --conf spark.executorEnv.jdbc.tableName=ext_ml_data \
   --conf spark.executorEnv.jdbc.dateColumnName=hit_date \
   --conf spark.executorEnv.jdbc.partitionColumnName=hit_timestamp \
   --conf spark.executorEnv.jdbc.partitionsCount=8 \
   --conf spark.executorEnv.hudi.outputPath=s3a://ir-mtu-ml-bucket/ml_hudi \
   --conf spark.executorEnv.hudi.tableName=ext_ml_data \
   --conf spark.executorEnv.hudi.recordKey=tds_cid \
   --conf spark.executorEnv.hudi.precombineKey=hit_timestamp \
   --conf spark.executorEnv.hudi.parallelism=8 \
   --conf spark.executorEnv.hudi.bulkInsertParallelism=8 \
   --class mtu.spark.analytics.ExtMLDataToS3 \
   ${PWD}/ml-vertica-to-s3-hudi.jar
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] tverdokhlebd edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

2020-04-07 Thread GitBox
tverdokhlebd edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during 
upsert 53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610564674
 
 
   Code:
   
   sparkSession
 .read
 .jdbc(
   url = jdbcConfig.url,
   table = table,
   columnName = "partition",
   lowerBound = 0,
   upperBound = jdbcConfig.partitionsCount.toInt,
   numPartitions = jdbcConfig.partitionsCount.toInt,
   connectionProperties = new Properties() {
 put("driver", jdbcConfig.driver)
 put("user", jdbcConfig.user)
 put("password", jdbcConfig.password)
   }
 )
 .withColumn("year", substring(col(jdbcConfig.dateColumnName), 0, 4))
 .withColumn("month", substring(col(jdbcConfig.dateColumnName), 6, 2))
 .withColumn("day", substring(col(jdbcConfig.dateColumnName), 9, 2))
 .write
 .option(HoodieWriteConfig.TABLE_NAME, hudiConfig.tableName)
 .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, 
hudiConfig.recordKey)
 .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
hudiConfig.precombineKey)
 .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
hudiConfig.partitionPathKey)
 .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
classOf[ComplexKeyGenerator].getName)
 .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
 .option("hoodie.datasource.write.operation", writeOperation)
 .option("hoodie.bulkinsert.shuffle.parallelism", 
hudiConfig.bulkInsertParallelism)
 .option("hoodie.insert.shuffle.parallelism", hudiConfig.parallelism)
 .option("hoodie.upsert.shuffle.parallelism", hudiConfig.parallelism)
 .option("hoodie.cleaner.policy", 
HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name())
 .option("hoodie.cleaner.fileversions.retained", "1")
 .option("hoodie.metrics.graphite.host", hudiConfig.graphiteHost)
 .option("hoodie.metrics.graphite.port", hudiConfig.graphitePort)
 .option("hoodie.metrics.graphite.metric.prefix", 
hudiConfig.graphiteMetricPrefix)
 .format("org.apache.hudi")
 .mode(SaveMode.Append)
 .save(outputPath)
   
   This code is executed on Jenkins with the following parameters:
   
   docker run --rm -v ${PWD}:${PWD} -v /mnt/ml_data:/mnt/ml_data 
bde2020/spark-master:2.4.5-hadoop2.7 \
   bash ./spark/bin/spark-submit \
   --master "local[2]" \
   --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.2-incubating,org.apache.hadoop:hadoop-aws:2.7.3,org.apache.spark:spark-avro_2.11:2.4.4
 \
   --conf spark.local.dir=/mnt/ml_data \
   --conf spark.ui.enabled=false \
   --conf spark.driver.memory=4g \
   --conf spark.driver.memoryOverhead=1024 \
   --conf spark.driver.maxResultSize=2g \
   --conf spark.kryoserializer.buffer.max=512m \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.rdd.compress=true \
   --conf spark.shuffle.service.enabled=true \
   --conf spark.sql.hive.convertMetastoreParquet=false \
   --conf spark.hadoop.fs.defaultFS=s3a://ir-mtu-ml-bucket/ml_hudi \
   --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
   --conf spark.hadoop.fs.s3a.access.key=${AWS_ACCESS_KEY_ID} \
   --conf spark.hadoop.fs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY} \
   --conf spark.executorEnv.period.startDate=${date} \
   --conf spark.executorEnv.period.numDays=${numDays} \
   --conf spark.executorEnv.jdbc.url=${VERTICA_URL} \
   --conf spark.executorEnv.jdbc.user=${VERTICA_USER} \
   --conf spark.executorEnv.jdbc.password=${VERTICA_PWD} \
   --conf spark.executorEnv.jdbc.driver=${VERTICA_DRIVER}\
   --conf spark.executorEnv.jdbc.schemaName=mtu_owner \
   --conf spark.executorEnv.jdbc.tableName=ext_ml_data \
   --conf spark.executorEnv.jdbc.dateColumnName=hit_date \
   --conf spark.executorEnv.jdbc.partitionColumnName=hit_timestamp \
   --conf spark.executorEnv.jdbc.partitionsCount=8 \
   --conf spark.executorEnv.hudi.outputPath=s3a://ir-mtu-ml-bucket/ml_hudi \
   --conf spark.executorEnv.hudi.tableName=ext_ml_data \
   --conf spark.executorEnv.hudi.recordKey=tds_cid \
   --conf spark.executorEnv.hudi.precombineKey=hit_timestamp \
   --conf spark.executorEnv.hudi.parallelism=8 \
   --conf spark.executorEnv.hudi.bulkInsertParallelism=8 \
   --class mtu.spark.analytics.ExtMLDataToS3 \
   ${PWD}/ml-vertica-to-s3-hudi.jar


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-04-07 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r405005385
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -169,12 +180,16 @@ public Operation convert(String value) throws 
ParameterException {
 required = true)
 public String targetBasePath;
 
+// TODO: How to obtain hive configs to register?
 @Parameter(names = {"--target-table"}, description = "name of the target 
table in Hive", required = true)
 public String targetTableName;
 
 @Parameter(names = {"--table-type"}, description = "Type of table. 
COPY_ON_WRITE (or) MERGE_ON_READ", required = true)
 public String tableType;
 
+@Parameter(names = {"--config-folder"}, description = "Path to folder 
which contains all the properties file", required = true)
 
 Review comment:
   Done. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-105) DeltaStreamer Kafka Ingestion does not handle invalid offsets

2020-04-07 Thread Hoang Ngo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077475#comment-17077475
 ] 

Hoang Ngo commented on HUDI-105:


Hi [~vinoth],

Do you know if this fix is applied in Hudi 0.5.0? I have the same problem here. 
This is my spark-submit:

spark-submit --conf 
'spark.jars=/usr/lib/hudi/hudi-hadoop-mr-bundle-0.5.0-incubating.jar,/usr/lib/hudi/hudi-spark-bundle-0.5.0-incubating.jar,/usr/lib/hudi/hudi-utilities-bundle-0.5.0-incubating.jar'
 \
 --num-executors 1 \
 --executor-memory 2g \
 --driver-memory 2g \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
 --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls 
/usr/lib/hudi/hudi-utilities-bundle-*.jar` \
 --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
 --storage-type COPY_ON_WRITE \
 --target-base-path s3://mybucket/tmp/hudidata-delta2/ \
 --target-table hudidata-delta2 \
 --props s3://mybucket/tmp/emr/hudipoc/a.properties \
 --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider

 

 

This is my trace:
Exception in thread "main" org.apache.spark.SparkException: Offsets not 
available on leader: OffsetRange(topic: 'mytopic', partition: 1, range: [0 -> 
25447]),OffsetRange(topic: 'mytopic', partition: 2, range: [0 -> 
115661]),OffsetRange(topic: 'mytopic', partition: 3, range: [0 -> 
42775]),OffsetRange(topic: 'mytopic', partition: 7, range: [0 -> 
115661]),OffsetRange(topic: 'mytopic', partition: 8, range: [0 -> 
115661]),OffsetRange(topic: 'mytopic', partition: 13, range: [0 -> 
115661]),OffsetRange(topic: 'mytopic', partition: 16, range: [0 -> 
115661]),OffsetRange(topic: 'mytopic', partition: 17, range: [0 -> 
115661]),OffsetRange(topic: 'mytopic', partition: 18, range: [0 -> 
6494]),OffsetRange(topic: 'mytopic', partition: 20, range: [0 -> 
115661]),OffsetRange(topic: 'mytopic', partition: 21, range: [0 -> 115657])
at 
org.apache.spark.streaming.kafka.KafkaUtils$.org$apache$spark$streaming$kafka$KafkaUtils$$checkOffsets(KafkaUtils.scala:201)
at 
org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createRDD$1.apply(KafkaUtils.scala:254)
at 
org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createRDD$1.apply(KafkaUtils.scala:250)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:699)
at org.apache.spark.streaming.kafka.KafkaUtils$.createRDD(KafkaUtils.scala:250)
at 
org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createRDD$3.apply(KafkaUtils.scala:339)
at 
org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createRDD$3.apply(KafkaUtils.scala:334)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:699)
at org.apache.spark.streaming.kafka.KafkaUtils$.createRDD(KafkaUtils.scala:334)
at org.apache.spark.streaming.kafka.KafkaUtils.createRDD(KafkaUtils.scala)
at 
org.apache.hudi.utilities.sources.AvroKafkaSource.toRDD(AvroKafkaSource.java:67)
at 
org.apache.hudi.utilities.sources.AvroKafkaSource.fetchNewData(AvroKafkaSource.java:61)
at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:71)
at 
org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInAvroFormat(SourceFormatAdapter.java:62)
at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:292)
at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:214)
at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:120)
at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:292)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/04/07 17:45:12 INFO ShutdownHookManager: Shutdown hook called

> DeltaStreamer Kafka Ingestion does not handle invalid offsets
> 
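A minimal sketch of a possible workaround (not a confirmed fix for HUDI-105): the "Offsets not available on leader" error shows up when the checkpointed offsets have already been removed by Kafka retention, and letting the consumer reset to the earliest or latest available offset can unblock ingestion. This assumes the DeltaStreamer properties file passes standard Kafka consumer configs straight through; the topic key name below is an assumption to verify against the release in use.

```java
import org.apache.hudi.common.util.TypedProperties;

// Hedged sketch: properties one might add to a.properties to recover from stale offsets.
public class KafkaOffsetResetProps {

  public static TypedProperties build() {
    TypedProperties props = new TypedProperties();
    // Assumed key name for the source topic in this release line.
    props.setProperty("hoodie.deltastreamer.source.kafka.topic", "mytopic");
    // Standard Kafka consumer config; the older 0.8-style consumer uses "smallest"/"largest" instead.
    props.setProperty("auto.offset.reset", "earliest");
    return props;
  }
}
```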

[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-04-07 Thread GitBox
bvaradar commented on a change in pull request #1150: [HUDI-288]: Add support 
for ingesting multiple kafka streams in a single DeltaStreamer deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r404990360
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -169,12 +180,16 @@ public Operation convert(String value) throws 
ParameterException {
 required = true)
 public String targetBasePath;
 
+// TODO: How to obtain hive configs to register?
 @Parameter(names = {"--target-table"}, description = "name of the target 
table in Hive", required = true)
 public String targetTableName;
 
 @Parameter(names = {"--table-type"}, description = "Type of table. 
COPY_ON_WRITE (or) MERGE_ON_READ", required = true)
 public String tableType;
 
+@Parameter(names = {"--config-folder"}, description = "Path to folder 
which contains all the properties file", required = true)
 
 Review comment:
   Can we remove this ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-04-07 Thread GitBox
bvaradar commented on a change in pull request #1150: [HUDI-288]: Add support 
for ingesting multiple kafka streams in a single DeltaStreamer deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r404992549
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
 ##
 @@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.deltastreamer;
+
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.Config;
+import org.apache.hudi.utilities.schema.SchemaRegistryProvider;
+
+import com.beust.jcommander.JCommander;
+import com.google.common.base.Strings;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Wrapper over HoodieDeltaStreamer.java class.
+ * Helps with ingesting incremental data into hoodie datasets for multiple 
tables.
+ * Currently supports only COPY_ON_WRITE storage type.
+ */
+public class HoodieMultiTableDeltaStreamer {
+
+  private static Logger logger = 
LogManager.getLogger(HoodieMultiTableDeltaStreamer.class);
+
+  private List<TableExecutionObject> tableExecutionObjects;
+  private transient JavaSparkContext jssc;
+  private Set<String> successTables;
+  private Set<String> failedTables;
+
+  public HoodieMultiTableDeltaStreamer(String[] args, JavaSparkContext jssc) 
throws IOException {
+this.tableExecutionObjects = new ArrayList<>();
+this.successTables = new HashSet<>();
+this.failedTables = new HashSet<>();
+this.jssc = jssc;
+String commonPropsFile = getCommonPropsFileName(args);
+String configFolder = getConfigFolder(args);
+FileSystem fs = FSUtils.getFs(commonPropsFile, jssc.hadoopConfiguration());
+configFolder = configFolder.charAt(configFolder.length() - 1) == '/' ? 
configFolder.substring(0, configFolder.length() - 1) : configFolder;
+checkIfPropsFileAndConfigFolderExist(commonPropsFile, configFolder, fs);
+TypedProperties properties = UtilHelpers.readConfig(fs, new 
Path(commonPropsFile), new ArrayList<>()).getConfig();
+//get the tables to be ingested and their corresponding config files from 
this properties instance
+populateTableExecutionObjectList(properties, configFolder, fs, args);
+  }
+
+  private void checkIfPropsFileAndConfigFolderExist(String commonPropsFile, 
String configFolder, FileSystem fs) throws IOException {
+if (!fs.exists(new Path(commonPropsFile))) {
+  throw new IllegalArgumentException("Please provide valid common config 
file path!");
+}
+
+if (!fs.exists(new Path(configFolder))) {
+  fs.mkdirs(new Path(configFolder));
+}
+  }
+
+  private void checkIfTableConfigFileExists(String configFolder, FileSystem 
fs, String configFilePath) throws IOException {
+if (!fs.exists(new Path(configFilePath)) || !fs.isFile(new 
Path(configFilePath))) {
+  throw new IllegalArgumentException("Please provide valid table config 
file path!");
+}
+
+Path path = new Path(configFilePath);
+Path filePathInConfigFolder = new Path(configFolder, path.getName());
+if (!fs.exists(filePathInConfigFolder)) {
+  FileUtil.copy(fs, path, fs, filePathInConfigFolder, false, fs.getConf());
+}
+  }
+
+  //commonProps are passed as parameter which contain table to config file 
mapping
+  private void populateTableExecutionObjectList(TypedProperties properties, 
String configFolder, FileSystem fs, String[] args) throws IOException {
+List<String> tablesToBeIngested = getTablesToBeIngested(properties);
+TableExecutionObject executionObject;
+for (String table : tablesToBeIngested) {
+  String[] 

[GitHub] [incubator-hudi] vinothchandar commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

2020-04-07 Thread GitBox
vinothchandar commented on issue #1491: [SUPPORT] OutOfMemoryError during 
upsert 53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610504319
 
 
   Is this real data, or can you share a reproducible snippet of code? 
Especially with these local microbenchmarks, it's useful to understand what is going on, since the 
small costs that typically don't matter on a real cluster tend to get 
amplified.. 
   
   From the logs, it seems like 
   1) bulk_insert is succeeding and upsert is what's failing... and it's 
failing during the write phase, when we actually allocate some memory to do the 
merge.. 
   
   
   2) From the logs below, it seems like you have a lot of data, potentially too much for 
a single node.. How much total data do you have in those 53M records? (That's a 
key metric for runtime, more than the number of records. Hudi does not have a 
maximum records limit etc. per se.)
   
   ```
   20/04/07 08:02:55 INFO ExternalAppendOnlyMap: Thread 136 spilling in-memory 
map of 1325.4 MB to disk (1 time so far)
   20/04/07 08:03:04 INFO ExternalAppendOnlyMap: Thread 137 spilling in-memory 
map of 1329.9 MB to disk (1 time so far)
   20/04/07 08:03:04 INFO ExternalAppendOnlyMap: Thread 135 spilling in-memory 
map of 1325.7 MB to disk (1 time so far)
   20/04/07 08:03:07 INFO ExternalAppendOnlyMap: Thread 47 spilling in-memory 
map of 1385.6 MB to disk (1 time so far)
   20/04/07 08:03:25 INFO ExternalAppendOnlyMap: Thread 136 spilling in-memory 
map of 1325.4 MB to disk (2 times so far)
   20/04/07 08:03:41 INFO ExternalAppendOnlyMap: Thread 137 spilling in-memory 
map of 1325.5 MB to disk (2 times so far)
   20/04/07 08:03:43 INFO ExternalAppendOnlyMap: Thread 135 spilling in-memory 
map of 1325.4 MB to disk (2 times so far)
   20/04/07 08:03:58 INFO ExternalAppendOnlyMap: Thread 47 spilling in-memory 
map of 1381.4 MB to disk (2 times so far)
   20/04/07 08:04:08 INFO ExternalAppendOnlyMap: Thread 136 spilling in-memory 
map of 1325.4 MB to disk (3 times so far)
   20/04/07 08:04:24 INFO ExternalAppendOnlyMap: Thread 137 spilling in-memory 
map of 1325.4 MB to disk (3 times so far)
   20/04/07 08:04:28 INFO ExternalAppendOnlyMap: Thread 135 spilling in-memory 
map of 1327.7 MB to disk (3 times so far)
   20/04/07 08:04:57 INFO ExternalAppendOnlyMap: Thread 136 spilling in-memory 
map of 1325.4 MB to disk (4 times so far)
   20/04/07 08:04:59 INFO ExternalAppendOnlyMap: Thread 47 spilling in-memory 
map of 1491.8 MB to disk (3 times so far)
   20/04/07 08:05:14 INFO ExternalAppendOnlyMap: Thread 137 spilling in-memory 
map of 1363.9 MB to disk (4 times so far)
   20/04/07 08:05:16 INFO ExternalAppendOnlyMap: Thread 135 spilling in-memory 
map of 1325.4 MB to disk (4 times so far)
   20/04/07 08:05:47 INFO ExternalAppendOnlyMap: Thread 47 spilling in-memory 
map of 1349.8 MB to disk (4 times so far)
   20/04/07 08:06:05 INFO ExternalAppendOnlyMap: Thread 137 spilling in-memory 
map of 1300.9 MB to disk (5 times so far)
   ```
   
   I suspect what's happening is that Spark memory is actually full (Hudi 
caches the input to derive the workload profile etc., and it's typically advised to keep the input 
data in memory) and it keeps spilling to disk, slowing everything down.. (more 
of a Spark tuning thing)... But things don't break until Hudi tries to allocate 
some memory on its own, at which point the heap is full.. 
   
   Can you give this a shot on a cluster?
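   A back-of-envelope sizing sketch to make that point concrete (the 200-bytes-per-row figure is purely an assumption; substitute the real average row size of the 48-column table):

```java
// Rough estimate of how much raw input a 4 GB local[2] driver is being asked to hold.
public class SizingEstimate {

  public static void main(String[] args) {
    long rows = 53_000_000L;
    long assumedBytesPerRow = 200L; // assumption, not measured
    double gb = rows * assumedBytesPerRow / 1e9;
    System.out.printf("~%.1f GB of raw input vs. a 4 GB local driver%n", gb);
  }
}
```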
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hddong commented on issue #1452: [HUDI-740]Fix can not specify the sparkMaster and code clean for SparkUtil

2020-04-07 Thread GitBox
hddong commented on issue #1452: [HUDI-740]Fix can not specify the sparkMaster 
and code clean for SparkUtil
URL: https://github.com/apache/incubator-hudi/pull/1452#issuecomment-610457107
 
 
   > again, is this PR not only for `clean` command?
   
   Sorry for changing it late :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-740) Fix can not specify the sparkMaster and code clean for SparkUtil

2020-04-07 Thread hong dongdong (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hong dongdong updated HUDI-740:
---
Summary: Fix can not specify the sparkMaster and code clean for SparkUtil  
(was: [HUDI-740]Fix can not specify the sparkMaster of clean and compact  
commands)

> Fix can not specify the sparkMaster and code clean for SparkUtil
> 
>
> Key: HUDI-740
> URL: https://issues.apache.org/jira/browse/HUDI-740
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: CLI
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Now, we can specify the sparkMaster of the cleans run command, but it does not work. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] hddong commented on a change in pull request #1452: [HUDI-740]Fix can not specify the sparkMaster of clean and compact commands

2020-04-07 Thread GitBox
hddong commented on a change in pull request #1452: [HUDI-740]Fix can not 
specify the sparkMaster of clean and compact  commands
URL: https://github.com/apache/incubator-hudi/pull/1452#discussion_r404900670
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/utils/SparkUtil.java
 ##
 @@ -61,9 +62,14 @@ public static SparkLauncher initLauncher(String 
propertiesFile) throws URISyntax
   }
 
   public static JavaSparkContext initJavaSparkConf(String name) {
+return initJavaSparkConf(name, Option.empty(), Option.empty());
+  }
+
+  public static JavaSparkContext initJavaSparkConf(String name, Option<String> master,
+   Option<String> executorMemory) {
 SparkConf sparkConf = new SparkConf().setAppName(name);
 
-String defMasterFromEnv = sparkConf.getenv("SPARK_MASTER");
+String defMasterFromEnv = master.orElse(sparkConf.getenv("SPARK_MASTER"));
 
 Review comment:
   @yanghua yes, it is necessary. How about adding a file `HoodieCliSparkConfig`?
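   A minimal sketch of what such a constants file could look like (names are illustrative, not the committed `HoodieCliSparkConfig`):

```java
// Hypothetical constants holder for the string literals currently scattered in SparkUtil.
public final class HoodieCliSparkConfig {

  private HoodieCliSparkConfig() {
    // utility class, no instances
  }

  public static final String CLI_SPARK_MASTER = "SPARK_MASTER";
  public static final String CLI_MAPRED_OUTPUT_COMPRESS = "spark.hadoop.mapred.output.compress";
  // ...the remaining spark.hadoop.* literals used by SparkUtil would move here as well
}
```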


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-07 Thread Bhavani Sudha (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077314#comment-17077314
 ] 

Bhavani Sudha commented on HUDI-69:
---

[~garyli1019] Yes, the InputPathHandler will be able to provide MOR snapshot 
paths. However, I think the FileInputFormat filters out hidden files by default. 
The log files start with a `.` and hence are treated as hidden files by the 
FileInputFormat class. Given this context, when we do super.listStatus from 
HoodieParquetInputFormat - 
[https://github.com/apache/incubator-hudi/blob/b5d093a21bbb19f164fbc549277188f2151232a8/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java#L107]
 - the log files are not listed.
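For reference, Hadoop's FileInputFormat applies a default hidden-file filter that behaves like the sketch below (an illustration of the behavior described above, not Hudi code): any path whose name starts with `.` or `_` is dropped during listing, which is why the dot-prefixed log files never appear.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Illustrative filter mirroring FileInputFormat's default hidden-file behavior.
public class HiddenFileFilterSketch {

  static final PathFilter HIDDEN_FILE_FILTER = new PathFilter() {
    @Override
    public boolean accept(Path p) {
      String name = p.getName();
      // names beginning with '_' or '.' are treated as hidden and excluded from listings
      return !name.startsWith("_") && !name.startsWith(".");
    }
  };
}
```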

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

2020-04-07 Thread GitBox
vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for 
Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-610432304
 
 
   TestMergeOnReadTable or TestClientCopyOnWriteStorage etc., which run a full 
upsert DAG for COW and MOR, are good starting points.. but really, we need to 
run an upsert with a real job to ensure these values also show up in real 
deployments.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-740) [HUDI-740]Fix can not specify the sparkMaster of clean and compact commands

2020-04-07 Thread hong dongdong (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hong dongdong updated HUDI-740:
---
Summary: [HUDI-740]Fix can not specify the sparkMaster of clean and compact 
 commands  (was: Fix can not specify the sparkMaster of cleans run command)

> [HUDI-740]Fix can not specify the sparkMaster of clean and compact  commands
> 
>
> Key: HUDI-740
> URL: https://issues.apache.org/jira/browse/HUDI-740
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: CLI
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Now, we can specify the sparkMaster of the cleans run command, but it does not work. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] tverdokhlebd edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

2020-04-07 Thread GitBox
tverdokhlebd edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during 
upsert 53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610307143
 
 
   So, the process took 2h 40m (local[4] and driver memory 10 GB) and threw 
"java.lang.OutOfMemoryError: GC overhead limit exceeded".
   Log: https://drive.google.com/open?id=1Ark99uXcdp5_4Ns7-DdaSkMfkahJgK6n.
   
   Is it normal that the process takes this long?
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] tverdokhlebd commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

2020-04-07 Thread GitBox
tverdokhlebd commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 
53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610307143
 
 
   So, the process took 2h 40m and threw "java.lang.OutOfMemoryError: GC 
overhead limit exceeded".
   Log: https://drive.google.com/open?id=1Ark99uXcdp5_4Ns7-DdaSkMfkahJgK6n.
   
   Is it normal that the process takes this long?
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on issue #1452: [HUDI-740]Fix can not specify the sparkMaster of cleans run command

2020-04-07 Thread GitBox
yanghua commented on issue #1452: [HUDI-740]Fix can not specify the sparkMaster 
of cleans run command
URL: https://github.com/apache/incubator-hudi/pull/1452#issuecomment-610298785
 
 
   again, is this PR not only for `clean` command?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hddong commented on issue #1490: [HUDI-700]Add unit test for FileSystemViewCommand

2020-04-07 Thread GitBox
hddong commented on issue #1490: [HUDI-700]Add unit test for 
FileSystemViewCommand
URL: https://github.com/apache/incubator-hudi/pull/1490#issuecomment-610298123
 
 
   @yanghua @vinothchandar please have a review.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1452: [HUDI-740]Fix can not specify the sparkMaster of cleans run command

2020-04-07 Thread GitBox
yanghua commented on a change in pull request #1452: [HUDI-740]Fix can not 
specify the sparkMaster of cleans run command
URL: https://github.com/apache/incubator-hudi/pull/1452#discussion_r404690620
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/utils/SparkUtil.java
 ##
 @@ -61,9 +62,14 @@ public static SparkLauncher initLauncher(String 
propertiesFile) throws URISyntax
   }
 
   public static JavaSparkContext initJavaSparkConf(String name) {
+return initJavaSparkConf(name, Option.empty(), Option.empty());
+  }
+
+  public static JavaSparkContext initJavaSparkConf(String name, Option<String> master,
+   Option<String> executorMemory) {
 SparkConf sparkConf = new SparkConf().setAppName(name);
 
-String defMasterFromEnv = sparkConf.getenv("SPARK_MASTER");
+String defMasterFromEnv = master.orElse(sparkConf.getenv("SPARK_MASTER"));
 
 Review comment:
   Actually, I mean, we should avoid using string literals, e.g. `SPARK_MASTER`, 
`spark.hadoop.mapred.output.compress`, and so on. WDYT?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hddong commented on a change in pull request #1452: [HUDI-740]Fix can not specify the sparkMaster of cleans run command

2020-04-07 Thread GitBox
hddong commented on a change in pull request #1452: [HUDI-740]Fix can not 
specify the sparkMaster of cleans run command
URL: https://github.com/apache/incubator-hudi/pull/1452#discussion_r404685892
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/utils/SparkUtil.java
 ##
 @@ -61,9 +62,14 @@ public static SparkLauncher initLauncher(String 
propertiesFile) throws URISyntax
   }
 
   public static JavaSparkContext initJavaSparkConf(String name) {
+return initJavaSparkConf(name, Option.empty(), Option.empty());
+  }
+
+  public static JavaSparkContext initJavaSparkConf(String name, Option<String> master,
+   Option<String> executorMemory) {
 SparkConf sparkConf = new SparkConf().setAppName(name);
 
-String defMasterFromEnv = sparkConf.getenv("SPARK_MASTER");
+String defMasterFromEnv = master.orElse(sparkConf.getenv("SPARK_MASTER"));
 
 Review comment:
   > Can we extract the string literals in this method to be constant fields?
   
   @yanghua There is a constant field `DEFAULT_SPARK_MASTER` for the master.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #143: Tracking ticket for folks to be added to slack group

2020-04-07 Thread GitBox
lamber-ken commented on issue #143: Tracking ticket for folks to be added to 
slack group
URL: https://github.com/apache/incubator-hudi/issues/143#issuecomment-610257073
 
 
   hi @malanb5 @tverdokhlebd, done and welcome : )


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] malanb5 commented on issue #143: Tracking ticket for folks to be added to slack group

2020-04-07 Thread GitBox
malanb5 commented on issue #143: Tracking ticket for folks to be added to slack 
group
URL: https://github.com/apache/incubator-hudi/issues/143#issuecomment-610243459
 
 
   Please add me too: mala...@gmail.com


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] tverdokhlebd removed a comment on issue #1491: [SUPPORT] OutOfMemoryError during upsert 30M records

2020-04-07 Thread GitBox
tverdokhlebd removed a comment on issue #1491: [SUPPORT] OutOfMemoryError 
during upsert 30M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610221356
 
 
   Tried to set this config:
   
   - local[4]
   - driver memory 12GB
   - driver memoryOverhead 2048
   
   And result:
   
   20/04/07 07:05:38 INFO ExternalAppendOnlyMap: Thread 132 spilling in-memory 
map of 1598.4 MB to disk (1 time so far)
   20/04/07 07:05:39 INFO ExternalAppendOnlyMap: Thread 130 spilling in-memory 
map of 1598.4 MB to disk (1 time so far)
   OpenJDK 64-Bit Server VM warning: INFO: 
os::commit_memory(0x00065880, 1736441856, 0) failed; error='Out of 
memory' (errno=12)
   There is insufficient memory for the Java Runtime Environment to continue.
   Native memory allocation (mmap) failed to map 1736441856 bytes for 
committing reserved memory.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1493: [MINOR] remove Hive dependency from delta streamer

2020-04-07 Thread GitBox
codecov-io edited a comment on issue #1493: [MINOR] remove Hive dependency from 
delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1493#issuecomment-610227776
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1493?src=pr=h1) 
Report
   > Merging 
[#1493](https://codecov.io/gh/apache/incubator-hudi/pull/1493?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/eaf6cc2d90bf27c0d9414a4ea18dbd1b61f58e50=desc)
 will **increase** coverage by `0.03%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1493/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1493?src=pr=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #1493      +/-   ##
   ============================================
   + Coverage     71.54%   71.58%   +0.03%     
   + Complexity      261      260       -1     
   ============================================
     Files           336      336              
     Lines         15744    15741       -3     
     Branches       1610     1610              
   ============================================
   + Hits          11264    11268       +4     
   + Misses         3759     3752       -7     
     Partials        721      721              
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1493?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/incubator-hudi/pull/1493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `72.00% <100.00%> (ø)` | `38.00 <0.00> (ø)` | |
   | 
[...i/utilities/deltastreamer/HoodieDeltaStreamer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvSG9vZGllRGVsdGFTdHJlYW1lci5qYXZh)
 | `78.28% <100.00%> (-0.33%)` | `7.00 <1.00> (-1.00)` | |
   | 
[...n/java/org/apache/hudi/common/model/HoodieKey.java](https://codecov.io/gh/apache/incubator-hudi/pull/1493/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUtleS5qYXZh)
 | `88.88% <0.00%> (ø)` | `0.00% <0.00%> (ø%)` | |
   | 
[...src/main/java/org/apache/hudi/metrics/Metrics.java](https://codecov.io/gh/apache/incubator-hudi/pull/1493/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzLmphdmE=)
 | `72.22% <0.00%> (+13.88%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...g/apache/hudi/metrics/InMemoryMetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1493/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9Jbk1lbW9yeU1ldHJpY3NSZXBvcnRlci5qYXZh)
 | `80.00% <0.00%> (+40.00%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1493?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1493?src=pr=footer).
 Last update 
[eaf6cc2...8d5582e](https://codecov.io/gh/apache/incubator-hudi/pull/1493?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] codecov-io commented on issue #1493: [MINOR] remove Hive dependency from delta streamer

2020-04-07 Thread GitBox
codecov-io commented on issue #1493: [MINOR] remove Hive dependency from delta 
streamer
URL: https://github.com/apache/incubator-hudi/pull/1493#issuecomment-610227776
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1493?src=pr=h1) 
Report
   > Merging 
[#1493](https://codecov.io/gh/apache/incubator-hudi/pull/1493?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/eaf6cc2d90bf27c0d9414a4ea18dbd1b61f58e50=desc)
 will **increase** coverage by `0.03%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1493/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1493?src=pr=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #1493      +/-   ##
   ============================================
   + Coverage     71.54%   71.58%   +0.03%     
   + Complexity      261      260       -1     
   ============================================
     Files           336      336              
     Lines         15744    15741       -3     
     Branches       1610     1610              
   ============================================
   + Hits          11264    11268       +4     
   + Misses         3759     3752       -7     
     Partials        721      721              
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1493?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/incubator-hudi/pull/1493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `72.00% <100.00%> (ø)` | `38.00 <0.00> (ø)` | |
   | 
[...i/utilities/deltastreamer/HoodieDeltaStreamer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1493/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvSG9vZGllRGVsdGFTdHJlYW1lci5qYXZh)
 | `78.28% <100.00%> (-0.33%)` | `7.00 <1.00> (-1.00)` | |
   | 
[...n/java/org/apache/hudi/common/model/HoodieKey.java](https://codecov.io/gh/apache/incubator-hudi/pull/1493/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUtleS5qYXZh)
 | `88.88% <0.00%> (ø)` | `0.00% <0.00%> (ø%)` | |
   | 
[...src/main/java/org/apache/hudi/metrics/Metrics.java](https://codecov.io/gh/apache/incubator-hudi/pull/1493/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzLmphdmE=)
 | `72.22% <0.00%> (+13.88%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...g/apache/hudi/metrics/InMemoryMetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1493/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9Jbk1lbW9yeU1ldHJpY3NSZXBvcnRlci5qYXZh)
 | `80.00% <0.00%> (+40.00%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1493?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1493?src=pr=footer).
 Last update 
[eaf6cc2...8d5582e](https://codecov.io/gh/apache/incubator-hudi/pull/1493?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] tverdokhlebd edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during upsert 30M records

2020-04-07 Thread GitBox
tverdokhlebd edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during 
upsert 30M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610221356
 
 
   Tried to set this config:
   
   - local[4]
   - driver memory 12GB
   - driver memoryOverhead 2048
   
   And result:
   
   20/04/07 07:05:38 INFO ExternalAppendOnlyMap: Thread 132 spilling in-memory 
map of 1598.4 MB to disk (1 time so far)
   20/04/07 07:05:39 INFO ExternalAppendOnlyMap: Thread 130 spilling in-memory 
map of 1598.4 MB to disk (1 time so far)
   OpenJDK 64-Bit Server VM warning: INFO: 
os::commit_memory(0x00065880, 1736441856, 0) failed; error='Out of 
memory' (errno=12)
   There is insufficient memory for the Java Runtime Environment to continue.
   Native memory allocation (mmap) failed to map 1736441856 bytes for 
committing reserved memory.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] tverdokhlebd commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 30M records

2020-04-07 Thread GitBox
tverdokhlebd commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 
30M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610221356
 
 
   Tried to set this config:
   
   - local[4]
   - driver memory 12GB
   - driver memoryOverhead 2048
   
   And result:
   
   20/04/07 07:05:38 INFO ExternalAppendOnlyMap: Thread 132 spilling in-memory 
map of 1598.4 MB to disk (1 time so far)
   20/04/07 07:05:39 INFO ExternalAppendOnlyMap: Thread 130 spilling in-memory 
map of 1598.4 MB to disk (1 time so far)
   OpenJDK 64-Bit Server VM warning: INFO: 
os::commit_memory(0x00065880, 1736441856, 0) failed; error='Out of 
memory' (errno=12)
   #
   # There is insufficient memory for the Java Runtime Environment to continue.
   # Native memory allocation (mmap) failed to map 1736441856 bytes for 
committing reserved memory.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch hudi_test_suite_refactor updated (29b4fdf -> 3e2e710)

2020-04-07 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 29b4fdf  [HUDI-394] Provide a basic implementation of test suite
 add 3e2e710  Fix Compilation Issues + Port Bug Fixes

No new revisions were added by this update.

Summary of changes:
 .../src/test/java/TestComplexKeyGenerator.java |  2 +-
 hudi-test-suite/pom.xml|  8 ++
 .../testsuite/configuration/DFSDeltaConfig.java|  2 +-
 .../hudi/testsuite/configuration/DeltaConfig.java  |  2 +-
 .../hudi/testsuite/dag/nodes/BulkInsertNode.java   |  2 +-
 .../hudi/testsuite/dag/nodes/CompactNode.java  |  2 +-
 .../hudi/testsuite/dag/nodes/HiveQueryNode.java|  1 +
 .../hudi/testsuite/dag/nodes/InsertNode.java   |  2 +-
 .../hudi/testsuite/dag/nodes/UpsertNode.java   |  2 +-
 .../generator/LazyRecordGeneratorIterator.java |  2 +-
 .../helpers/DFSTestSuitePathSelector.java  |  2 +-
 .../testsuite/job/HoodieDeltaStreamerWrapper.java  |  2 +-
 .../hudi/testsuite/job/HoodieTestSuiteJob.java |  6 ++--
 .../hudi/testsuite/reader/DFSDeltaInputReader.java |  2 +-
 .../reader/DFSHoodieDatasetInputReader.java|  8 +++---
 .../testsuite/writer/AvroDeltaInputWriter.java |  2 +-
 .../apache/hudi/testsuite/writer/DeltaWriter.java  |  6 ++--
 .../hudi/testsuite/TestDFSDeltaWriterAdapter.java  |  4 +--
 .../hudi/testsuite/TestFileDeltaInputWriter.java   |  2 +-
 .../hudi/testsuite/dag/ComplexDagGenerator.java|  2 +-
 .../TestGenericRecordPayloadGenerator.java |  2 +-
 .../hudi/testsuite/job/TestHoodieTestSuiteJob.java |  2 +-
 .../reader/TestDFSAvroDeltaInputReader.java|  2 +-
 .../reader/TestDFSHoodieDatasetInputReader.java|  6 ++--
 .../hudi/testsuite/writer/TestDeltaWriter.java |  4 +--
 .../deltastreamer/HoodieDeltaStreamer.java |  4 +++
 .../apache/hudi/utilities/UtilitiesTestBase.java   |  1 -
 packaging/hudi-test-suite-bundle/pom.xml   | 33 --
 28 files changed, 54 insertions(+), 61 deletions(-)



[GitHub] [incubator-hudi] n3nash merged pull request #1494: Fix Compilation Issues + Port Bug Fixes

2020-04-07 Thread GitBox
n3nash merged pull request #1494: Fix Compilation Issues + Port Bug Fixes
URL: https://github.com/apache/incubator-hudi/pull/1494
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] modi95 opened a new pull request #1494: Fix Compilation Issues + Port Bug Fixes

2020-04-07 Thread GitBox
modi95 opened a new pull request #1494: Fix Compilation Issues + Port Bug Fixes
URL: https://github.com/apache/incubator-hudi/pull/1494
 
 
   ## What is the purpose of the pull request
   
   - Port bug fixes to test suite
   - Additional features to test suite will be added in a separate PR
   - Fix compilation issues
   
   ## Verify this pull request
   
   - This pull request is a trivial rework / code cleanup without any test 
coverage.
   - I have yet to actually run the tests. Will do this momentarily. 
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] loagosad commented on issue #1438: How to get the file name corresponding to HoodieKey through the GlobalBloomIndex

2020-04-07 Thread GitBox
loagosad commented on issue #1438: How to get the file name corresponding to 
HoodieKey through the GlobalBloomIndex 
URL: https://github.com/apache/incubator-hudi/issues/1438#issuecomment-610196060
 
 
   @nsivabalan I have committed the records before reading. In the test code, 
I am just using a HoodieKey to find the fileId, but when I set partitionPath=null 
with IndexType.GLOBAL_BLOOM, the result is empty. When I set a non-empty 
partitionPath in the HoodieKey, the result is OK.
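A minimal sketch of that observation (the partition value is only an example): even with IndexType.GLOBAL_BLOOM, the HoodieKey handed to the index lookup apparently needs a concrete partition path; passing null yields an empty result.

```java
import org.apache.hudi.common.model.HoodieKey;

// Hedged sketch: build lookup keys with a real partition path rather than null.
public class GlobalBloomLookupSketch {

  public static HoodieKey keyFor(String recordKey) {
    // Derive the partition path the same way the writer derived it, e.g. "2020/04/07" (example value).
    return new HoodieKey(recordKey, "2020/04/07");
  }
}
```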


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services