[jira] [Assigned] (HUDI-998) Introduce a robot to build testing website automatically

2020-06-05 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-998:
---

Assignee: lamber-ken

> Introduce a robot to build testing website automatically
> 
>
> Key: HUDI-998
> URL: https://issues.apache.org/jira/browse/HUDI-998
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-998) Introduce a robot to build testing website automatically

2020-06-05 Thread lamber-ken (Jira)
lamber-ken created HUDI-998:
---

 Summary: Introduce a robot to build testing website automatically
 Key: HUDI-998
 URL: https://issues.apache.org/jira/browse/HUDI-998
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: lamber-ken






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-998) Introduce a robot to build testing website automatically

2020-06-05 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-998:

Status: Open  (was: New)

> Introduce a robot to build testing website automatically
> 
>
> Key: HUDI-998
> URL: https://issues.apache.org/jira/browse/HUDI-998
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-997) Debug possible memory leaks when running tests in hudi-client

2020-06-04 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126090#comment-17126090
 ] 

lamber-ken commented on HUDI-997:
-

Sure, thanks :)

> Debug possible memory leaks when running tests in hudi-client
> -
>
> Key: HUDI-997
> URL: https://issues.apache.org/jira/browse/HUDI-997
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Testing
>Reporter: Balaji Varadarajan
>Assignee: lamber-ken
>Priority: Major
>
> Using VisualVM, I noticed a memory-leak pattern: a gradual increase in 
> reachable memory (after GC) as the tests progress. 
> Some possible candidates where I noticed a marked increase in memory while 
> the tests were running: 
>  
> [INFO] Running org.apache.hudi.table.TestCleaner
> [INFO] Running org.apache.hudi.table.TestHoodieMergeOnReadTable
> [INFO] Running 
> org.apache.hudi.table.action.commit.TestCopyOnWriteActionExecutor
> [INFO] Running org.apache.hudi.table.action.compact.TestAsyncCompaction
> [INFO] Running org.apache.hudi.index.hbase.TestHBaseIndex
> [INFO] Running org.apache.hudi.index.TestHoodieIndex
> [INFO] Running org.apache.hudi.client.TestHoodieClientOnCopyOnWriteStorage
>  
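A minimal sketch of the measurement this kind of report relies on, in case others want to reproduce it: force a GC between suites and log how much heap stays reachable. The class and method below are hypothetical helpers, not Hudi code.

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper: call after each suite above to see which one
// leaves reachable memory behind once a full GC has run.
public class HeapWatch {
  public static void logRetainedHeap(String suiteName) {
    MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
    mem.gc(); // request a full GC so "used" approximates reachable memory
    MemoryUsage heap = mem.getHeapMemoryUsage();
    System.out.printf("[heap] %s: used=%d MB, committed=%d MB%n",
        suiteName, heap.getUsed() >> 20, heap.getCommitted() >> 20);
  }
}
{code}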



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-997) Debug possible memory leaks when running tests in hudi-client

2020-06-04 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-997:
---

Assignee: lamber-ken

> Debug possible memory leaks when running tests in hudi-client
> -
>
> Key: HUDI-997
> URL: https://issues.apache.org/jira/browse/HUDI-997
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Testing
>Reporter: Balaji Varadarajan
>Assignee: lamber-ken
>Priority: Major
>
> Using VisualVM, I noticed a memory-leak pattern: a gradual increase in 
> reachable memory (after GC) as the tests progress. 
> Some possible candidates where I noticed a marked increase in memory while 
> the tests were running: 
>  
> [INFO] Running org.apache.hudi.table.TestCleaner
> [INFO] Running org.apache.hudi.table.TestHoodieMergeOnReadTable
> [INFO] Running 
> org.apache.hudi.table.action.commit.TestCopyOnWriteActionExecutor
> [INFO] Running org.apache.hudi.table.action.compact.TestAsyncCompaction
> [INFO] Running org.apache.hudi.index.hbase.TestHBaseIndex
> [INFO] Running org.apache.hudi.index.TestHoodieIndex
> [INFO] Running org.apache.hudi.client.TestHoodieClientOnCopyOnWriteStorage
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-986) Support staging site per pull request

2020-06-01 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-986:

Status: Open  (was: New)

> Support staging site per pull request
> -
>
> Key: HUDI-986
> URL: https://issues.apache.org/jira/browse/HUDI-986
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-986) Support staging site per pull request

2020-06-01 Thread lamber-ken (Jira)
lamber-ken created HUDI-986:
---

 Summary: Support staging site per pull request
 Key: HUDI-986
 URL: https://issues.apache.org/jira/browse/HUDI-986
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: lamber-ken






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-986) Support staging site per pull request

2020-06-01 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-986:
---

Assignee: lamber-ken

> Support staging site per pull request
> -
>
> Key: HUDI-986
> URL: https://issues.apache.org/jira/browse/HUDI-986
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-986) Support staging site per pull request

2020-06-01 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-986:

Description: Support staging site per pull request

> Support staging site per pull request
> -
>
> Key: HUDI-986
> URL: https://issues.apache.org/jira/browse/HUDI-986
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Support staging site per pull request



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-985) Introduce rerun CI bot

2020-05-31 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-985:

Status: Open  (was: New)

> Introduce rerun CI bot
> --
>
> Key: HUDI-985
> URL: https://issues.apache.org/jira/browse/HUDI-985
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Introduce a rerun CI bot to help rerun tests.
>  
> Replace
> {code:java}
> git commit --amend/git push --force
> {code}
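For illustration only, one way such a bot could retrigger tests without rewriting history is GitHub's Checks API; the endpoint and token handling below are assumptions about a possible implementation, not the bot this ticket produced.

{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch: re-request the check suite of a PR's head commit
// instead of forcing contributors to amend and force-push.
public class RerunCiSketch {
  public static void main(String[] args) throws Exception {
    long checkSuiteId = Long.parseLong(args[0]); // looked up from the PR's head SHA
    HttpRequest req = HttpRequest.newBuilder()
        .uri(URI.create("https://api.github.com/repos/apache/incubator-hudi/check-suites/"
            + checkSuiteId + "/rerequest"))
        .header("Authorization", "token " + System.getenv("GITHUB_TOKEN"))
        .header("Accept", "application/vnd.github.v3+json")
        .POST(HttpRequest.BodyPublishers.noBody())
        .build();
    HttpResponse<String> resp = HttpClient.newHttpClient()
        .send(req, HttpResponse.BodyHandlers.ofString());
    System.out.println("rerequest status: " + resp.statusCode());
  }
}
{code}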



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-985) Introduce rerun CI bot

2020-05-31 Thread lamber-ken (Jira)
lamber-ken created HUDI-985:
---

 Summary: Introduce rerun CI bot
 Key: HUDI-985
 URL: https://issues.apache.org/jira/browse/HUDI-985
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: lamber-ken


Introduce a rerun CI bot to help rerun tests.

 

Replace
{code:java}
git commit --amend/git push --force
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-985) Introduce rerun CI bot

2020-05-31 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-985:
---

Assignee: lamber-ken

> Introduce rerun CI bot
> --
>
> Key: HUDI-985
> URL: https://issues.apache.org/jira/browse/HUDI-985
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Introduce a rerun CI bot to help rerun tests.
>  
> Replace
> {code:java}
> git commit --amend/git push --force
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-937) Site content revamp ahead of graduation

2020-05-24 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-937:

Status: Open  (was: New)

> Site content revamp ahead of graduation
> ---
>
> Key: HUDI-937
> URL: https://issues.apache.org/jira/browse/HUDI-937
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Release & Administrative
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>
> A few good things to do: 
>  
>  * Update all the users in the powered-by page 
>  * Update the home page with new features text, intro text, etc. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-937) Site content revamp ahead of graduation

2020-05-24 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115402#comment-17115402
 ] 

lamber-ken edited comment on HUDI-937 at 5/24/20, 5:59 PM:
---

Dear [~vinoth], please let me know if I can be of more assistance; you can 
contact me through Slack :)


was (Author: lamber-ken):
dear [~vinoth] please let me know if I can be of more assistance, contact me 
through Slack :)

> Site content revamp ahead of graduation
> ---
>
> Key: HUDI-937
> URL: https://issues.apache.org/jira/browse/HUDI-937
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Release & Administrative
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>
> A few good things to do: 
>  
>  * Update all the users in the powered-by page 
>  * Update the home page with new features text, intro text, etc. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-937) Site content revamp ahead of graduation

2020-05-24 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115402#comment-17115402
 ] 

lamber-ken commented on HUDI-937:
-

dear [~vinoth] please let me know if I can be of more assistance, contact me 
through Slack :)

> Site content revamp ahead of graduation
> ---
>
> Key: HUDI-937
> URL: https://issues.apache.org/jira/browse/HUDI-937
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Release & Administrative
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>
> A few good things to do: 
>  
>  * Update all the users in the powered-by page 
>  * Update the home page with new features text, intro text, etc. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-930) Check the private@, dev@ list

2020-05-24 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-930:
---

Assignee: Vinoth Chandar

> Check the private@, dev@ list
> -
>
> Key: HUDI-930
> URL: https://issues.apache.org/jira/browse/HUDI-930
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Release & Administrative
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>
> * Check project-private mailing list membership. Mentors should be allowed to 
> remain if they wish to do so. The subscriber list should otherwise match that 
> on the resolution. See 
> [this|http://www.apache.org/dev/committers.html#mail-moderate] and the 
> [EZMLM|http://www.ezmlm.org/] "Moderator’s and Administrator’s Manual".
>  * Double-check that all of your lists have sufficient active 
> [moderators|http://www.apache.org/dev/committers.html#mailing-list-moderators].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-923) Updating the site to reflect graduation

2020-05-24 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken closed HUDI-923.
---
Resolution: Fixed

> Updating the site to reflect graduation
> ---
>
> Key: HUDI-923
> URL: https://issues.apache.org/jira/browse/HUDI-923
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Release & Administrative
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
>
> * Remove references to "Incubating" in all pages, logos, text, etc.
>  * Update PMC/Chair/Committers information correctly. 
>  * Remove the incubator disclaimer on site?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-923) Updating the site to reflect graduation

2020-05-24 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reopened HUDI-923:
-

> Updating the site to reflect graduation
> ---
>
> Key: HUDI-923
> URL: https://issues.apache.org/jira/browse/HUDI-923
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Release & Administrative
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
>
> * Remove references to "Incubating" in all pages, logos, text, etc.
>  * Update PMC/Chair/Committers information correctly. 
>  * Remove the incubator disclaimer on site?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-935) Update Travis name

2020-05-24 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-935:
---

Assignee: leesf

> Update Travis name
> --
>
> Key: HUDI-935
> URL: https://issues.apache.org/jira/browse/HUDI-935
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: leesf
>Assignee: leesf
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-927) https://hudi.incubator.apache.org should auto redirect to https://hudi.apache.org

2020-05-24 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115340#comment-17115340
 ] 

lamber-ken commented on HUDI-927:
-

Thanks [~smarthi]

> https://hudi.incubator.apache.org should auto redirect to 
> https://hudi.apache.org
> -
>
> Key: HUDI-927
> URL: https://issues.apache.org/jira/browse/HUDI-927
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Release & Administrative
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
> Fix For: 0.5.3
>
>
> This is still not happening. We need to wait it out for a few days, and if 
> it is still not working, raise an INFRA jira. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-897) Hudi support for the log append scenario with better writes and asynchronous compaction

2020-05-14 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-897:

Status: Open  (was: New)

> Hudi support for the log append scenario with better writes and asynchronous compaction
> --
>
> Key: HUDI-897
> URL: https://issues.apache.org/jira/browse/HUDI-897
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Compaction, Performance
>Reporter: liwei
>Assignee: liwei
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-05-14-19-51-37-938.png, 
> image-2020-05-14-20-14-59-429.png
>
>
> 1. Scenario
> The business scenarios of a data lake mainly include analysis of databases, 
> logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also aims at these three scenarios. [1]
>  
> 2. Hudi current situation
> At present, Hudi supports the scenario of incrementally writing database CDC 
> data fairly well, and bulk loading files into Hudi is also in progress. 
> However, there is no good native support for log scenarios (which require 
> high-throughput writes, have no updates or deletions, and involve many small 
> files); records can currently be written as inserts without deduplication, 
> but they will still be merged on the write side.
>  * In copy-on-write mode, when "hoodie.parquet.small.file.limit" is 100MB, 
> every small batch still costs some time to merge, which reduces write 
> throughput.
>  * This scenario is not a good fit for merge-on-read.
>  * The actual scenario only needs to write parquet files in batches on the 
> write side, and then provide compaction afterwards (similar to Delta Lake).
> 3. What we can do
>  1. On the write side, just write every batch to a parquet file based on the 
> snapshot mechanism; merging is enabled by default, and users can disable 
> auto-merge for more write throughput.
>  2. Hudi could support asynchronously merging small parquet files, like 
> Databricks Delta Lake's OPTIMIZE command. [2]
>  
> [1] [https://databricks.com/product/delta-lake-on-databricks]
> [2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html]
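To make the write-side proposal concrete, here is a hedged sketch using documented Hudi write options; the table name and dataset are placeholders, and whether disabling the small-file limit is advisable depends on the asynchronous compaction that item 2 proposes.

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Illustrative "log append" write: plain inserts, no pre-insert dedup, and
// small-file merging disabled so each batch lands as new parquet files.
public class LogAppendWriteSketch {
  public static void append(Dataset<Row> batch, String basePath) {
    batch.write().format("hudi")
        .option("hoodie.table.name", "log_events")             // placeholder name
        .option("hoodie.datasource.write.operation", "insert") // no update/delete path
        .option("hoodie.combine.before.insert", "false")       // skip dedup merge
        .option("hoodie.parquet.small.file.limit", "0")        // no small-file bin-packing
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
{code}

With the small-file limit at 0, file sizing would be deferred entirely to the asynchronous OPTIMIZE-style compaction described above.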



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-897) Hudi support for the log append scenario with better writes and asynchronous compaction

2020-05-14 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17107378#comment-17107378
 ] 

lamber-ken edited comment on HUDI-897 at 5/14/20, 2:50 PM:
---

Great addition from my side (y)


was (Author: lamber-ken):
Gread addtion from my side (y)

> Hudi support for the log append scenario with better writes and asynchronous compaction
> --
>
> Key: HUDI-897
> URL: https://issues.apache.org/jira/browse/HUDI-897
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Compaction, Performance
>Reporter: liwei
>Assignee: liwei
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-05-14-19-51-37-938.png, 
> image-2020-05-14-20-14-59-429.png
>
>
> 1. Scenario
> The business scenarios of a data lake mainly include analysis of databases, 
> logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also aims at these three scenarios. [1]
>  
> 2. Hudi current situation
> At present, Hudi supports the scenario of incrementally writing database CDC 
> data fairly well, and bulk loading files into Hudi is also in progress. 
> However, there is no good native support for log scenarios (which require 
> high-throughput writes, have no updates or deletions, and involve many small 
> files); records can currently be written as inserts without deduplication, 
> but they will still be merged on the write side.
>  * In copy-on-write mode, when "hoodie.parquet.small.file.limit" is 100MB, 
> every small batch still costs some time to merge, which reduces write 
> throughput.
>  * This scenario is not a good fit for merge-on-read.
>  * The actual scenario only needs to write parquet files in batches on the 
> write side, and then provide compaction afterwards (similar to Delta Lake).
> 3. What we can do
>  1. On the write side, just write every batch to a parquet file based on the 
> snapshot mechanism; merging is enabled by default, and users can disable 
> auto-merge for more write throughput.
>  2. Hudi could support asynchronously merging small parquet files, like 
> Databricks Delta Lake's OPTIMIZE command. [2]
>  
> [1] [https://databricks.com/product/delta-lake-on-databricks]
> [2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-897) Hudi support for the log append scenario with better writes and asynchronous compaction

2020-05-14 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17107378#comment-17107378
 ] 

lamber-ken commented on HUDI-897:
-

Gread addtion from my side (y)

> Hudi support for the log append scenario with better writes and asynchronous compaction
> --
>
> Key: HUDI-897
> URL: https://issues.apache.org/jira/browse/HUDI-897
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Compaction, Performance
>Reporter: liwei
>Assignee: liwei
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-05-14-19-51-37-938.png, 
> image-2020-05-14-20-14-59-429.png
>
>
> 1. Scenario
> The business scenarios of a data lake mainly include analysis of databases, 
> logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also aims at these three scenarios. [1]
>  
> 2. Hudi current situation
> At present, Hudi supports the scenario of incrementally writing database CDC 
> data fairly well, and bulk loading files into Hudi is also in progress. 
> However, there is no good native support for log scenarios (which require 
> high-throughput writes, have no updates or deletions, and involve many small 
> files); records can currently be written as inserts without deduplication, 
> but they will still be merged on the write side.
>  * In copy-on-write mode, when "hoodie.parquet.small.file.limit" is 100MB, 
> every small batch still costs some time to merge, which reduces write 
> throughput.
>  * This scenario is not a good fit for merge-on-read.
>  * The actual scenario only needs to write parquet files in batches on the 
> write side, and then provide compaction afterwards (similar to Delta Lake).
> 3. What we can do
>  1. On the write side, just write every batch to a parquet file based on the 
> snapshot mechanism; merging is enabled by default, and users can disable 
> auto-merge for more write throughput.
>  2. Hudi could support asynchronously merging small parquet files, like 
> Databricks Delta Lake's OPTIMIZE command. [2]
>  
> [1] [https://databricks.com/product/delta-lake-on-databricks]
> [2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-897) Hudi support for the log append scenario with better writes and asynchronous compaction

2020-05-14 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-897:

Issue Type: Improvement  (was: Bug)

> Hudi support for the log append scenario with better writes and asynchronous compaction
> --
>
> Key: HUDI-897
> URL: https://issues.apache.org/jira/browse/HUDI-897
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Compaction, Performance
>Reporter: liwei
>Assignee: liwei
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-05-14-19-51-37-938.png, 
> image-2020-05-14-20-14-59-429.png
>
>
> 1. Scenario
> The business scenarios of a data lake mainly include analysis of databases, 
> logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also aims at these three scenarios. [1]
>  
> 2. Hudi current situation
> At present, Hudi supports the scenario of incrementally writing database CDC 
> data fairly well, and bulk loading files into Hudi is also in progress. 
> However, there is no good native support for log scenarios (which require 
> high-throughput writes, have no updates or deletions, and involve many small 
> files); records can currently be written as inserts without deduplication, 
> but they will still be merged on the write side.
>  * In copy-on-write mode, when "hoodie.parquet.small.file.limit" is 100MB, 
> every small batch still costs some time to merge, which reduces write 
> throughput.
>  * This scenario is not a good fit for merge-on-read.
>  * The actual scenario only needs to write parquet files in batches on the 
> write side, and then provide compaction afterwards (similar to Delta Lake).
> 3. What we can do
>  1. On the write side, just write every batch to a parquet file based on the 
> snapshot mechanism; merging is enabled by default, and users can disable 
> auto-merge for more write throughput.
>  2. Hudi could support asynchronously merging small parquet files, like 
> Databricks Delta Lake's OPTIMIZE command. [2]
>  
> [1] [https://databricks.com/product/delta-lake-on-databricks]
> [2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-891) Improve websites for graduation required content

2020-05-13 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken resolved HUDI-891.
-
Resolution: Fixed

> Improve websites for graduation required content
> 
>
> Key: HUDI-891
> URL: https://issues.apache.org/jira/browse/HUDI-891
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Check site:
> [https://whimsy.apache.org/pods/project/hudi]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-891) Improve websites for graduation required content

2020-05-13 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-891:
---

Assignee: lamber-ken

> Improve websites for graduation required content
> 
>
> Key: HUDI-891
> URL: https://issues.apache.org/jira/browse/HUDI-891
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Check site:
> [https://whimsy.apache.org/pods/project/hudi]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-891) Improve websites for graduation required content

2020-05-13 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-891:

Status: Open  (was: New)

> Improve websites for graduation required content
> 
>
> Key: HUDI-891
> URL: https://issues.apache.org/jira/browse/HUDI-891
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: lamber-ken
>Priority: Major
>
> Check site:
> [https://whimsy.apache.org/pods/project/hudi]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-891) Improve websites for graduation required content

2020-05-13 Thread lamber-ken (Jira)
lamber-ken created HUDI-891:
---

 Summary: Improve websites for graduation required content
 Key: HUDI-891
 URL: https://issues.apache.org/jira/browse/HUDI-891
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: lamber-ken


Check site:

[https://whimsy.apache.org/pods/project/hudi]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-884) Hive Syncing using standalone tool failing due to avro version mismatches

2020-05-12 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-884:

Status: Open  (was: New)

> Hive Syncing using standalone tool failing due to avro version mismatches
> -
>
> Key: HUDI-884
> URL: https://issues.apache.org/jira/browse/HUDI-884
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Assignee: Udit Mehrotra
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Context : [https://github.com/apache/incubator-hudi/issues/1610]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-884) Hive Syncing using standalone tool failing due to avro version mismatches

2020-05-12 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken resolved HUDI-884.
-
Resolution: Fixed

Thanks [~uditme]

> Hive Syncing using standalone tool failing due to avro version mismatches
> -
>
> Key: HUDI-884
> URL: https://issues.apache.org/jira/browse/HUDI-884
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Assignee: Udit Mehrotra
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Context : [https://github.com/apache/incubator-hudi/issues/1610]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-12 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105240#comment-17105240
 ] 

lamber-ken commented on HUDI-494:
-

Seems like a bug; we can discuss this in depth.

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manually built master after the 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result.
> I am seeing 3 million tasks when the Hudi Spark job writes the files into 
> HDFS. It seems to be related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. 
> I am seeing a huge amount of 0 byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task only writes less 
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning via the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  
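If the bloom-index guess turns out to be right, one concrete knob to try while debugging is pinning the index parallelism instead of letting it scale with the input size; the option name is a real Hudi config, but the value below is arbitrary.

{code:java}
import org.apache.spark.sql.DataFrameWriter;
import org.apache.spark.sql.Row;

// Illustrative only: cap bloom index lookup parallelism to see whether the
// runaway task count tracks the index rather than the write itself.
public class BloomParallelismSketch {
  public static DataFrameWriter<Row> capIndexParallelism(DataFrameWriter<Row> writer) {
    return writer.option("hoodie.bloom.index.parallelism", "200");
  }
}
{code}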



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-600) Cleaner fails with AVRO exception when upgrading from 0.5.0 to master

2020-05-10 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-600:
---

Assignee: Balaji Varadarajan

> Cleaner fails with AVRO exception when upgrading from 0.5.0 to master
> -
>
> Key: HUDI-600
> URL: https://issues.apache.org/jira/browse/HUDI-600
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Cleaner
>Reporter: Nishith Agarwal
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: help-requested
> Fix For: 0.6.0
>
>
> ```
> org.apache.avro.AvroTypeException: Found 
> org.apache.hudi.avro.model.HoodieCleanMetadata, expecting 
> org.apache.hudi.avro.model.HoodieCleanerPlan, missing required field policy
> at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
> at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> at 
> org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130)
> at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
> at 
> org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
> at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
> at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
> at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:149)
> at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
> at org.apache.hudi.HoodieCleanClient.runClean(HoodieCleanClient.java:144)
> at org.apache.hudi.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:89)
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
> at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
> at org.apache.hudi.HoodieCleanClient.clean(HoodieCleanClient.java:87)
> at org.apache.hudi.HoodieWriteClient.clean(HoodieWriteClient.java:837)
> at org.apache.hudi.HoodieWriteClient.postCommit(HoodieWriteClient.java:514)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:156)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:100)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:91)
> at 
> org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:261)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:183)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
> at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> ```
>  
> [~varadarb] any ideas about this?
>  
> [~thesquelched] fyi
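The failure mode itself is plain Avro schema resolution: the file on disk was written as a HoodieCleanMetadata but is deserialized with HoodieCleanerPlan as the expected schema. A self-contained sketch that reproduces the same exception with stand-in schemas (the field names are invented; Hudi's real classes are Avro-generated):

{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.apache.avro.AvroTypeException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class CleanPlanMismatchSketch {
  public static void main(String[] args) throws Exception {
    Schema written = SchemaBuilder.record("HoodieCleanMetadata").fields()
        .requiredLong("timeTakenInMillis").endRecord();
    Schema expected = SchemaBuilder.record("HoodieCleanerPlan").fields()
        .requiredString("policy").endRecord();

    // Write a container file as "HoodieCleanMetadata" ...
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (DataFileWriter<GenericRecord> w =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(written))) {
      w.create(written, out);
      GenericRecord r = new GenericData.Record(written);
      r.put("timeTakenInMillis", 42L);
      w.append(r);
    }

    // ... then read it back expecting "HoodieCleanerPlan", as the cleaner did.
    try (DataFileStream<GenericRecord> in = new DataFileStream<>(
            new ByteArrayInputStream(out.toByteArray()),
            new GenericDatumReader<GenericRecord>(null, expected))) {
      in.next();
    } catch (AvroTypeException e) {
      // "Found HoodieCleanMetadata, expecting HoodieCleanerPlan, missing
      // required field policy" -- one fix is to pick the reader schema based
      // on what was actually written rather than assuming the newest layout.
      System.out.println(e.getMessage());
    }
  }
}
{code}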



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (HUDI-600) Cleaner fails with AVRO exception when upgrading from 0.5.0 to master

2020-05-10 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103994#comment-17103994
 ] 

lamber-ken commented on HUDI-600:
-

[~vbalaji] good job! Left one minor comment.

> Cleaner fails with AVRO exception when upgrading from 0.5.0 to master
> -
>
> Key: HUDI-600
> URL: https://issues.apache.org/jira/browse/HUDI-600
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Cleaner
>Reporter: Nishith Agarwal
>Priority: Major
>  Labels: help-requested
> Fix For: 0.6.0
>
>
> ```
> org.apache.avro.AvroTypeException: Found 
> org.apache.hudi.avro.model.HoodieCleanMetadata, expecting 
> org.apache.hudi.avro.model.HoodieCleanerPlan, missing required field policy
> at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
> at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> at 
> org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130)
> at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
> at 
> org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
> at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
> at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
> at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:149)
> at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
> at org.apache.hudi.HoodieCleanClient.runClean(HoodieCleanClient.java:144)
> at org.apache.hudi.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:89)
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
> at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
> at org.apache.hudi.HoodieCleanClient.clean(HoodieCleanClient.java:87)
> at org.apache.hudi.HoodieWriteClient.clean(HoodieWriteClient.java:837)
> at org.apache.hudi.HoodieWriteClient.postCommit(HoodieWriteClient.java:514)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:156)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:100)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:91)
> at 
> org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:261)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:183)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
> at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> ```
>  
> [~varadarb] any ideas about this?
>  
> [~thesquelched] fyi



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Closed] (HUDI-787) Implement HoodieGlobalBloomIndexV2

2020-05-09 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken closed HUDI-787.
---
Resolution: Fixed

Finished [https://github.com/apache/incubator-hudi/pull/1469]

> Implement HoodieGlobalBloomIndexV2
> --
>
> Key: HUDI-787
> URL: https://issues.apache.org/jira/browse/HUDI-787
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Index
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Implement HoodieGlobalBloomIndexV2 based on HoodieBloomIndexV2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-836) Implement datadog metrics reporter

2020-04-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-836:
---

Assignee: Raymond Xu

> Implement datadog metrics reporter
> --
>
> Key: HUDI-836
> URL: https://issues.apache.org/jira/browse/HUDI-836
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.6.0
>
>
> Implement a new metrics reporter type for the Datadog API.
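As a rough sense of scope, the core of such a reporter is a periodic POST of metric points; the endpoint and payload shape below are assumptions based on Datadog's public v1 series API, and the metric name and value are placeholders, not what the ticket implemented.

{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DatadogPostSketch {
  public static void main(String[] args) throws Exception {
    String apiKey = System.getenv("DD_API_KEY"); // assumed auth mechanism
    long now = System.currentTimeMillis() / 1000;
    String body = "{\"series\":[{\"metric\":\"hudi.commit.duration\","
        + "\"points\":[[" + now + ",42]],\"type\":\"gauge\"}]}";
    HttpRequest req = HttpRequest.newBuilder()
        .uri(URI.create("https://api.datadoghq.com/api/v1/series?api_key=" + apiKey))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
    HttpResponse<String> resp = HttpClient.newHttpClient()
        .send(req, HttpResponse.BodyHandlers.ofString());
    System.out.println("datadog status: " + resp.statusCode());
  }
}
{code}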



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-836) Implement datadog metrics reporter

2020-04-23 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091134#comment-17091134
 ] 

lamber-ken commented on HUDI-836:
-

(y)

> Implement datadog metrics reporter
> --
>
> Key: HUDI-836
> URL: https://issues.apache.org/jira/browse/HUDI-836
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.6.0
>
>
> Implement a new metrics reporter type for the Datadog API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-787) Implement HoodieGlobalBloomIndexV2

2020-04-22 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-787:
---

Assignee: lamber-ken

> Implement HoodieGlobalBloomIndexV2
> --
>
> Key: HUDI-787
> URL: https://issues.apache.org/jira/browse/HUDI-787
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Index
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Implement HoodieGlobalBloomIndexV2 based on HoodieBloomIndexV2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-821) Fix the wrong annotation of JCommander IStringConverter

2020-04-22 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-821:

Status: Open  (was: New)

> Fix the wrong annotation of JCommander IStringConverter
> ---
>
> Key: HUDI-821
> URL: https://issues.apache.org/jira/browse/HUDI-821
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: dengziming
>Assignee: dengziming
>Priority: Minor
>  Labels: pull-request-available
>
> Please refer to https://github.com/cbeust/jcommander/issues/253.
> If you define a list argument to be parsed with an IStringConverter, 
> JCommander will create a List<List<...>> instead of a List<...>.
> We should change `converter = TransformersConverter.class` to `converter = 
> StringConverter.class, listConverter = TransformersConverter.class`.
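A minimal, self-contained sketch of the pattern described above, using JCommander's built-in StringConverter; Hudi's TransformersConverter is project-specific, so it only appears in a comment.

{code:java}
import com.beust.jcommander.JCommander;
import com.beust.jcommander.Parameter;
import com.beust.jcommander.converters.StringConverter;
import java.util.ArrayList;
import java.util.List;

public class TransformerArgs {
  // `converter` runs per token; a `listConverter` (e.g. Hudi's
  // TransformersConverter) would turn the whole value into the final list type.
  @Parameter(names = "--transformer-class", converter = StringConverter.class)
  public List<String> transformerClassNames = new ArrayList<>();

  public static void main(String[] argv) {
    TransformerArgs args = new TransformerArgs();
    JCommander.newBuilder().addObject(args).build().parse(argv);
    System.out.println(args.transformerClassNames); // a flat List, not List<List<...>>
  }
}
{code}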



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-821) Fix the wrong annotation of JCommander IStringConverter

2020-04-22 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken closed HUDI-821.
---
Resolution: Fixed

> Fix the wrong annotation of JCommander IStringConverter
> ---
>
> Key: HUDI-821
> URL: https://issues.apache.org/jira/browse/HUDI-821
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: dengziming
>Assignee: dengziming
>Priority: Minor
>  Labels: pull-request-available
>
> Please refer to https://github.com/cbeust/jcommander/issues/253.
> If you define a list argument to be parsed with an IStringConverter, 
> JCommander will create a List<List<...>> instead of a List<...>.
> We should change `converter = TransformersConverter.class` to `converter = 
> StringConverter.class, listConverter = TransformersConverter.class`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-827) Translation error

2020-04-22 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken resolved HUDI-827.
-
Resolution: Fixed

> Translation error
> -
>
> Key: HUDI-827
> URL: https://issues.apache.org/jira/browse/HUDI-827
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: docs-chinese
>Affects Versions: 0.5.2
>Reporter: Lisheng Wang
>Assignee: Lisheng Wang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Found a translation error in 
> [https://hudi.apache.org/cn/docs/writing_data.html]: 
> "如优化文件大小之类后" should be "如优化文件大小之后". 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-818) Optimize the default value of hoodie.memory.merge.max.size option

2020-04-20 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-818:

Fix Version/s: 0.6.0

> Optimize the default value of hoodie.memory.merge.max.size option
> -
>
> Key: HUDI-818
> URL: https://issues.apache.org/jira/browse/HUDI-818
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance
>Reporter: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>
> The default value of the hoodie.memory.merge.max.size option is incapable of 
> meeting some users' performance requirements.
> [https://github.com/apache/incubator-hudi/issues/1491]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-818) Optimize the default value of hoodie.memory.merge.max.size option

2020-04-20 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-818:

Status: Open  (was: New)

> Optimize the default value of hoodie.memory.merge.max.size option
> -
>
> Key: HUDI-818
> URL: https://issues.apache.org/jira/browse/HUDI-818
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance
>Reporter: lamber-ken
>Priority: Major
>
> The default value of the hoodie.memory.merge.max.size option is incapable of 
> meeting some users' performance requirements.
> [https://github.com/apache/incubator-hudi/issues/1491]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-818) Optimize the default value of hoodie.memory.merge.max.size option

2020-04-20 Thread lamber-ken (Jira)
lamber-ken created HUDI-818:
---

 Summary: Optimize the default value of 
hoodie.memory.merge.max.size option
 Key: HUDI-818
 URL: https://issues.apache.org/jira/browse/HUDI-818
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Performance
Reporter: lamber-ken


The default value of the hoodie.memory.merge.max.size option is incapable of 
meeting some users' performance requirements.

[https://github.com/apache/incubator-hudi/issues/1491]
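For reference, the workaround users in the linked issue apply is simply to raise the budget per write; only the option name comes from this issue, and the 2 GB figure below is arbitrary.

{code:java}
import org.apache.spark.sql.DataFrameWriter;
import org.apache.spark.sql.Row;

public class MergeMemorySketch {
  // Illustrative only: give the merge a larger in-memory budget before it
  // starts spilling, instead of relying on the small default.
  public static DataFrameWriter<Row> raiseMergeMemory(DataFrameWriter<Row> writer) {
    return writer.option("hoodie.memory.merge.max.size",
        String.valueOf(2L * 1024 * 1024 * 1024)); // 2 GB
  }
}
{code}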



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-04-16 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085345#comment-17085345
 ] 

lamber-ken commented on HUDI-716:
-

[https://github.com/apache/incubator-hudi/pull/1432]

> Exception: Not an Avro data file when running HoodieCleanClient.runClean
> 
>
> Key: HUDI-716
> URL: https://issues.apache.org/jira/browse/HUDI-716
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-03-21-02-45-25-099.png, 
> image-2020-03-21-13-37-17-039.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Just upgraded to upstream master from 0.5 and seeing an issue at the end of 
> the delta sync run: 
> 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync 
> once. Shutting down
> org.apache.hudi.exception.HoodieIOException: Not an Avro data file
> at org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144)
> at org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
> at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
> at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
> at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86)
> at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843)
> at org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
> at org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
> at org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
> at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395)
> at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
> at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
> at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
> at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.io.IOException: Not an Avro data file
> at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
> at org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
> at org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87)
> at org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141)
> ... 24 more
>  
> It is attempting to read an old cleanup file (2 months old) and crashing.
>  
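The crash comes from deserializing a leftover file that is not a complete Avro container. A hypothetical defensive probe (illustrative only, not necessarily what the linked pull request does) would check the file before handing it to the metadata reader:

{code:java}
import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroFileGuard {
  // Returns false for zero-length or truncated files, which are exactly the
  // inputs that raise "Not an Avro data file" when opened.
  public static boolean isReadableAvro(File f) {
    if (f.length() == 0) {
      return false; // empty leftover from an interrupted write
    }
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(f, new GenericDatumReader<GenericRecord>())) {
      return true; // magic bytes and header parsed fine
    } catch (IOException e) {
      return false; // includes "Not an Avro data file"
    }
  }
}
{code}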



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-765) Implement OrcReaderIterator

2020-04-15 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-765:
---

Assignee: (was: lamber-ken)

> Implement OrcReaderIterator
> ---
>
> Key: HUDI-765
> URL: https://issues.apache.org/jira/browse/HUDI-765
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-763) Add hoodie.table.base.file.format option to hoodie.properties file

2020-04-13 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-763:

Summary: Add hoodie.table.base.file.format option to hoodie.properties file 
 (was: Add storage type option to hoodie.properties file)

> Add hoodie.table.base.file.format option to hoodie.properties file
> --
>
> Key: HUDI-763
> URL: https://issues.apache.org/jira/browse/HUDI-763
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Add an option like "hoodie.table.storage.type=ORC" to the hoodie.properties file.
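Since hoodie.properties is a plain java.util.Properties file under the table's .hoodie/ directory, the change amounts to persisting one more key; a rough sketch (the path and comment string are illustrative):

{code:java}
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Properties;

public class TablePropsSketch {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    try (FileInputStream in = new FileInputStream(".hoodie/hoodie.properties")) {
      props.load(in);
    }
    props.setProperty("hoodie.table.storage.type", "ORC"); // key proposed above
    try (FileOutputStream out = new FileOutputStream(".hoodie/hoodie.properties")) {
      props.store(out, "Updated table config");
    }
  }
}
{code}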



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-787) Implement HoodieGlobalBloomIndexV2

2020-04-11 Thread lamber-ken (Jira)
lamber-ken created HUDI-787:
---

 Summary: Implement HoodieGlobalBloomIndexV2
 Key: HUDI-787
 URL: https://issues.apache.org/jira/browse/HUDI-787
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
  Components: Index
Reporter: lamber-ken


Implement HoodieGlobalBloomIndexV2 based on HoodieBloomIndexV2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-57) [UMBRELLA] Support ORC Storage

2020-04-08 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-57?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078643#comment-17078643
 ] 

lamber-ken commented on HUDI-57:


[~vinoth], I'm not sure.

 
||task||version||
|1. add storage type|0.6.0|
|2. support write|0.6.0|
|3. support reader|0.6.0|
|4. hive / spark query|maybe 0.6.1 / 0.6.2|

 

BTW, [~garyli1019] is also interested in this; we will work together to 
accelerate the implementation process.

 

*PS:* 
 * I'm working on BloomV2
 * [~garyli1019] working on spark realtime query

 

We will come back here soon; any suggestions are welcome :)

> [UMBRELLA] Support ORC Storage
> --
>
> Key: HUDI-57
> URL: https://issues.apache.org/jira/browse/HUDI-57
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Hive Integration, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> [https://github.com/uber/hudi/issues/68]
> https://github.com/uber/hudi/issues/155



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-132) Automate doc update/deploy process

2020-04-06 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076608#comment-17076608
 ] 

lamber-ken commented on HUDI-132:
-

Hi [~vinoth], this jira looks the same as 
https://issues.apache.org/jira/browse/HUDI-504, doesn't it?

> Automate doc update/deploy process
> --
>
> Key: HUDI-132
> URL: https://issues.apache.org/jira/browse/HUDI-132
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Release & Administrative
>Reporter: Vinoth Chandar
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>
> The current docs (i.e. the content powering hudi.apache.org) build, test, and 
> deploy process is described at 
> [https://github.com/apache/incubator-hudi/tree/asf-site]. 
> It's a two-step process (1. change .md/template files, 2. generate the site 
> and upload) for making any changes. It would be nice to have GitHub Actions 
> automation to deploy the docs on the `asf-site` branch, so that devs can just 
> edit the docs and the merge will build and deploy the site automatically.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-765) Implement OrcReaderIterator

2020-04-06 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-765:
---

Assignee: lamber-ken

> Implement OrcReaderIterator
> ---
>
> Key: HUDI-765
> URL: https://issues.apache.org/jira/browse/HUDI-765
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-765) Implement OrcReaderIterator

2020-04-06 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-765:

Parent: HUDI-57
Issue Type: Sub-task  (was: New Feature)

> Implement OrcReaderIterator
> ---
>
> Key: HUDI-765
> URL: https://issues.apache.org/jira/browse/HUDI-765
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-765) Implement OrcReaderIterator

2020-04-06 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-765:

Status: Open  (was: New)

> Implement OrcReaderIterator
> ---
>
> Key: HUDI-765
> URL: https://issues.apache.org/jira/browse/HUDI-765
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: lamber-ken
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-763) Add storage type option to hoodie.properties file

2020-04-06 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-763:

Priority: Major  (was: Critical)

> Add storage type option to hoodie.properties file
> -
>
> Key: HUDI-763
> URL: https://issues.apache.org/jira/browse/HUDI-763
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: lamber-ken
>Priority: Major
>
> Add an option like "hoodie.table.storage.type=ORC" to hoodie.properties file



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-765) Implement OrcReaderIterator

2020-04-06 Thread lamber-ken (Jira)
lamber-ken created HUDI-765:
---

 Summary: Implement OrcReaderIterator
 Key: HUDI-765
 URL: https://issues.apache.org/jira/browse/HUDI-765
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
Reporter: lamber-ken
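
A minimal sketch of the idea, assuming the org.apache.orc Reader/RecordReader API 
and simplified to a single BIGINT column (the real iterator would mirror the 
existing ParquetReaderIterator and translate whole rows back into Avro records):
{code:java}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class OrcReaderIterator implements Iterator<Long> {

  private final RecordReader rows;
  private final VectorizedRowBatch batch;
  private int rowInBatch = 0;

  public OrcReaderIterator(Reader orcReader) throws IOException {
    this.rows = orcReader.rows();
    this.batch = orcReader.getSchema().createRowBatch();
  }

  @Override
  public boolean hasNext() {
    try {
      // refill lazily: pull the next batch only when the current one is drained
      if (rowInBatch >= batch.size) {
        rowInBatch = 0;
        return rows.nextBatch(batch);
      }
      return true;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  @Override
  public Long next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    // single BIGINT column kept for illustration; real code maps every column type
    return ((LongColumnVector) batch.cols[0]).vector[rowInBatch++];
  }
}
{code}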






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-764) Implement HoodieOrcWriter

2020-04-06 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-764:
---

Assignee: lamber-ken

> Implement HoodieOrcWriter
> -
>
> Key: HUDI-764
> URL: https://issues.apache.org/jira/browse/HUDI-764
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>
> Implement HoodieOrcWriter
> * Convert the Avro schema to an ORC schema
> * Write records row by row



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-764) Implement HoodieOrcWriter

2020-04-06 Thread lamber-ken (Jira)
lamber-ken created HUDI-764:
---

 Summary: Implement HoodieOrcWriter
 Key: HUDI-764
 URL: https://issues.apache.org/jira/browse/HUDI-764
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Storage Management
Reporter: lamber-ken


Implement HoodieOrcWriter
* Convert the Avro schema to an ORC schema
* Write records row by row
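
A minimal sketch of the two steps above, assuming the org.apache.orc writer API 
and restricting the mapping to flat STRING/LONG fields (the real implementation 
has to cover the full Avro type system, nesting, and nulls):
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWriterSketch {

  // step 1: map an Avro record schema onto an ORC TypeDescription
  static TypeDescription toOrcSchema(Schema avroSchema) {
    TypeDescription orc = TypeDescription.createStruct();
    for (Schema.Field field : avroSchema.getFields()) {
      switch (field.schema().getType()) {
        case STRING: orc.addField(field.name(), TypeDescription.createString()); break;
        case LONG:   orc.addField(field.name(), TypeDescription.createLong()); break;
        default: throw new IllegalArgumentException("unsupported: " + field.schema());
      }
    }
    return orc;
  }

  // step 2: write one record as one row, assuming column 0 is STRING and
  // column 1 is LONG (matching toOrcSchema above); flush full batches
  static void writeRecord(Writer writer, VectorizedRowBatch batch, GenericRecord record)
      throws IOException {
    int row = batch.size++;
    ((BytesColumnVector) batch.cols[0])
        .setVal(row, record.get(0).toString().getBytes(StandardCharsets.UTF_8));
    ((LongColumnVector) batch.cols[1]).vector[row] = (Long) record.get(1);
    if (batch.size == batch.getMaxSize()) {
      writer.addRowBatch(batch);
      batch.reset();
    }
  }
}
{code}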



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-764) Implement HoodieOrcWriter

2020-04-06 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-764:

Status: Open  (was: New)

> Implement HoodieOrcWriter
> -
>
> Key: HUDI-764
> URL: https://issues.apache.org/jira/browse/HUDI-764
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: lamber-ken
>Priority: Critical
>
> Implement HoodieOrcWriter
> * Convert the Avro schema to an ORC schema
> * Write records row by row



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-763) Add storage type option to hoodie.properties file

2020-04-06 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-763:

Status: Open  (was: New)

> Add storage type option to hoodie.properties file
> -
>
> Key: HUDI-763
> URL: https://issues.apache.org/jira/browse/HUDI-763
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: lamber-ken
>Priority: Critical
>
> Add an option like "hoodie.table.storage.type=ORC" to hoodie.properties file



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-763) Add storage type option to hoodie.properties file

2020-04-06 Thread lamber-ken (Jira)
lamber-ken created HUDI-763:
---

 Summary: Add storage type option to hoodie.properties file
 Key: HUDI-763
 URL: https://issues.apache.org/jira/browse/HUDI-763
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Storage Management
Reporter: lamber-ken


Add an option like "hoodie.table.storage.type=ORC" to hoodie.properties file
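
For illustration only, hoodie.properties would then carry the storage type next 
to the existing entries (the other keys shown are typical values, not part of 
this proposal):
{code}
hoodie.table.name=sample_table
hoodie.table.type=COPY_ON_WRITE
# proposed new entry
hoodie.table.storage.type=ORC
{code}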



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-722) IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when writing parquet

2020-04-01 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073006#comment-17073006
 ] 

lamber-ken commented on HUDI-722:
-

IMO, it's hard to debug this issue because the writes do not always fail. I am 
sorry that I cannot do a f2f session because of network limitations. :(

> IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when 
> writing parquet
> -
>
> Key: HUDI-722
> URL: https://issues.apache.org/jira/browse/HUDI-722
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>
> Some writes fail with java.lang.IndexOutOfBoundsException : Invalid array 
> range: X to X inside MessageColumnIORecordConsumer.addBinary call.
> Specifically: getColumnWriter().write(value, r[currentLevel], 
> currentColumnIO.getDefinitionLevel());
> fails as size of r is the same as current level. What can be causing it?
>  
> It gets executed via: ParquetWriter.write(IndexedRecord) Library version: 
> 1.10.1 Avro is a very complex object (~2.5k columns, highly nested, arrays of 
> unions present).
> But what is surprising is that it fails to write top level field: 
> PrimitiveColumnIO _hoodie_commit_time r:0 d:1 [_hoodie_commit_time] which is 
> the first top level field in Avro: {"_hoodie_commit_time": "20200317215711", 
> "_hoodie_commit_seqno": "20200317215711_0_650",



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-718) java.lang.ClassCastException during upsert

2020-04-01 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072975#comment-17072975
 ] 

lamber-ken commented on HUDI-718:
-

IMO, hudi-0.5 depends on avro-1.7.0 while hudi-0.5.2 depends on avro-1.8.2, so 
they may not be compatible.
As another way to solve this issue, please try replacing the "fixed" type with 
the "string" type.
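
As an illustration of that workaround (the field name below is hypothetical), the 
change amounts to declaring the field in the Avro schema as a plain string 
instead of a fixed type:
{code}
// before: avro "fixed" is written as GenericFixed by parquet-avro
{"name": "key", "type": {"type": "fixed", "name": "KeyFixed", "size": 16}}

// after: a plain string round-trips as Utf8, so the GenericFixed cast never happens
{"name": "key", "type": "string"}
{code}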

 

> java.lang.ClassCastException during upsert
> --
>
> Key: HUDI-718
> URL: https://issues.apache.org/jira/browse/HUDI-718
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-03-21-16-49-28-905.png
>
>
> Dataset was created using hudi 0.5 and now trying to migrate it to the latest 
> master. The table is written using SqlTransformer. Exception:
>  
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge 
> old record into new file for key bla.bla from old file 
> gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet
>  to new file 
> gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be 
> cast to org.apache.avro.generic.GenericFixed
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
>  at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
>  at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
>  at 
> org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242)
>  ... 8 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-718) java.lang.ClassCastException during upsert

2020-04-01 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072874#comment-17072874
 ] 

lamber-ken edited comment on HUDI-718 at 4/1/20, 4:46 PM:
--

hi [~afilipchik], can you share the schema of old parquet file? and the type of 
bla.bla field?


was (Author: lamber-ken):
hi [~afilipchik], can you share the schema of old parquet file? 

> java.lang.ClassCastException during upsert
> --
>
> Key: HUDI-718
> URL: https://issues.apache.org/jira/browse/HUDI-718
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-03-21-16-49-28-905.png
>
>
> Dataset was created using hudi 0.5 and now trying to migrate it to the latest 
> master. The table is written using SqlTransformer. Exception:
>  
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge 
> old record into new file for key bla.bla from old file 
> gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet
>  to new file 
> gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be 
> cast to org.apache.avro.generic.GenericFixed
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
>  at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
>  at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
>  at 
> org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242)
>  ... 8 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-718) java.lang.ClassCastException during upsert

2020-04-01 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072874#comment-17072874
 ] 

lamber-ken edited comment on HUDI-718 at 4/1/20, 4:46 PM:
--

hi [~afilipchik], can you share the schema of old parquet file? and what's the 
type of bla.bla field?


was (Author: lamber-ken):
hi [~afilipchik], can you share the schema of old parquet file? and the type of 
bla.bla field?

> java.lang.ClassCastException during upsert
> --
>
> Key: HUDI-718
> URL: https://issues.apache.org/jira/browse/HUDI-718
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-03-21-16-49-28-905.png
>
>
> Dataset was created using hudi 0.5 and now trying to migrate it to the latest 
> master. The table is written using SqlTransformer. Exception:
>  
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge 
> old record into new file for key bla.bla from old file 
> gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet
>  to new file 
> gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be 
> cast to org.apache.avro.generic.GenericFixed
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
>  at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
>  at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
>  at 
> org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242)
>  ... 8 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-718) java.lang.ClassCastException during upsert

2020-04-01 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072874#comment-17072874
 ] 

lamber-ken commented on HUDI-718:
-

hi [~afilipchik], can you share the schema of old parquet file? 

> java.lang.ClassCastException during upsert
> --
>
> Key: HUDI-718
> URL: https://issues.apache.org/jira/browse/HUDI-718
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-03-21-16-49-28-905.png
>
>
> Dataset was created using hudi 0.5 and now trying to migrate it to the latest 
> master. The table is written using SqlTransformer. Exception:
>  
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge 
> old record into new file for key bla.bla from old file 
> gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet
>  to new file 
> gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be 
> cast to org.apache.avro.generic.GenericFixed
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
>  at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
>  at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
>  at 
> org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242)
>  ... 8 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-723) SqlTransformer's schema sometimes is not registered.

2020-04-01 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-723:
---

Assignee: (was: lamber-ken)

> SqlTransformer's schema sometimes is not registered. 
> -
>
> Key: HUDI-723
> URL: https://issues.apache.org/jira/browse/HUDI-723
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> If the schema is inferred from RowBasedSchemaProvider when the SQL transformer 
> is used, it also needs to be registered. 
>  
> The current way only works if the SchemaProvider has a valid target schema. If 
> one wants to use the schema from the SQL transformation, the result of 
> RowBasedSchemaProvider.getTargetSchema needs to be passed into something like:
> {code:java}
> private void setupWriteClient(SchemaProvider schemaProvider) {
>   LOG.info("Setting up Hoodie Write Client");
>   registerAvroSchemas(schemaProvider);
>   HoodieWriteConfig hoodieCfg = getHoodieClientConfig(schemaProvider);
>   writeClient = new HoodieWriteClient<>(jssc, hoodieCfg, true);
>   onInitializingHoodieWriteClient.apply(writeClient);
> }
> {code}
> The existing method will not work, as it is checking for:
> {code:java}
> if ((null != schemaProvider) && (null == writeClient)) {
> {code}
> and writeClient is already configured. 
>  
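
A sketch of the suggested flow (the method names are the ones quoted in the 
description above; this is an assumption, not the merged fix): register the 
transformer-derived schema explicitly instead of relying on the writeClient guard.
{code:java}
// rowBasedSchemaProvider carries the schema inferred from the transformed Dataset<Row>
SchemaProvider transformedProvider = rowBasedSchemaProvider;
registerAvroSchemas(transformedProvider);  // force registration even if writeClient exists
setupWriteClient(transformedProvider);     // then (re)configure the write client
{code}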



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-30 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken resolved HUDI-716.
-
Resolution: Fixed

hi [~afilipchik], this is fixed on the master branch now

> Exception: Not an Avro data file when running HoodieCleanClient.runClean
> 
>
> Key: HUDI-716
> URL: https://issues.apache.org/jira/browse/HUDI-716
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-03-21-02-45-25-099.png, 
> image-2020-03-21-13-37-17-039.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Just upgraded to upstream master from 0.5 and seeing an issue at the end of 
> the delta sync run: 
> 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync 
> once. Shutting down
> org.apache.hudi.exception.HoodieIOException: Not an Avro data file at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144) 
> at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) 
> at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86) 
> at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843) 
> at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: 
> java.io.IOException: Not an Avro data file at 
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
>  at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87) 
> at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141) 
> ... 24 more
>  
> It is attempting to read an old cleanup file (2 months old) and crashing
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-732) Generate site to content folder

2020-03-29 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken resolved HUDI-732.
-
Resolution: Fixed

> Generate site to content folder
> ---
>
> Key: HUDI-732
> URL: https://issues.apache.org/jira/browse/HUDI-732
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Docs
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Remove test-content && Generate site to content



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-03-29 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-686:

Status: Open  (was: New)

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>
> The main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-742) Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread lamber-ken (Jira)
lamber-ken created HUDI-742:
---

 Summary: Fix java.lang.NoSuchMethodError: 
java.lang.Math.floorMod(JI)I
 Key: HUDI-742
 URL: https://issues.apache.org/jira/browse/HUDI-742
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Spark Integration
Reporter: lamber-ken


*ISSUE* : https://github.com/apache/incubator-hudi/issues/1455
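
For context, a sketch of the cause and a JDK 8-safe rewrite (an assumption, not 
necessarily the committed fix): the descriptor "(JI)I" in the error is 
Math.floorMod(long, int), an overload that only exists since JDK 9, so bytecode 
built against a newer JDK fails at runtime on JDK 8. The (long, long) overload 
has been available since JDK 8 and computes the same value:
{code:java}
long hashOfKey = 0x9E3779B97F4A7C15L;  // hypothetical record-key hash
int numPartitions = 44;                // hypothetical partitioner size
// widen the divisor instead of calling the JDK 9-only floorMod(long, int)
int partition = (int) Math.floorMod(hashOfKey, (long) numPartitions);
{code}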

{code:java}
at org.apache.hudi.client.HoodieWriteClient.upsert(HoodieWriteClient.java:193)
at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:144)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:83)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
... 49 elided
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 44 in stage 11.0 failed 4 times, most recent failure: Lost task 44.3 in 
stage 11.0 (TID 975, ip-10-81-135-85.ec2.internal, executor 6): 
java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
at 
org.apache.hudi.index.bloom.BucketizedBloomCheckPartitioner.getPartition(BucketizedBloomCheckPartitioner.java:148)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2262)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2211)
at 

[jira] [Updated] (HUDI-742) Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-742:

Status: Open  (was: New)

> Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
> -
>
> Key: HUDI-742
> URL: https://issues.apache.org/jira/browse/HUDI-742
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: lamber-ken
>Priority: Major
>
> *ISSUE* : https://github.com/apache/incubator-hudi/issues/1455
> {code:java}
> at org.apache.hudi.client.HoodieWriteClient.upsert(HoodieWriteClient.java:193)
> at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:144)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
> at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:83)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> ... 49 elided
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 44 in stage 11.0 failed 4 times, most recent failure: Lost task 44.3 in 
> stage 11.0 (TID 975, ip-10-81-135-85.ec2.internal, executor 6): 
> java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
> at 
> org.apache.hudi.index.bloom.BucketizedBloomCheckPartitioner.getPartition(BucketizedBloomCheckPartitioner.java:148)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
> at 
> 

[jira] [Updated] (HUDI-723) SqlTransformer's schema sometimes is not registered.

2020-03-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-723:

Status: Open  (was: New)

> SqlTransformer's schema sometimes is not registered. 
> -
>
> Key: HUDI-723
> URL: https://issues.apache.org/jira/browse/HUDI-723
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>
> If the schema is inferred from RowBasedSchemaProvider when the SQL transformer 
> is used, it also needs to be registered. 
>  
> The current way only works if the SchemaProvider has a valid target schema. If 
> one wants to use the schema from the SQL transformation, the result of 
> RowBasedSchemaProvider.getTargetSchema needs to be passed into something like:
> {code:java}
> private void setupWriteClient(SchemaProvider schemaProvider) {
>   LOG.info("Setting up Hoodie Write Client");
>   registerAvroSchemas(schemaProvider);
>   HoodieWriteConfig hoodieCfg = getHoodieClientConfig(schemaProvider);
>   writeClient = new HoodieWriteClient<>(jssc, hoodieCfg, true);
>   onInitializingHoodieWriteClient.apply(writeClient);
> }
> {code}
> The existing method will not work, as it is checking for:
> {code:java}
> if ((null != schemaProvider) && (null == writeClient)) {
> {code}
> and writeClient is already configured. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-723) SqlTransformer's schema sometimes is not registered.

2020-03-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-723:
---

Assignee: lamber-ken

> SqlTransformer's schema sometimes is not registered. 
> -
>
> Key: HUDI-723
> URL: https://issues.apache.org/jira/browse/HUDI-723
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>
> If the schema is inferred from RowBasedSchemaProvider when the SQL transformer 
> is used, it also needs to be registered. 
>  
> The current way only works if the SchemaProvider has a valid target schema. If 
> one wants to use the schema from the SQL transformation, the result of 
> RowBasedSchemaProvider.getTargetSchema needs to be passed into something like:
> {code:java}
> private void setupWriteClient(SchemaProvider schemaProvider) {
>   LOG.info("Setting up Hoodie Write Client");
>   registerAvroSchemas(schemaProvider);
>   HoodieWriteConfig hoodieCfg = getHoodieClientConfig(schemaProvider);
>   writeClient = new HoodieWriteClient<>(jssc, hoodieCfg, true);
>   onInitializingHoodieWriteClient.apply(writeClient);
> }
> {code}
> The existing method will not work, as it is checking for:
> {code:java}
> if ((null != schemaProvider) && (null == writeClient)) {
> {code}
> and writeClient is already configured. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-718) java.lang.ClassCastException during upsert

2020-03-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-718:

Status: Open  (was: New)

> java.lang.ClassCastException during upsert
> --
>
> Key: HUDI-718
> URL: https://issues.apache.org/jira/browse/HUDI-718
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-03-21-16-49-28-905.png
>
>
> Dataset was created using hudi 0.5 and now trying to migrate it to the latest 
> master. The table is written using SqlTransformer. Exception:
>  
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge 
> old record into new file for key bla.bla from old file 
> gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet
>  to new file 
> gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be 
> cast to org.apache.avro.generic.GenericFixed
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
>  at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
>  at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
>  at 
> org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242)
>  ... 8 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-718) java.lang.ClassCastException during upsert

2020-03-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-718:
---

Assignee: lamber-ken

> java.lang.ClassCastException during upsert
> --
>
> Key: HUDI-718
> URL: https://issues.apache.org/jira/browse/HUDI-718
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-03-21-16-49-28-905.png
>
>
> Dataset was created using hudi 0.5 and now trying to migrate it to the latest 
> master. The table is written using SqlTransformer. Exception:
>  
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge 
> old record into new file for key bla.bla from old file 
> gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet
>  to new file 
> gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be 
> cast to org.apache.avro.generic.GenericFixed
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
>  at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
>  at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
>  at 
> org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242)
>  ... 8 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-722) IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when writing parquet

2020-03-27 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069160#comment-17069160
 ] 

lamber-ken commented on HUDI-722:
-

Sure. Hi [~afilipchik], if you have time, can you share more context about it? 
E.g. demo code, thanks.

> IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when 
> writing parquet
> -
>
> Key: HUDI-722
> URL: https://issues.apache.org/jira/browse/HUDI-722
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> Some writes fail with java.lang.IndexOutOfBoundsException : Invalid array 
> range: X to X inside MessageColumnIORecordConsumer.addBinary call.
> Specifically: getColumnWriter().write(value, r[currentLevel], 
> currentColumnIO.getDefinitionLevel());
> fails as size of r is the same as current level. What can be causing it?
>  
> It gets executed via: ParquetWriter.write(IndexedRecord) Library version: 
> 1.10.1 Avro is a very complex object (~2.5k columns, highly nested, arrays of 
> unions present).
> But what is surprising is that it fails to write top level field: 
> PrimitiveColumnIO _hoodie_commit_time r:0 d:1 [_hoodie_commit_time] which is 
> the first top level field in Avro: {"_hoodie_commit_time": "20200317215711", 
> "_hoodie_commit_seqno": "20200317215711_0_650",



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-722) IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when writing parquet

2020-03-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-722:
---

Assignee: lamber-ken

> IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when 
> writing parquet
> -
>
> Key: HUDI-722
> URL: https://issues.apache.org/jira/browse/HUDI-722
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>
> Some writes fail with java.lang.IndexOutOfBoundsException : Invalid array 
> range: X to X inside MessageColumnIORecordConsumer.addBinary call.
> Specifically: getColumnWriter().write(value, r[currentLevel], 
> currentColumnIO.getDefinitionLevel());
> fails as size of r is the same as current level. What can be causing it?
>  
> It gets executed via: ParquetWriter.write(IndexedRecord) Library version: 
> 1.10.1 Avro is a very complex object (~2.5k columns, highly nested, arrays of 
> unions present).
> But what is surprising is that it fails to write top level field: 
> PrimitiveColumnIO _hoodie_commit_time r:0 d:1 [_hoodie_commit_time] which is 
> the first top level field in Avro: {"_hoodie_commit_time": "20200317215711", 
> "_hoodie_commit_seqno": "20200317215711_0_650",



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-722) IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when writing parquet

2020-03-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-722:

Status: Open  (was: New)

> IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when 
> writing parquet
> -
>
> Key: HUDI-722
> URL: https://issues.apache.org/jira/browse/HUDI-722
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> Some writes fail with java.lang.IndexOutOfBoundsException : Invalid array 
> range: X to X inside MessageColumnIORecordConsumer.addBinary call.
> Specifically: getColumnWriter().write(value, r[currentLevel], 
> currentColumnIO.getDefinitionLevel());
> fails as size of r is the same as current level. What can be causing it?
>  
> It gets executed via: ParquetWriter.write(IndexedRecord) Library version: 
> 1.10.1 Avro is a very complex object (~2.5k columns, highly nested, arrays of 
> unions present).
> But what is surprising is that it fails to write top level field: 
> PrimitiveColumnIO _hoodie_commit_time r:0 d:1 [_hoodie_commit_time] which is 
> the first top level field in Avro: {"_hoodie_commit_time": "20200317215711", 
> "_hoodie_commit_seqno": "20200317215711_0_650",



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-734) Fix error: cannot create directory ‘test-content’: File exists

2020-03-25 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken resolved HUDI-734.
-
Resolution: Fixed

> Fix error: cannot create directory ‘test-content’: File exists
> --
>
> Key: HUDI-734
> URL: https://issues.apache.org/jira/browse/HUDI-734
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Docs
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Fix error: cannot create directory ‘test-content’: File exists



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-734) Fix error: cannot create directory ‘test-content’: File exists

2020-03-25 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-734:

Status: Open  (was: New)

> Fix error: cannot create directory ‘test-content’: File exists
> --
>
> Key: HUDI-734
> URL: https://issues.apache.org/jira/browse/HUDI-734
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Docs
>Reporter: lamber-ken
>Priority: Major
>
> Fix error: cannot create directory ‘test-content’: File exists



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-734) Fix error: cannot create directory ‘test-content’: File exists

2020-03-25 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-734:
---

Assignee: lamber-ken

> Fix error: cannot create directory ‘test-content’: File exists
> --
>
> Key: HUDI-734
> URL: https://issues.apache.org/jira/browse/HUDI-734
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Docs
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Fix error: cannot create directory ‘test-content’: File exists



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-734) Fix error: cannot create directory ‘test-content’: File exists

2020-03-25 Thread lamber-ken (Jira)
lamber-ken created HUDI-734:
---

 Summary: Fix error: cannot create directory ‘test-content’: File 
exists
 Key: HUDI-734
 URL: https://issues.apache.org/jira/browse/HUDI-734
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Docs
Reporter: lamber-ken


Fix error: cannot create directory ‘test-content’: File exists



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-732) Generate site to content folder

2020-03-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-732:

Status: Open  (was: New)

> Generate site to content folder
> ---
>
> Key: HUDI-732
> URL: https://issues.apache.org/jira/browse/HUDI-732
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Docs
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Remove test-content && Generate site to content



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-732) Generate site to content folder

2020-03-23 Thread lamber-ken (Jira)
lamber-ken created HUDI-732:
---

 Summary: Generate site to content folder
 Key: HUDI-732
 URL: https://issues.apache.org/jira/browse/HUDI-732
 Project: Apache Hudi (incubating)
  Issue Type: Task
  Components: Docs
Reporter: lamber-ken


Remove test-content && Generate site to content



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-732) Generate site to content folder

2020-03-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-732:
---

Assignee: lamber-ken

> Generate site to content folder
> ---
>
> Key: HUDI-732
> URL: https://issues.apache.org/jira/browse/HUDI-732
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Docs
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Remove test-content && Generate site to content



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-03-23 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065030#comment-17065030
 ] 

lamber-ken edited comment on HUDI-686 at 3/24/20, 5:41 AM:
---

right, this is a nice design; some thoughts:
 * if the input data is large, we need to increase the partitions; "candidates" 
contains all the data for each partition
 * if we increase the partitions, it will cause duplicate loading of the same 
partition (e.g. populateFileIDs() && populateRangeAndBloomFilters())

[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]
{code:java}
@Override
public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD,
JavaSparkContext jsc,
HoodieTable<T> hoodieTable) {
  return recordRDD.sortBy((record) -> String.format("%s-%s", 
record.getPartitionPath(), record.getRecordKey()),
  true, config.getBloomIndexV2Parallelism())
  .mapPartitions((itr) -> new LazyRangeBloomChecker(itr, hoodieTable))
  .flatMap(List::iterator)
  .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
  .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
  .filter(Option::isPresent)
  .map(Option::get);
}
{code}
{code:java}
private void initIfNeeded(String partitionPath) throws IOException {
  if (!Objects.equals(partitionPath, currentPartitionPath)) {
cleanup();
this.currentPartitionPath = partitionPath;
populateFileIDs();
populateRangeAndBloomFilters();
  }
}{code}


was (Author: lamber-ken):
right, this is a nice design; some thoughts:
 * if the input data is large, we need to increase the partitions; "candidates" 
contains all the partition data
 * if we increase the partitions, it will cause duplicate loading of the same 
partition (e.g. populateFileIDs() && populateRangeAndBloomFilters())

[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]
{code:java}
@Override
public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD,
JavaSparkContext jsc,
HoodieTable<T> hoodieTable) {
  return recordRDD.sortBy((record) -> String.format("%s-%s", 
record.getPartitionPath(), record.getRecordKey()),
  true, config.getBloomIndexV2Parallelism())
  .mapPartitions((itr) -> new LazyRangeBloomChecker(itr, hoodieTable))
  .flatMap(List::iterator)
  .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
  .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
  .filter(Option::isPresent)
  .map(Option::get);
}
{code}
{code:java}
private void initIfNeeded(String partitionPath) throws IOException {
  if (!Objects.equals(partitionPath, currentPartitionPath)) {
cleanup();
this.currentPartitionPath = partitionPath;
populateFileIDs();
populateRangeAndBloomFilters();
  }
}{code}

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>
> The main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-the-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-504) Restructuring and auto-generation of docs

2020-03-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken closed HUDI-504.
---
Resolution: Fixed

> Restructuring and auto-generation of docs
> -
>
> Key: HUDI-504
> URL: https://issues.apache.org/jira/browse/HUDI-504
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Ethan Guo
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> RFC-10: Restructuring and auto-generation of docs
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+10+%3A+Restructuring+and+auto-generation+of+docs]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-646) Re-enable TestUpdateSchemaEvolution after triaging weird CI issue

2020-03-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken closed HUDI-646.
---
Resolution: Fixed

> Re-enable TestUpdateSchemaEvolution after triaging weird CI issue
> -
>
> Key: HUDI-646
> URL: https://issues.apache.org/jira/browse/HUDI-646
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Testing
>Reporter: Vinoth Chandar
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> https://github.com/apache/incubator-hudi/pull/1346/commits/5b20891619380a66e2a62c9e57fb28c4f5ed948b
>  undo this
> {code}
> Job aborted due to stage failure: Task 7 in stage 1.0 failed 1 times, most 
> recent failure: Lost task 7.0 in stage 1.0 (TID 15, localhost, executor 
> driver): org.apache.parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block -1 in file 
> file:/tmp/junit3406952253616234024/2016/01/31/f1-0_7-0-7_100.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
>   at 
> org.apache.hudi.common.util.ParquetUtils.readAvroRecords(ParquetUtils.java:190)
>   at 
> org.apache.hudi.client.TestUpdateSchemaEvolution.lambda$testSchemaEvolutionOnUpdate$dfb2f24e$1(TestUpdateSchemaEvolution.java:123)
>   at 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1334)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.UnsupportedOperationException: Byte-buffer read 
> unsupported by input stream
>   at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:146)
>   at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:143)
>   at 
> org.apache.parquet.hadoop.util.H2SeekableInputStream$H2Reader.read(H2SeekableInputStream.java:81)
>   at 
> org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:90)
>   at 
> org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:75)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
>   ... 29 more
> {code}
> Only happens on Travis. Locally it succeeded over 5000 times individually. And 
> the 

[jira] [Commented] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-03-23 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065030#comment-17065030
 ] 

lamber-ken commented on HUDI-686:
-

Right, this is a nice design, some thoughts:
 * if the input data is large, we need to increase the number of partitions, 
since "candidates" holds all the data for each partition
 * if we increase the number of partitions, it causes duplicate loading of the 
same partition (e.g. populateFileIDs() && populateRangeAndBloomFilters())

[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]
{code:java}
@Override
public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD,
JavaSparkContext jsc,
HoodieTable<T> hoodieTable) {
  return recordRDD.sortBy((record) -> String.format("%s-%s", 
record.getPartitionPath(), record.getRecordKey()),
  true, config.getBloomIndexV2Parallelism())
  .mapPartitions((itr) -> new LazyRangeBloomChecker(itr, hoodieTable))
  .flatMap(List::iterator)
  .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
  .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
  .filter(Option::isPresent)
  .map(Option::get);
}
{code}
{code:java}
private void initIfNeeded(String partitionPath) throws IOException {
  if (!Objects.equals(partitionPath, currentPartitionPath)) {
cleanup();
this.currentPartitionPath = partitionPath;
populateFileIDs();
populateRangeAndBloomFilters();
  }
}{code}

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>
> The main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-the-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-730) Fix the ci error log message

2020-03-21 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-730:
---

Assignee: lamber-ken

> Fix the ci error log message
> 
>
> Key: HUDI-730
> URL: https://issues.apache.org/jira/browse/HUDI-730
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Testing
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> {code:java}
> [ERROR] 2020-03-21 17:30:24,613 
> org.apache.hudi.common.table.view.PriorityBasedFileSystemView  - Got error 
> running preferred function. Trying secondary
> java.lang.RuntimeException
> at 
> org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:168)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
> at 
> org.mockito.internal.runners.JUnit45AndHigherRunnerImpl.run(JUnit45AndHigherRunnerImpl.java:37)
> at org.mockito.runners.MockitoJUnitRunner.run(MockitoJUnitRunner.java:62)
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:367)
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:274)
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:161)
> at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:290)
> at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:242)
> at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:121)
> [WARN ] 2020-03-21 17:30:24,628 
> org.apache.hudi.common.table.view.PriorityBasedFileSystemView  - Routing 
> request to secondary file-system view
> [WARN ] 2020-03-21 17:30:24,630 
> org.apache.hudi.common.table.view.PriorityBasedFileSystemView  - Routing 
> request to secondary file-system view
> [ERROR] 2020-03-21 17:30:24,638 
> org.apache.hudi.common.table.view.PriorityBasedFileSystemView  - Got error 
> running preferred function. Trying secondary
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-730) Fix the ci error log message

2020-03-21 Thread lamber-ken (Jira)
lamber-ken created HUDI-730:
---

 Summary: Fix the ci error log message
 Key: HUDI-730
 URL: https://issues.apache.org/jira/browse/HUDI-730
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Testing
Reporter: lamber-ken


{code:java}
[ERROR] 2020-03-21 17:30:24,613 
org.apache.hudi.common.table.view.PriorityBasedFileSystemView  - Got error 
running preferred function. Trying secondary
java.lang.RuntimeException
at 
org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:168)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
org.mockito.internal.runners.JUnit45AndHigherRunnerImpl.run(JUnit45AndHigherRunnerImpl.java:37)
at org.mockito.runners.MockitoJUnitRunner.run(MockitoJUnitRunner.java:62)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:367)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:274)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:161)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:290)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:242)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:121)
[WARN ] 2020-03-21 17:30:24,628 
org.apache.hudi.common.table.view.PriorityBasedFileSystemView  - Routing 
request to secondary file-system view
[WARN ] 2020-03-21 17:30:24,630 
org.apache.hudi.common.table.view.PriorityBasedFileSystemView  - Routing 
request to secondary file-system view
[ERROR] 2020-03-21 17:30:24,638 
org.apache.hudi.common.table.view.PriorityBasedFileSystemView  - Got error 
running preferred function. Trying secondary
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-730) Fix the ci error log message

2020-03-21 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-730:

Status: Open  (was: New)

> Fix the ci error log message
> 
>
> Key: HUDI-730
> URL: https://issues.apache.org/jira/browse/HUDI-730
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Testing
>Reporter: lamber-ken
>Priority: Major
>
> {code:java}
> [ERROR] 2020-03-21 17:30:24,613 
> org.apache.hudi.common.table.view.PriorityBasedFileSystemView  - Got error 
> running preferred function. Trying secondary
> java.lang.RuntimeException
> at 
> org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:168)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
> at 
> org.mockito.internal.runners.JUnit45AndHigherRunnerImpl.run(JUnit45AndHigherRunnerImpl.java:37)
> at org.mockito.runners.MockitoJUnitRunner.run(MockitoJUnitRunner.java:62)
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:367)
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:274)
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:161)
> at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:290)
> at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:242)
> at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:121)
> [WARN ] 2020-03-21 17:30:24,628 
> org.apache.hudi.common.table.view.PriorityBasedFileSystemView  - Routing 
> request to secondary file-system view
> [WARN ] 2020-03-21 17:30:24,630 
> org.apache.hudi.common.table.view.PriorityBasedFileSystemView  - Routing 
> request to secondary file-system view
> [ERROR] 2020-03-21 17:30:24,638 
> org.apache.hudi.common.table.view.PriorityBasedFileSystemView  - Got error 
> running preferred function. Trying secondary
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-729) Replace JavaSparkContext/SQLContext with SparkSession

2020-03-21 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-729:

Status: In Progress  (was: Open)

> Replace JavaSparkContext/SQLContext with SparkSession
> -
>
> Key: HUDI-729
> URL: https://issues.apache.org/jira/browse/HUDI-729
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Replace JavaSparkContext/SQLContext with SparkSession.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-729) Replace JavaSparkContext/SQLContext with SparkSession

2020-03-21 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-729:

Status: Open  (was: New)

> Replace JavaSparkContext/SQLContext with SparkSession
> -
>
> Key: HUDI-729
> URL: https://issues.apache.org/jira/browse/HUDI-729
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Replace JavaSparkContext/SQLContext with SparkSession.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-729) Replace JavaSparkContext/SQLContext with SparkSession

2020-03-21 Thread lamber-ken (Jira)
lamber-ken created HUDI-729:
---

 Summary: Replace JavaSparkContext/SQLContext with SparkSession
 Key: HUDI-729
 URL: https://issues.apache.org/jira/browse/HUDI-729
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Code Cleanup
Reporter: lamber-ken


Replace JavaSparkContext/SQLContext with SparkSession.
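
As an illustration of the direction (plain Spark APIs only, nothing 
Hudi-specific assumed), the session becomes the single entry point and the 
legacy handles are derived from it where still required:
{code:java}
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;

public class SparkSessionSketch {
  public static void main(String[] args) {
    // One entry point instead of constructing JavaSparkContext/SQLContext directly
    SparkSession spark = SparkSession.builder()
        .appName("hudi-sketch")
        .master("local[2]")
        .getOrCreate();

    // Legacy handles, derived from the session for code paths that still need them
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
    SQLContext sqlContext = spark.sqlContext();

    spark.stop();
  }
}
{code}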



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-729) Replace JavaSparkContext/SQLContext with SparkSession

2020-03-21 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-729:
---

Assignee: lamber-ken

> Replace JavaSparkContext/SQLContext with SparkSession
> -
>
> Key: HUDI-729
> URL: https://issues.apache.org/jira/browse/HUDI-729
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Replace JavaSparkContext/SQLContext with SparkSession.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-401) Remove unnecessary use of spark in savepoint timeline

2020-03-21 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-401:
---

Assignee: hong dongdong

> Remove unnecessary use of spark in savepoint timeline
> -
>
> Key: HUDI-401
> URL: https://issues.apache.org/jira/browse/HUDI-401
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: CLI, Writer Core
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, a JavaSparkContext is initialized when a savepoint is created, but 
> this is not necessary. The JavaSparkContext's only job is to provide the 
> Hadoop config, yet it takes time and resources to initialize. 
> So we can use the Hadoop config directly instead of the jsc; see the sketch below.
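
As a sketch of the idea (the method shape below is hypothetical, not the 
actual patch), the savepoint path only needs a Hadoop Configuration:
{code:java}
import org.apache.hadoop.conf.Configuration;

// Hypothetical before/after shape: only the Hadoop configuration is actually
// consumed, so the JavaSparkContext (and its startup cost) can be dropped.
class SavepointSketch {

  // before: void createSavepoint(JavaSparkContext jsc, String basePath)
  void createSavepoint(Configuration hadoopConf, String basePath) {
    // open the table's file system / timeline with hadoopConf only
  }
}
{code}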



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-719) Exception during clean phase: Found org.apache.hudi.avro.model.HoodieCleanMetadata, expecting org.apache.hudi.avro.model.HoodieCleanerPlan

2020-03-21 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063816#comment-17063816
 ] 

lamber-ken commented on HUDI-719:
-

[~afilipchik] Thanks for reporting these issues during the upgrade from 0.5.0 to 
the master branch. If you are interested in fixing them, you can ask [~vinoth] 
to give you contributor permission. :)
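
For reference, a backwards-compatible read could look roughly like this. This 
is a hedged sketch only, assuming the AvroUtils signature visible in the stack 
trace below; it is not the actual fix:
{code:java}
import java.io.IOException;

import org.apache.avro.AvroTypeException;
import org.apache.hudi.avro.model.HoodieCleanMetadata;
import org.apache.hudi.avro.model.HoodieCleanerPlan;
import org.apache.hudi.common.util.AvroUtils;

class CleanInstantCompatSketch {

  // Try the new schema first; fall back to the one 0.5.0 wrote.
  static Object readCleanInstant(byte[] content) throws IOException {
    try {
      return AvroUtils.deserializeAvroMetadata(content, HoodieCleanerPlan.class);
    } catch (AvroTypeException e) {
      // the file was written by 0.5.0, which stored HoodieCleanMetadata
      return AvroUtils.deserializeAvroMetadata(content, HoodieCleanMetadata.class);
    }
  }
}
{code}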

> Exception during clean phase: Found 
> org.apache.hudi.avro.model.HoodieCleanMetadata, expecting 
> org.apache.hudi.avro.model.HoodieCleanerPlan
> --
>
> Key: HUDI-719
> URL: https://issues.apache.org/jira/browse/HUDI-719
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> Dataset was written using 0.5; now moving to the latest master:
>  
> Exception in thread "main" org.apache.avro.AvroTypeException: Found 
> org.apache.hudi.avro.model.HoodieCleanMetadata, expecting 
> org.apache.hudi.avro.model.HoodieCleanerPlan, missing required field policy
>  at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
>  at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>  at 
> org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130)
>  at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
>  at 
> org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
>  at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>  at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
>  at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
>  at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
>  at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:149)
>  at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87)
>  at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141)
>  at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
>  at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86)
>  at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843)
>  at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:397)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>  at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>  at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>  at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>  at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-718) java.lang.ClassCastException during upsert

2020-03-21 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063813#comment-17063813
 ] 

lamber-ken commented on HUDI-718:
-

Hi [~afilipchik], the master branch uses spark-2.4.4, which depends on 
avro-1.8.2. 

Here is a PR [1] which fixes a similar issue; if you are interested, you can 
give it a try.

[1] [https://github.com/apache/incubator-hudi/pull/1339]

!image-2020-03-21-16-49-28-905.png|width=790,height=631!
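
To illustrate the class-cast itself (a hedged sketch, not the PR's code, and 
assuming the FIXED field is an avro decimal, the common case for this cast): 
with avro 1.8.x, a decimal that arrives as a string has to be converted back 
to the GenericFixed the writer schema declares before parquet-avro can write it.
{code:java}
import java.math.BigDecimal;

import org.apache.avro.Conversions;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericFixed;

class DecimalFixedSketch {

  // Convert a decimal carried as a string into the FIXED value the writer
  // schema expects; handing the raw Utf8 to parquet-avro triggers the
  // ClassCastException seen above.
  static GenericFixed toFixed(String decimalAsString, Schema fixedSchema) {
    LogicalTypes.Decimal decimalType = (LogicalTypes.Decimal) fixedSchema.getLogicalType();
    BigDecimal value = new BigDecimal(decimalAsString).setScale(decimalType.getScale());
    return new Conversions.DecimalConversion().toFixed(value, fixedSchema, decimalType);
  }
}
{code}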

> java.lang.ClassCastException during upsert
> --
>
> Key: HUDI-718
> URL: https://issues.apache.org/jira/browse/HUDI-718
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-03-21-16-49-28-905.png
>
>
> Dataset was created using hudi 0.5 and now trying to migrate it to the latest 
> master. The table is written using SqlTransformer. Exception:
>  
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge 
> old record into new file for key bla.bla from old file 
> gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet
>  to new file 
> gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be 
> cast to org.apache.avro.generic.GenericFixed
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
>  at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
>  at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
>  at 
> org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242)
>  ... 8 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-718) java.lang.ClassCastException during upsert

2020-03-21 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-718:

Attachment: image-2020-03-21-16-49-28-905.png

> java.lang.ClassCastException during upsert
> --
>
> Key: HUDI-718
> URL: https://issues.apache.org/jira/browse/HUDI-718
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-03-21-16-49-28-905.png
>
>
> Dataset was created using hudi 0.5 and now trying to migrate it to the latest 
> master. The table is written using SqlTransformer. Exception:
>  
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge 
> old record into new file for key bla.bla from old file 
> gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet
>  to new file 
> gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be 
> cast to org.apache.avro.generic.GenericFixed
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
>  at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
>  at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
>  at 
> org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242)
>  ... 8 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-20 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063775#comment-17063775
 ] 

lamber-ken commented on HUDI-716:
-

The life cycle of *.clean files (based on hudi-0.5.0): 
[https://github.com/apache/incubator-hudi/blob/release-0.5.0|https://github.com/apache/incubator-hudi/blob/release-0.5.0/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java]

!image-2020-03-21-13-37-17-039.png!
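
A defensive read on top of that lifecycle could look like the sketch below. 
This is hedged only, assuming the failing file is a zero-length 
.clean.requested / .clean.inflight instant; it is not the committed fix:
{code:java}
import java.io.IOException;
import java.util.Optional;

import org.apache.hudi.avro.model.HoodieCleanerPlan;
import org.apache.hudi.common.util.AvroUtils;

class CleanPlanReadSketch {

  // Requested/inflight clean instants can be empty on 0.5.0 timelines; passing
  // zero-length content to the Avro DataFileReader is what produces the
  // "Not an Avro data file" IOException.
  static Optional<HoodieCleanerPlan> tryReadCleanerPlan(byte[] content) throws IOException {
    if (content == null || content.length == 0) {
      return Optional.empty();
    }
    return Optional.of(AvroUtils.deserializeAvroMetadata(content, HoodieCleanerPlan.class));
  }
}
{code}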

> Exception: Not an Avro data file when running HoodieCleanClient.runClean
> 
>
> Key: HUDI-716
> URL: https://issues.apache.org/jira/browse/HUDI-716
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-03-21-02-45-25-099.png, 
> image-2020-03-21-13-37-17-039.png
>
>
> Just upgraded to upstream master from 0.5 and seeing an issue at the end of 
> the delta sync run: 
> 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync 
> once. Shutting down
> org.apache.hudi.exception.HoodieIOException: Not an Avro data file
>  at org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144)
>  at org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
>  at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86)
>  at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843)
>  at org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>  at org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>  at org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>  at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395)
>  at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>  at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>  at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>  at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>  at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>  at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.io.IOException: Not an Avro data file
>  at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
>  at org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
>  at org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87)
>  at org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141)
>  ... 24 more
>  
> It is attempting to read an old cleanup file (2 months old) and crashing.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


  1   2   3   4   5   >