[jira] [Assigned] (HUDI-998) Introduce a robot to build the testing website automatically
[ https://issues.apache.org/jira/browse/HUDI-998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-998: --- Assignee: lamber-ken > Introduce a robot to build the testing website automatically > > > Key: HUDI-998 > URL: https://issues.apache.org/jira/browse/HUDI-998 > Project: Apache Hudi > Issue Type: Improvement >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-998) Introduce a robot to build the testing website automatically
lamber-ken created HUDI-998: --- Summary: Introduce a robot to build the testing website automatically Key: HUDI-998 URL: https://issues.apache.org/jira/browse/HUDI-998 Project: Apache Hudi Issue Type: Improvement Reporter: lamber-ken -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-998) Introduce a robot to build the testing website automatically
[ https://issues.apache.org/jira/browse/HUDI-998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-998: Status: Open (was: New) > Introduce a robot to build the testing website automatically > > > Key: HUDI-998 > URL: https://issues.apache.org/jira/browse/HUDI-998 > Project: Apache Hudi > Issue Type: Improvement >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-997) Debug possible memory leaks when running tests in hudi-client
[ https://issues.apache.org/jira/browse/HUDI-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126090#comment-17126090 ] lamber-ken commented on HUDI-997: - Sure, thanks :) > Debug possible memory leaks when running tests in hudi-client > - > > Key: HUDI-997 > URL: https://issues.apache.org/jira/browse/HUDI-997 > Project: Apache Hudi > Issue Type: Task > Components: Testing >Reporter: Balaji Varadarajan >Assignee: lamber-ken >Priority: Major > > Using visualvm, noticed the memory leak pattern of gradual increase in > reachable memory (after GC) as the tests progress. > Some possible candidates where I noticed marked increase in memory while the > tests were running : > > [INFO] Running org.apache.hudi.table.TestCleaner > [INFO] Running org.apache.hudi.table.TestHoodieMergeOnReadTable > [INFO] Running > org.apache.hudi.table.action.commit.TestCopyOnWriteActionExecutor > [INFO] Running org.apache.hudi.table.action.compact.TestAsyncCompaction > [INFO] Running org.apache.hudi.index.hbase.TestHBaseIndex > [INFO] Running org.apache.hudi.index.TestHoodieIndex > [INFO] Running org.apache.hudi.client.TestHoodieClientOnCopyOnWriteStorage > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-997) Debug possible memory leaks when running tests in hudi-client
[ https://issues.apache.org/jira/browse/HUDI-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-997: --- Assignee: lamber-ken > Debug possible memory leaks when running tests in hudi-client > - > > Key: HUDI-997 > URL: https://issues.apache.org/jira/browse/HUDI-997 > Project: Apache Hudi > Issue Type: Task > Components: Testing >Reporter: Balaji Varadarajan >Assignee: lamber-ken >Priority: Major > > Using visualvm, noticed the memory leak pattern of gradual increase in > reachable memory (after GC) as the tests progress. > Some possible candidates where I noticed marked increase in memory while the > tests were running : > > [INFO] Running org.apache.hudi.table.TestCleaner > [INFO] Running org.apache.hudi.table.TestHoodieMergeOnReadTable > [INFO] Running > org.apache.hudi.table.action.commit.TestCopyOnWriteActionExecutor > [INFO] Running org.apache.hudi.table.action.compact.TestAsyncCompaction > [INFO] Running org.apache.hudi.index.hbase.TestHBaseIndex > [INFO] Running org.apache.hudi.index.TestHoodieIndex > [INFO] Running org.apache.hudi.client.TestHoodieClientOnCopyOnWriteStorage > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-986) Support staging site per pull request
[ https://issues.apache.org/jira/browse/HUDI-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-986: Status: Open (was: New) > Support staging site per pull request > - > > Key: HUDI-986 > URL: https://issues.apache.org/jira/browse/HUDI-986 > Project: Apache Hudi > Issue Type: Improvement >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-986) Support staging site per pull request
lamber-ken created HUDI-986: --- Summary: Support staging site per pull request Key: HUDI-986 URL: https://issues.apache.org/jira/browse/HUDI-986 Project: Apache Hudi Issue Type: Improvement Reporter: lamber-ken -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-986) Support staging site per pull request
[ https://issues.apache.org/jira/browse/HUDI-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-986: --- Assignee: lamber-ken > Support staging site per pull request > - > > Key: HUDI-986 > URL: https://issues.apache.org/jira/browse/HUDI-986 > Project: Apache Hudi > Issue Type: Improvement >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-986) Support staging site per pull request
[ https://issues.apache.org/jira/browse/HUDI-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-986: Description: support staging site per pull request > Support staging site per pull request > - > > Key: HUDI-986 > URL: https://issues.apache.org/jira/browse/HUDI-986 > Project: Apache Hudi > Issue Type: Improvement >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > > support staging site per pull request -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-985) Introduce a rerun CI bot
[ https://issues.apache.org/jira/browse/HUDI-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-985: Status: Open (was: New) > Introduce a rerun CI bot > -- > > Key: HUDI-985 > URL: https://issues.apache.org/jira/browse/HUDI-985 > Project: Apache Hudi > Issue Type: Improvement >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > > Introduce a rerun CI bot to help rerun tests > > Replace > {code:bash} > git commit --amend/git push --force > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-985) Introduce a rerun CI bot
lamber-ken created HUDI-985: --- Summary: Introduce a rerun CI bot Key: HUDI-985 URL: https://issues.apache.org/jira/browse/HUDI-985 Project: Apache Hudi Issue Type: Improvement Reporter: lamber-ken Introduce a rerun CI bot to help rerun tests Replace {code:bash} git commit --amend/git push --force {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-985) Introduce a rerun CI bot
[ https://issues.apache.org/jira/browse/HUDI-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-985: --- Assignee: lamber-ken > Introduce a rerun CI bot > -- > > Key: HUDI-985 > URL: https://issues.apache.org/jira/browse/HUDI-985 > Project: Apache Hudi > Issue Type: Improvement >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > > Introduce a rerun CI bot to help rerun tests > > Replace > {code:bash} > git commit --amend/git push --force > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-937) Site content revamp ahead of graduation
[ https://issues.apache.org/jira/browse/HUDI-937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-937: Status: Open (was: New) > Site content revamp ahead of graduation > --- > > Key: HUDI-937 > URL: https://issues.apache.org/jira/browse/HUDI-937 > Project: Apache Hudi > Issue Type: Sub-task > Components: Release Administrative >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > > A few good things to do > > * Update all the users on the powered-by page > * Update the home page with new-features text, intro text, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HUDI-937) Site content revamp ahead of graduation
[ https://issues.apache.org/jira/browse/HUDI-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115402#comment-17115402 ] lamber-ken edited comment on HUDI-937 at 5/24/20, 5:59 PM: --- Dear [~vinoth], please let me know if I can be of more assistance; contact me through Slack :) was (Author: lamber-ken): dear [~vinoth] please let me know if I can be of more assistance, contact me through Slack :) > Site content revamp ahead of graduation > --- > > Key: HUDI-937 > URL: https://issues.apache.org/jira/browse/HUDI-937 > Project: Apache Hudi > Issue Type: Sub-task > Components: Release Administrative >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > > A few good things to do > > * Update all the users on the powered-by page > * Update the home page with new-features text, intro text, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-937) Site content revamp ahead of graduation
[ https://issues.apache.org/jira/browse/HUDI-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115402#comment-17115402 ] lamber-ken commented on HUDI-937: - dear [~vinoth] please let me know if I can be of more assistance, contact me through Slack :) > Site content revamp ahead of graduation > --- > > Key: HUDI-937 > URL: https://issues.apache.org/jira/browse/HUDI-937 > Project: Apache Hudi > Issue Type: Sub-task > Components: Release Administrative >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > > A few good things to do > > * Update all the users on the powered-by page > * Update the home page with new-features text, intro text, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-930) Check the private@, dev@ list
[ https://issues.apache.org/jira/browse/HUDI-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-930: --- Assignee: Vinoth Chandar > Check the private@, dev@ list > - > > Key: HUDI-930 > URL: https://issues.apache.org/jira/browse/HUDI-930 > Project: Apache Hudi > Issue Type: Sub-task > Components: Release Administrative >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > > * Check project-private mailing list membership. Mentors should be allowed to > remain if they wish to do so. The subscriber list should otherwise match that > on the resolution. See > [this|http://www.apache.org/dev/committers.html#mail-moderate] and the > [EZMLM|http://www.ezmlm.org/] "Moderator’s and Administrator’s Manual". > * Double-check that all of your lists have sufficient active > [moderators|http://www.apache.org/dev/committers.html#mailing-list-moderators]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (HUDI-923) Updating the site to reflect graduation
[ https://issues.apache.org/jira/browse/HUDI-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken closed HUDI-923. --- Resolution: Fixed > Updating the site to reflect graduation > --- > > Key: HUDI-923 > URL: https://issues.apache.org/jira/browse/HUDI-923 > Project: Apache Hudi > Issue Type: Sub-task > Components: Release Administrative >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Labels: pull-request-available > > * Remove references to "Incubating" in all pages, logos, text, etc. > * Update PMC/Chair/Committers information correctly. > * Remove the incubator disclaimer on the site? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (HUDI-923) Updating the site to reflect graduation
[ https://issues.apache.org/jira/browse/HUDI-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reopened HUDI-923: - > Updating the site to reflect graduation > --- > > Key: HUDI-923 > URL: https://issues.apache.org/jira/browse/HUDI-923 > Project: Apache Hudi > Issue Type: Sub-task > Components: Release Administrative >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Labels: pull-request-available > > * Remove references to "Incubating" in all pages, logos, text, etc. > * Update PMC/Chair/Committers information correctly. > * Remove the incubator disclaimer on the site? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-935) Update Travis name
[ https://issues.apache.org/jira/browse/HUDI-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-935: --- Assignee: leesf > Update Travis name > -- > > Key: HUDI-935 > URL: https://issues.apache.org/jira/browse/HUDI-935 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: leesf >Assignee: leesf >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-927) https://hudi.incubator.apache.org should auto redirect to https://hudi.apache.org
[ https://issues.apache.org/jira/browse/HUDI-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115340#comment-17115340 ] lamber-ken commented on HUDI-927: - Thanks [~smarthi] > https://hudi.incubator.apache.org should auto redirect to > https://hudi.apache.org > - > > Key: HUDI-927 > URL: https://issues.apache.org/jira/browse/HUDI-927 > Project: Apache Hudi > Issue Type: Sub-task > Components: Release Administrative >Reporter: Vinoth Chandar >Assignee: Suneel Marthi >Priority: Major > Fix For: 0.5.3 > > > This is still not happening. We need to wait a few days and, if it is still not working, raise an INFRA jira. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-897) Hudi support for log append scenario with better writes and asynchronous compaction
[ https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-897: Status: Open (was: New)
> Hudi support for log append scenario with better writes and asynchronous compaction
> --
>
> Key: HUDI-897
> URL: https://issues.apache.org/jira/browse/HUDI-897
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Compaction, Performance
> Reporter: liwei
> Assignee: liwei
> Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-05-14-19-51-37-938.png, image-2020-05-14-20-14-59-429.png
>
> 1. Scenario
> The business scenarios of the data lake mainly include analysis of databases, logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also aims at these three scenarios. [1]
>
> 2. Hudi current situation
> At present, Hudi supports the scenario where database CDC data is incrementally written to Hudi fairly well, and bulk-loading files into Hudi is also in progress. However, there is no good native support for log scenarios (which require high-throughput writes, no updates or deletes, and focus on small-file handling); writes can currently go through inserts without deduplication, but they will still be merged on the write side.
> * In copy-on-write mode, with "hoodie.parquet.small.file.limit" at 100MB, every small batch costs some time to merge, which reduces write throughput.
> * This scenario is not suitable for merge-on-read.
> * The actual scenario only needs to write parquet files in batches, and then provide compaction afterwards (similar to Delta Lake).
>
> 3. What we can do
> 1. On the write side, just write every batch to a parquet file based on the snapshot mechanism; merging is enabled by default, and users can disable auto-merge for more write throughput.
> 2. Hudi should support asynchronously merging small parquet files, like Databricks Delta Lake's OPTIMIZE command. [2]
>
> [1] [https://databricks.com/product/delta-lake-on-databricks]
> [2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html]
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HUDI-897) Hudi support for log append scenario with better writes and asynchronous compaction
[ https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17107378#comment-17107378 ] lamber-ken edited comment on HUDI-897 at 5/14/20, 2:50 PM: --- Great addition from my side (y) was (Author: lamber-ken): Gread addtion from my side (y)
> Hudi support for log append scenario with better writes and asynchronous compaction
> --
>
> Key: HUDI-897
> URL: https://issues.apache.org/jira/browse/HUDI-897
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Compaction, Performance
> Reporter: liwei
> Assignee: liwei
> Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-05-14-19-51-37-938.png, image-2020-05-14-20-14-59-429.png
>
> 1. Scenario
> The business scenarios of the data lake mainly include analysis of databases, logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also aims at these three scenarios. [1]
>
> 2. Hudi current situation
> At present, Hudi supports the scenario where database CDC data is incrementally written to Hudi fairly well, and bulk-loading files into Hudi is also in progress. However, there is no good native support for log scenarios (which require high-throughput writes, no updates or deletes, and focus on small-file handling); writes can currently go through inserts without deduplication, but they will still be merged on the write side.
> * In copy-on-write mode, with "hoodie.parquet.small.file.limit" at 100MB, every small batch costs some time to merge, which reduces write throughput.
> * This scenario is not suitable for merge-on-read.
> * The actual scenario only needs to write parquet files in batches, and then provide compaction afterwards (similar to Delta Lake).
>
> 3. What we can do
> 1. On the write side, just write every batch to a parquet file based on the snapshot mechanism; merging is enabled by default, and users can disable auto-merge for more write throughput.
> 2. Hudi should support asynchronously merging small parquet files, like Databricks Delta Lake's OPTIMIZE command. [2]
>
> [1] [https://databricks.com/product/delta-lake-on-databricks]
> [2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html]
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-897) Hudi support for log append scenario with better writes and asynchronous compaction
[ https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17107378#comment-17107378 ] lamber-ken commented on HUDI-897: - Gread addtion from my side (y)
> Hudi support for log append scenario with better writes and asynchronous compaction
> --
>
> Key: HUDI-897
> URL: https://issues.apache.org/jira/browse/HUDI-897
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Compaction, Performance
> Reporter: liwei
> Assignee: liwei
> Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-05-14-19-51-37-938.png, image-2020-05-14-20-14-59-429.png
>
> 1. Scenario
> The business scenarios of the data lake mainly include analysis of databases, logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also aims at these three scenarios. [1]
>
> 2. Hudi current situation
> At present, Hudi supports the scenario where database CDC data is incrementally written to Hudi fairly well, and bulk-loading files into Hudi is also in progress. However, there is no good native support for log scenarios (which require high-throughput writes, no updates or deletes, and focus on small-file handling); writes can currently go through inserts without deduplication, but they will still be merged on the write side.
> * In copy-on-write mode, with "hoodie.parquet.small.file.limit" at 100MB, every small batch costs some time to merge, which reduces write throughput.
> * This scenario is not suitable for merge-on-read.
> * The actual scenario only needs to write parquet files in batches, and then provide compaction afterwards (similar to Delta Lake).
>
> 3. What we can do
> 1. On the write side, just write every batch to a parquet file based on the snapshot mechanism; merging is enabled by default, and users can disable auto-merge for more write throughput.
> 2. Hudi should support asynchronously merging small parquet files, like Databricks Delta Lake's OPTIMIZE command. [2]
>
> [1] [https://databricks.com/product/delta-lake-on-databricks]
> [2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html]
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-897) Hudi support for log append scenario with better writes and asynchronous compaction
[ https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-897: Issue Type: Improvement (was: Bug)
> Hudi support for log append scenario with better writes and asynchronous compaction
> --
>
> Key: HUDI-897
> URL: https://issues.apache.org/jira/browse/HUDI-897
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Compaction, Performance
> Reporter: liwei
> Assignee: liwei
> Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-05-14-19-51-37-938.png, image-2020-05-14-20-14-59-429.png
>
> 1. Scenario
> The business scenarios of the data lake mainly include analysis of databases, logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also aims at these three scenarios. [1]
>
> 2. Hudi current situation
> At present, Hudi supports the scenario where database CDC data is incrementally written to Hudi fairly well, and bulk-loading files into Hudi is also in progress. However, there is no good native support for log scenarios (which require high-throughput writes, no updates or deletes, and focus on small-file handling); writes can currently go through inserts without deduplication, but they will still be merged on the write side.
> * In copy-on-write mode, with "hoodie.parquet.small.file.limit" at 100MB, every small batch costs some time to merge, which reduces write throughput.
> * This scenario is not suitable for merge-on-read.
> * The actual scenario only needs to write parquet files in batches, and then provide compaction afterwards (similar to Delta Lake).
>
> 3. What we can do
> 1. On the write side, just write every batch to a parquet file based on the snapshot mechanism; merging is enabled by default, and users can disable auto-merge for more write throughput.
> 2. Hudi should support asynchronously merging small parquet files, like Databricks Delta Lake's OPTIMIZE command. [2]
>
> [1] [https://databricks.com/product/delta-lake-on-databricks]
> [2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html]
-- This message was sent by Atlassian Jira (v8.3.4#803005)
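[Editor's illustration] To make the write-side idea in HUDI-897 concrete, here is a minimal, hypothetical sketch of an append-only batch write, not the proposal's implementation. It assumes a Spark Dataset<Row> of log events with "uuid" and "ts" columns; the option keys are standard Hudi configs (setting hoodie.parquet.small.file.limit to 0 turns off small-file merging), while the table name and field choices are illustrative.
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class LogAppendWrite {
  // Write one batch of log events as plain inserts, with small-file merging
  // turned off so each batch lands in new parquet files.
  static void writeBatch(Dataset<Row> batch, String basePath) {
    batch.write()
        .format("org.apache.hudi")
        .option("hoodie.table.name", "log_events")
        .option("hoodie.datasource.write.operation", "insert")   // no upsert
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.combine.before.insert", "false")         // skip pre-write dedup
        .option("hoodie.parquet.small.file.limit", "0")          // 0 disables small-file merging
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
{code}
Compaction of the resulting small files would then run as a separate background job, which is the asynchronous piece the proposal asks for.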
[jira] [Resolved] (HUDI-891) Improve website for graduation-required content
[ https://issues.apache.org/jira/browse/HUDI-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken resolved HUDI-891. - Resolution: Fixed > Improve website for graduation-required content > > > Key: HUDI-891 > URL: https://issues.apache.org/jira/browse/HUDI-891 > Project: Apache Hudi (incubating) > Issue Type: Improvement >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > > Check site > [https://whimsy.apache.org/pods/project/hudi] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-891) Improve website for graduation-required content
[ https://issues.apache.org/jira/browse/HUDI-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-891: --- Assignee: lamber-ken > Improve website for graduation-required content > > > Key: HUDI-891 > URL: https://issues.apache.org/jira/browse/HUDI-891 > Project: Apache Hudi (incubating) > Issue Type: Improvement >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > > Check site > [https://whimsy.apache.org/pods/project/hudi] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-891) Improve website for graduation-required content
[ https://issues.apache.org/jira/browse/HUDI-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-891: Status: Open (was: New) > Improve website for graduation-required content > > > Key: HUDI-891 > URL: https://issues.apache.org/jira/browse/HUDI-891 > Project: Apache Hudi (incubating) > Issue Type: Improvement >Reporter: lamber-ken >Priority: Major > > Check site > [https://whimsy.apache.org/pods/project/hudi] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-891) Improve website for graduation-required content
lamber-ken created HUDI-891: --- Summary: Improve website for graduation-required content Key: HUDI-891 URL: https://issues.apache.org/jira/browse/HUDI-891 Project: Apache Hudi (incubating) Issue Type: Improvement Reporter: lamber-ken Check site [https://whimsy.apache.org/pods/project/hudi] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-884) Hive Syncing using standalone tool failing due to avro version mismatches
[ https://issues.apache.org/jira/browse/HUDI-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-884: Status: Open (was: New) > Hive Syncing using standalone tool failing due to avro version mismatches > - > > Key: HUDI-884 > URL: https://issues.apache.org/jira/browse/HUDI-884 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Hive Integration >Reporter: Balaji Varadarajan >Assignee: Udit Mehrotra >Priority: Blocker > Fix For: 0.6.0 > > > Context : [https://github.com/apache/incubator-hudi/issues/1610] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-884) Hive Syncing using standalone tool failing due to avro version mismatches
[ https://issues.apache.org/jira/browse/HUDI-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken resolved HUDI-884. - Resolution: Fixed Thanks [~uditme] > Hive Syncing using standalone tool failing due to avro version mismatches > - > > Key: HUDI-884 > URL: https://issues.apache.org/jira/browse/HUDI-884 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Hive Integration >Reporter: Balaji Varadarajan >Assignee: Udit Mehrotra >Priority: Blocker > Fix For: 0.6.0 > > > Context : [https://github.com/apache/incubator-hudi/issues/1610] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105240#comment-17105240 ] lamber-ken commented on HUDI-494: - Seems like a bug; we can discuss this in depth. > [DEBUGGING] Huge amount of tasks when writing files into HDFS > - > > Key: HUDI-494 > URL: https://issues.apache.org/jira/browse/HUDI-494 > Project: Apache Hudi (incubating) > Issue Type: Test >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Major > Labels: bug-bash-0.6.0, pull-request-available > Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot > 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, > image-2020-01-05-07-30-53-567.png > > I am using a manually built master after the > [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] > commit. EDIT: tried with the latest master but got the same result. > I am seeing 3 million tasks when the Hudi Spark job is writing the files into > HDFS. It seems related to the input size: with 7.7 GB of input it was 3.2 > million tasks, with 9 GB of input it was 3.7 million, both with a parallelism of 10. > I am seeing a huge number of 0-byte files being written into the .hoodie/.temp/ > folder in my HDFS. In the Spark UI, each task only writes fewer than 10 > records in > {code:java} > count at HoodieSparkSqlWriter{code} > All the stages before this seem normal. Any idea what happened here? My > first guess would be something related to the bloom filter index. Maybe > something triggers repartitioning with the bloom filter index? But I am > not really familiar with that part of the code. > Thanks > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-600) Cleaner fails with AVRO exception when upgrading from 0.5.0 to master
[ https://issues.apache.org/jira/browse/HUDI-600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-600: --- Assignee: Balaji Varadarajan > Cleaner fails with AVRO exception when upgrading from 0.5.0 to master > - > > Key: HUDI-600 > URL: https://issues.apache.org/jira/browse/HUDI-600 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Cleaner >Reporter: Nishith Agarwal >Assignee: Balaji Varadarajan >Priority: Major > Labels: help-requested > Fix For: 0.6.0 > > > ``` > org.apache.avro.AvroTypeException: Found > org.apache.hudi.avro.model.HoodieCleanMetadata, expecting > org.apache.hudi.avro.model.HoodieCleanerPlan, missing required field policy > at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292) > at org.apache.avro.io.parsing.Parser.advance(Parser.java:88) > at > org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130) > at > org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215) > at > org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145) > at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233) > at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220) > at > org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:149) > at > org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88) > at org.apache.hudi.HoodieCleanClient.runClean(HoodieCleanClient.java:144) > at org.apache.hudi.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:89) > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) > at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647) > at org.apache.hudi.HoodieCleanClient.clean(HoodieCleanClient.java:87) > at org.apache.hudi.HoodieWriteClient.clean(HoodieWriteClient.java:837) > at org.apache.hudi.HoodieWriteClient.postCommit(HoodieWriteClient.java:514) > at > org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:156) > at > org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:100) > at > org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:91) > at > org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:261) > at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:183) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) > ``` > > [~varadarb] any ideas about this ? > > [~thesquelched] fyi -- This message was sent
[jira] [Commented] (HUDI-600) Cleaner fails with AVRO exception when upgrading from 0.5.0 to master
[ https://issues.apache.org/jira/browse/HUDI-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103994#comment-17103994 ] lamber-ken commented on HUDI-600: - [~vbalaji] good job! left one minor comment. > Cleaner fails with AVRO exception when upgrading from 0.5.0 to master > - > > Key: HUDI-600 > URL: https://issues.apache.org/jira/browse/HUDI-600 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Cleaner >Reporter: Nishith Agarwal >Priority: Major > Labels: help-requested > Fix For: 0.6.0 > > > ``` > org.apache.avro.AvroTypeException: Found > org.apache.hudi.avro.model.HoodieCleanMetadata, expecting > org.apache.hudi.avro.model.HoodieCleanerPlan, missing required field policy > at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292) > at org.apache.avro.io.parsing.Parser.advance(Parser.java:88) > at > org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130) > at > org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215) > at > org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145) > at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233) > at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220) > at > org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:149) > at > org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88) > at org.apache.hudi.HoodieCleanClient.runClean(HoodieCleanClient.java:144) > at org.apache.hudi.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:89) > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) > at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647) > at org.apache.hudi.HoodieCleanClient.clean(HoodieCleanClient.java:87) > at org.apache.hudi.HoodieWriteClient.clean(HoodieWriteClient.java:837) > at org.apache.hudi.HoodieWriteClient.postCommit(HoodieWriteClient.java:514) > at > org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:156) > at > org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:100) > at > org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:91) > at > org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:261) > at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:183) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) > ``` > > [~varadarb] any ideas about this ? > > [~thesquelched] fyi -- This message was
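[Editor's illustration] The AvroTypeException in HUDI-600 is classic Avro schema resolution: the cleaner supplies HoodieCleanerPlan as the expected (reader) schema while the file on disk was written as HoodieCleanMetadata, so resolution fails on the missing required "policy" field. A self-contained sketch with toy stand-in schemas (not the actual Hudi avro classes) reproduces the same failure:
{code:java}
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class ReaderSchemaMismatch {
  public static void main(String[] args) throws IOException {
    // Toy stand-ins for HoodieCleanMetadata (written) and HoodieCleanerPlan (expected).
    Schema written = SchemaBuilder.record("CleanMetadata").fields()
        .requiredString("startCleanTime").endRecord();
    Schema expected = SchemaBuilder.record("CleanerPlan").fields()
        .requiredString("policy").endRecord();

    File file = File.createTempFile("clean", ".avro");
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(written))) {
      writer.create(written, file);
      GenericRecord rec = new GenericData.Record(written);
      rec.put("startCleanTime", "20200101000000");
      writer.append(rec);
    }

    // Reading with the wrong expected (reader) schema throws
    // org.apache.avro.AvroTypeException: ... missing required field policy
    try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
             file, new GenericDatumReader<GenericRecord>(written, expected))) {
      reader.next();
    }
  }
}
{code}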
[jira] [Closed] (HUDI-787) Implement HoodieGlobalBloomIndexV2
[ https://issues.apache.org/jira/browse/HUDI-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken closed HUDI-787. --- Resolution: Fixed Finished: [https://github.com/apache/incubator-hudi/pull/1469] > Implement HoodieGlobalBloomIndexV2 > -- > > Key: HUDI-787 > URL: https://issues.apache.org/jira/browse/HUDI-787 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Index >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > > Implement HoodieGlobalBloomIndexV2 based on HoodieBloomIndexV2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-836) Implement datadog metrics reporter
[ https://issues.apache.org/jira/browse/HUDI-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-836: --- Assignee: Raymond Xu > Implement datadog metrics reporter > -- > > Key: HUDI-836 > URL: https://issues.apache.org/jira/browse/HUDI-836 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Common Core >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Major > Fix For: 0.6.0 > > > To implement a new metrics reporter type for datadog API -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-836) Implement datadog metrics reporter
[ https://issues.apache.org/jira/browse/HUDI-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091134#comment-17091134 ] lamber-ken commented on HUDI-836: - (y) > Implement datadog metrics reporter > -- > > Key: HUDI-836 > URL: https://issues.apache.org/jira/browse/HUDI-836 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Common Core >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Major > Fix For: 0.6.0 > > > To implement a new metrics reporter type for datadog API -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-787) Implement HoodieGlobalBloomIndexV2
[ https://issues.apache.org/jira/browse/HUDI-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-787: --- Assignee: lamber-ken > Implement HoodieGlobalBloomIndexV2 > -- > > Key: HUDI-787 > URL: https://issues.apache.org/jira/browse/HUDI-787 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Index >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > > Implement HoodieGlobalBloomIndexV2 based on HoodieBloomIndexV2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-821) Fix the wrong annotation of JCommander IStringConverter
[ https://issues.apache.org/jira/browse/HUDI-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-821: Status: Open (was: New) > Fix the wrong annotation of JCommander IStringConverter > --- > > Key: HUDI-821 > URL: https://issues.apache.org/jira/browse/HUDI-821 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: DeltaStreamer >Reporter: dengziming >Assignee: dengziming >Priority: Minor > Labels: pull-request-available > > Please refer to https://github.com/cbeust/jcommander/issues/253. > If you define a list as an argument to be parsed with an IStringConverter, > JCommander will create a List<List<String>> instead of a List<String>. > We should change `converter = TransformersConverter.class` to `converter = > StringConverter.class, listConverter = TransformersConverter.class`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (HUDI-821) Fix the wrong annotation of JCommander IStringConverter
[ https://issues.apache.org/jira/browse/HUDI-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken closed HUDI-821. --- Resolution: Fixed > Fix the wrong annotation of JCommander IStringConverter > --- > > Key: HUDI-821 > URL: https://issues.apache.org/jira/browse/HUDI-821 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: DeltaStreamer >Reporter: dengziming >Assignee: dengziming >Priority: Minor > Labels: pull-request-available > > Please refer to https://github.com/cbeust/jcommander/issues/253. > If you define a list as an argument to be parsed with an IStringConverter, > JCommander will create a List<List<String>> instead of a List<String>. > We should change `converter = TransformersConverter.class` to `converter = > StringConverter.class, listConverter = TransformersConverter.class`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
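[Editor's illustration] A hedged sketch of the annotation change described in HUDI-821; TransformersConverter is stubbed here for illustration and is not the actual Hudi class, but the @Parameter attributes mirror the fix:
{code:java}
import com.beust.jcommander.IStringConverter;
import com.beust.jcommander.JCommander;
import com.beust.jcommander.Parameter;
import com.beust.jcommander.converters.StringConverter;
import java.util.Arrays;
import java.util.List;

public class JCommanderListConverterFix {

  // Stand-in for Hudi's converter: splits one raw argument into a list.
  static class TransformersConverter implements IStringConverter<List<String>> {
    @Override
    public List<String> convert(String value) {
      return Arrays.asList(value.split(","));
    }
  }

  static class Config {
    // Buggy form (per the description): `converter = TransformersConverter.class`
    // alone makes JCommander convert each token into a List, yielding List<List<String>>.
    // Fixed form: per-token converter plus listConverter for the whole value.
    @Parameter(names = "--transformer-class",
        converter = StringConverter.class,
        listConverter = TransformersConverter.class)
    public List<String> transformerClassNames;
  }

  public static void main(String[] args) {
    Config cfg = new Config();
    JCommander.newBuilder().addObject(cfg).build()
        .parse("--transformer-class", "a.TransformerA,b.TransformerB");
    System.out.println(cfg.transformerClassNames); // [a.TransformerA, b.TransformerB]
  }
}
{code}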
[jira] [Resolved] (HUDI-827) Translation error
[ https://issues.apache.org/jira/browse/HUDI-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken resolved HUDI-827. - Resolution: Fixed > Translation error > - > > Key: HUDI-827 > URL: https://issues.apache.org/jira/browse/HUDI-827 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: docs-chinese >Affects Versions: 0.5.2 >Reporter: Lisheng Wang >Assignee: Lisheng Wang >Priority: Minor > Labels: pull-request-available > Fix For: 0.6.0 > > > found translation error in > [https://hudi.apache.org/cn/docs/writing_data.html], > "如优化文件大小之类后", should be "如优化文件大小之后" -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-818) Optimize the default value of hoodie.memory.merge.max.size option
[ https://issues.apache.org/jira/browse/HUDI-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-818: Fix Version/s: 0.6.0 > Optimize the default value of hoodie.memory.merge.max.size option > - > > Key: HUDI-818 > URL: https://issues.apache.org/jira/browse/HUDI-818 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: Performance >Reporter: lamber-ken >Priority: Major > Fix For: 0.6.0 > > The default value of the hoodie.memory.merge.max.size option cannot meet some users' performance requirements > [https://github.com/apache/incubator-hudi/issues/1491] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-818) Optimize the default value of hoodie.memory.merge.max.size option
[ https://issues.apache.org/jira/browse/HUDI-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-818: Status: Open (was: New) > Optimize the default value of hoodie.memory.merge.max.size option > - > > Key: HUDI-818 > URL: https://issues.apache.org/jira/browse/HUDI-818 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: Performance >Reporter: lamber-ken >Priority: Major > > The default value of the hoodie.memory.merge.max.size option cannot meet some users' performance requirements > [https://github.com/apache/incubator-hudi/issues/1491] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-818) Optimize the default value of hoodie.memory.merge.max.size option
lamber-ken created HUDI-818: --- Summary: Optimize the default value of hoodie.memory.merge.max.size option Key: HUDI-818 URL: https://issues.apache.org/jira/browse/HUDI-818 Project: Apache Hudi (incubating) Issue Type: Improvement Components: Performance Reporter: lamber-ken The default value of the hoodie.memory.merge.max.size option cannot meet some users' performance requirements [https://github.com/apache/incubator-hudi/issues/1491] -- This message was sent by Atlassian Jira (v8.3.4#803005)
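[Editor's illustration] Until a better default lands, hoodie.memory.merge.max.size (a byte count) can be overridden per write. A minimal sketch, assuming a Spark Dataset<Row>; the table name, path, and the 1 GB value are arbitrary examples, not the proposed default:
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class MergeMemoryOverride {
  static void write(Dataset<Row> df, String basePath) {
    df.write()
        .format("org.apache.hudi")
        .option("hoodie.table.name", "my_table")
        // Raise the merge memory budget to 1 GB (in bytes); example value only.
        .option("hoodie.memory.merge.max.size", String.valueOf(1024L * 1024 * 1024))
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
{code}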
[jira] [Commented] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean
[ https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085345#comment-17085345 ] lamber-ken commented on HUDI-716: - [https://github.com/apache/incubator-hudi/pull/1432] > Exception: Not an Avro data file when running HoodieCleanClient.runClean > > > Key: HUDI-716 > URL: https://issues.apache.org/jira/browse/HUDI-716 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: DeltaStreamer >Reporter: Alexander Filipchik >Assignee: lamber-ken >Priority: Major > Labels: pull-request-available > Fix For: 0.6.0 > > Attachments: image-2020-03-21-02-45-25-099.png, > image-2020-03-21-13-37-17-039.png > > Time Spent: 20m > Remaining Estimate: 0h > > Just upgraded to upstream master from 0.5 and seeing an issue at the end of > the delta sync run: > 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync > once. Shutting down20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error > running delta sync once. Shutting > downorg.apache.hudi.exception.HoodieIOException: Not an Avro data file at > org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144) > at > org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88) > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) > at > java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) > at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86) > at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843) > at > org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520) > at > org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168) > at > org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) > at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at > org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at > org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at > org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: > java.io.IOException: Not an Avro data file at > org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) at > org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147) > at > org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87) > at > 
org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141) > ... 24 more > > It is attempting to read an old cleanup file (2 months old) and crashing > -- This message was sent by Atlassian Jira (v8.3.4#803005)
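[Editor's illustration] The failure pattern in HUDI-716 — DataFileReader rejecting a file that lacks the Avro magic bytes, such as an empty or partially written clean file — can be reproduced with a short sketch (exception message per Avro 1.8.x):
{code:java}
import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class NotAnAvroDataFile {
  public static void main(String[] args) throws IOException {
    File empty = File.createTempFile("clean-instant", ".avro"); // zero-byte file
    try {
      DataFileReader.openReader(empty, new GenericDatumReader<GenericRecord>());
    } catch (IOException e) {
      // With Avro 1.8.x this prints "Not an Avro data file"
      System.out.println(e.getMessage());
    }
  }
}
{code}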
[jira] [Assigned] (HUDI-765) Implement OrcReaderIterator
[ https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-765: --- Assignee: (was: lamber-ken) > Implement OrcReaderIterator > --- > > Key: HUDI-765 > URL: https://issues.apache.org/jira/browse/HUDI-765 > Project: Apache Hudi (incubating) > Issue Type: Sub-task >Reporter: lamber-ken >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-763) Add hoodie.table.base.file.format option to hoodie.properties file
[ https://issues.apache.org/jira/browse/HUDI-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-763: Summary: Add hoodie.table.base.file.format option to hoodie.properties file (was: Add storage type option to hoodie.properties file) > Add hoodie.table.base.file.format option to hoodie.properties file > -- > > Key: HUDI-763 > URL: https://issues.apache.org/jira/browse/HUDI-763 > Project: Apache Hudi (incubating) > Issue Type: Sub-task > Components: Storage Management >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > > Add an option like "hoodie.table.storage.type=ORC" to hoodie.properties file -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-787) Implement HoodieGlobalBloomIndexV2
lamber-ken created HUDI-787: --- Summary: Implement HoodieGlobalBloomIndexV2 Key: HUDI-787 URL: https://issues.apache.org/jira/browse/HUDI-787 Project: Apache Hudi (incubating) Issue Type: New Feature Components: Index Reporter: lamber-ken Implement HoodieGlobalBloomIndexV2 based on HoodieBloomIndexV2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-57) [UMBRELLA] Support ORC Storage
[ https://issues.apache.org/jira/browse/HUDI-57?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078643#comment-17078643 ] lamber-ken commented on HUDI-57: [~vinoth], I'm not sure.
||task||version||
|1. add storage type|0.6.0|
|2. support write|0.6.0|
|3. support reader|0.6.0|
|4. hive / spark query|maybe 0.6.1 / 0.6.2|
BTW, [~garyli1019] is also interested in this; we will work together to accelerate the implementation.
*PS:*
* I'm working on BloomV2
* [~garyli1019] is working on Spark realtime query
We will come back here soon; any suggestions are welcome :)
> [UMBRELLA] Support ORC Storage
> --
>
> Key: HUDI-57
> URL: https://issues.apache.org/jira/browse/HUDI-57
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Hive Integration, Writer Core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> [https://github.com/uber/hudi/issues/68]
> https://github.com/uber/hudi/issues/155
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-132) Automate doc update/deploy process
[ https://issues.apache.org/jira/browse/HUDI-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076608#comment-17076608 ] lamber-ken commented on HUDI-132: - Hi [~vinoth], this jira looks the same as https://issues.apache.org/jira/browse/HUDI-504 ? > Automate doc update/deploy process > -- > > Key: HUDI-132 > URL: https://issues.apache.org/jira/browse/HUDI-132 > Project: Apache Hudi (incubating) > Issue Type: Task > Components: Release Administrative >Reporter: Vinoth Chandar >Assignee: lamber-ken >Priority: Major > Fix For: 0.6.0 > > The current docs (i.e. the content powering hudi.apache.org) build, test, and deploy > process is described at > [https://github.com/apache/incubator-hudi/tree/asf-site] > It's a two-step process (1. change .md/template files, 2. generate the site > and upload) for making any changes. It would be nice to have automation on > GitHub Actions to automate the deploy of docs on the `asf-site` branch, such > that devs can just edit the docs and the merge will build and deploy the site > automatically. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-765) Implement OrcReaderIterator
[ https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-765: --- Assignee: lamber-ken > Implement OrcReaderIterator > --- > > Key: HUDI-765 > URL: https://issues.apache.org/jira/browse/HUDI-765 > Project: Apache Hudi (incubating) > Issue Type: Sub-task >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-765) Implement OrcReaderIterator
[ https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-765: Parent: HUDI-57 Issue Type: Sub-task (was: New Feature) > Implement OrcReaderIterator > --- > > Key: HUDI-765 > URL: https://issues.apache.org/jira/browse/HUDI-765 > Project: Apache Hudi (incubating) > Issue Type: Sub-task >Reporter: lamber-ken >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-765) Implement OrcReaderIterator
[ https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-765: Status: Open (was: New) > Implement OrcReaderIterator > --- > > Key: HUDI-765 > URL: https://issues.apache.org/jira/browse/HUDI-765 > Project: Apache Hudi (incubating) > Issue Type: New Feature >Reporter: lamber-ken >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-763) Add storage type option to hoodie.properties file
[ https://issues.apache.org/jira/browse/HUDI-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-763: Priority: Major (was: Critical) > Add storage type option to hoodie.properties file > - > > Key: HUDI-763 > URL: https://issues.apache.org/jira/browse/HUDI-763 > Project: Apache Hudi (incubating) > Issue Type: Sub-task > Components: Storage Management >Reporter: lamber-ken >Priority: Major > > Add an option like "hoodie.table.storage.type=ORC" to hoodie.properties file -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-765) Implement OrcReaderIterator
lamber-ken created HUDI-765: --- Summary: Implement OrcReaderIterator Key: HUDI-765 URL: https://issues.apache.org/jira/browse/HUDI-765 Project: Apache Hudi (incubating) Issue Type: New Feature Reporter: lamber-ken -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-764) Implement HoodieOrcWriter
[ https://issues.apache.org/jira/browse/HUDI-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-764: --- Assignee: lamber-ken > Implement HoodieOrcWriter > - > > Key: HUDI-764 > URL: https://issues.apache.org/jira/browse/HUDI-764 > Project: Apache Hudi (incubating) > Issue Type: Sub-task > Components: Storage Management >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Critical > > Implement HoodieOrcWriter > * Avro to ORC schema > * Write records row by row -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-764) Implement HoodieOrcWriter
lamber-ken created HUDI-764: --- Summary: Implement HoodieOrcWriter Key: HUDI-764 URL: https://issues.apache.org/jira/browse/HUDI-764 Project: Apache Hudi (incubating) Issue Type: Sub-task Components: Storage Management Reporter: lamber-ken Implement HoodieOrcWriter * Avro to ORC schema * Write record in row -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-764) Implement HoodieOrcWriter
[ https://issues.apache.org/jira/browse/HUDI-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-764: Status: Open (was: New) > Implement HoodieOrcWriter > - > > Key: HUDI-764 > URL: https://issues.apache.org/jira/browse/HUDI-764 > Project: Apache Hudi (incubating) > Issue Type: Sub-task > Components: Storage Management >Reporter: lamber-ken >Priority: Critical > > Implement HoodieOrcWriter > * Avro to ORC schema > * Write record in row -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-763) Add storage type option to hoodie.properties file
[ https://issues.apache.org/jira/browse/HUDI-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-763: Status: Open (was: New) > Add storage type option to hoodie.properties file > - > > Key: HUDI-763 > URL: https://issues.apache.org/jira/browse/HUDI-763 > Project: Apache Hudi (incubating) > Issue Type: Sub-task > Components: Storage Management >Reporter: lamber-ken >Priority: Critical > > Add an option like "hoodie.table.storage.type=ORC" to hoodie.properties file -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-763) Add storage type option to hoodie.properties file
lamber-ken created HUDI-763: --- Summary: Add storage type option to hoodie.properties file Key: HUDI-763 URL: https://issues.apache.org/jira/browse/HUDI-763 Project: Apache Hudi (incubating) Issue Type: Sub-task Components: Storage Management Reporter: lamber-ken Add an option like "hoodie.table.storage.type=ORC" to hoodie.properties file -- This message was sent by Atlassian Jira (v8.3.4#803005)
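To make the proposal concrete, a minimal sketch of reading such an option with plain java.util.Properties follows. The helper class and the PARQUET fallback are assumptions for illustration; in Hudi this would belong in the table-config handling around hoodie.properties.
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

public class StorageTypeReader {
  private static final String STORAGE_TYPE_KEY = "hoodie.table.storage.type";

  // Reads the proposed option from <metaDir>/hoodie.properties.
  public static String readStorageType(Path metaDir) throws IOException {
    Properties props = new Properties();
    try (InputStream in = Files.newInputStream(metaDir.resolve("hoodie.properties"))) {
      props.load(in);
    }
    // Falling back to PARQUET when the option is absent is an assumption, not part of the ticket.
    return props.getProperty(STORAGE_TYPE_KEY, "PARQUET");
  }

  public static void main(String[] args) throws IOException {
    System.out.println(readStorageType(Paths.get(args[0], ".hoodie")));
  }
}
{code}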
[jira] [Commented] (HUDI-722) IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when writing parquet
[ https://issues.apache.org/jira/browse/HUDI-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073006#comment-17073006 ] lamber-ken commented on HUDI-722: - IMO, this issue is hard to debug because the writes do not always fail. I'm sorry that I can't do a f2f session because of network restrictions. :( > IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when > writing parquet > - > > Key: HUDI-722 > URL: https://issues.apache.org/jira/browse/HUDI-722 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Writer Core >Reporter: Alexander Filipchik >Assignee: lamber-ken >Priority: Major > Fix For: 0.6.0 > > > Some writes fail with java.lang.IndexOutOfBoundsException : Invalid array > range: X to X inside MessageColumnIORecordConsumer.addBinary call. > Specifically: getColumnWriter().write(value, r[currentLevel], > currentColumnIO.getDefinitionLevel()); > fails as size of r is the same as current level. What can be causing it? > > It gets executed via: ParquetWriter.write(IndexedRecord) Library version: > 1.10.1 Avro is a very complex object (~2.5k columns, highly nested, arrays of > unions present). > But what is surprising is that it fails to write top level field: > PrimitiveColumnIO _hoodie_commit_time r:0 d:1 [_hoodie_commit_time] which is > the first top level field in Avro: {"_hoodie_commit_time": "20200317215711", > "_hoodie_commit_seqno": "20200317215711_0_650", -- This message was sent by Atlassian Jira (v8.3.4#803005)
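To make the failing call path concrete, the sketch below writes one Avro record through parquet-avro, which internally routes string fields through MessageColumnIORecordConsumer.addBinary. Note that this trivial one-field schema will not reproduce the bug, which evidently needs the very large, deeply nested schema described above; the output path is an assumption.
{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetWriteSketch {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":["
            + "{\"name\":\"_hoodie_commit_time\",\"type\":\"string\"}]}");
    GenericRecord rec = new GenericData.Record(schema);
    rec.put("_hoodie_commit_time", "20200317215711");
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("/tmp/sketch.parquet"))
        .withSchema(schema)
        .build()) {
      writer.write(rec); // ends up in MessageColumnIORecordConsumer.addBinary
    }
  }
}
{code}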
[jira] [Commented] (HUDI-718) java.lang.ClassCastException during upsert
[ https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072975#comment-17072975 ] lamber-ken commented on HUDI-718: - IMO, hudi-0.5 depends on avro-1.7.0 and hudi-0.5.2 depends on avro-1.8.2, so the two may not be compatible. As another way to solve this issue, please try replacing the "fixed" type with "string". > java.lang.ClassCastException during upsert > -- > > Key: HUDI-718 > URL: https://issues.apache.org/jira/browse/HUDI-718 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: DeltaStreamer >Reporter: Alexander Filipchik >Assignee: lamber-ken >Priority: Major > Fix For: 0.6.0 > > Attachments: image-2020-03-21-16-49-28-905.png > > > Dataset was created using hudi 0.5 and now trying to migrate it to the latest > master. The table is written using SqlTransformer. Exception: > > Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge > old record into new file for key bla.bla from old file > gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet > to new file > gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423) > at > org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37) > at > org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be > cast to org.apache.avro.generic.GenericFixed > at > org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336) > at > org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275) > at > org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191) > at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) > at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299) > at > org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103) > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242) > ... 8 more -- This message was sent by Atlassian Jira (v8.3.4#803005)
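To illustrate the workaround above, the sketch shows the type mismatch behind the ClassCastException: for an Avro "fixed" field, parquet-avro expects a GenericFixed value, so a record whose field arrives as a Utf8 string fails the cast at write time. The record and field names are made up for illustration.
{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.util.Utf8;

public class FixedVsString {
  public static void main(String[] args) {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":["
            + "{\"name\":\"id\",\"type\":{\"type\":\"fixed\",\"name\":\"Id\",\"size\":4}}]}");
    Schema fixedSchema = schema.getField("id").schema();

    GenericData.Record ok = new GenericData.Record(schema);
    ok.put("id", new GenericData.Fixed(fixedSchema, new byte[] {1, 2, 3, 4})); // what the writer expects

    GenericData.Record bad = new GenericData.Record(schema);
    bad.put("id", new Utf8("1234")); // put() accepts it, but writing later fails the GenericFixed cast
  }
}
{code}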
[jira] [Comment Edited] (HUDI-718) java.lang.ClassCastException during upsert
[ https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072874#comment-17072874 ] lamber-ken edited comment on HUDI-718 at 4/1/20, 4:46 PM: -- hi [~afilipchik], can you share the schema of old parquet file? and the type of bla.bla field? was (Author: lamber-ken): hi [~afilipchik], can you share the schema of old parquet file? > java.lang.ClassCastException during upsert > -- > > Key: HUDI-718 > URL: https://issues.apache.org/jira/browse/HUDI-718 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: DeltaStreamer >Reporter: Alexander Filipchik >Assignee: lamber-ken >Priority: Major > Fix For: 0.6.0 > > Attachments: image-2020-03-21-16-49-28-905.png > > > Dataset was created using hudi 0.5 and now trying to migrate it to the latest > master. The table is written using SqlTransformer. Exception: > > Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge > old record into new file for key bla.bla from old file > gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet > to new file > gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423) > at > org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37) > at > org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be > cast to org.apache.avro.generic.GenericFixed > at > org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336) > at > org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275) > at > org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191) > at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) > at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299) > at > org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103) > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242) > ... 8 more -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HUDI-718) java.lang.ClassCastException during upsert
[ https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072874#comment-17072874 ] lamber-ken edited comment on HUDI-718 at 4/1/20, 4:46 PM: -- hi [~afilipchik], can you share the schema of old parquet file? and what's the type of bla.bla field? was (Author: lamber-ken): hi [~afilipchik], can you share the schema of old parquet file? and the type of bla.bla field? > java.lang.ClassCastException during upsert > -- > > Key: HUDI-718 > URL: https://issues.apache.org/jira/browse/HUDI-718 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: DeltaStreamer >Reporter: Alexander Filipchik >Assignee: lamber-ken >Priority: Major > Fix For: 0.6.0 > > Attachments: image-2020-03-21-16-49-28-905.png > > > Dataset was created using hudi 0.5 and now trying to migrate it to the latest > master. The table is written using SqlTransformer. Exception: > > Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge > old record into new file for key bla.bla from old file > gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet > to new file > gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423) > at > org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37) > at > org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be > cast to org.apache.avro.generic.GenericFixed > at > org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336) > at > org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275) > at > org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191) > at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) > at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299) > at > org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103) > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242) > ... 8 more -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-718) java.lang.ClassCastException during upsert
[ https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072874#comment-17072874 ] lamber-ken commented on HUDI-718: - hi [~afilipchik], can you share the schema of old parquet file? > java.lang.ClassCastException during upsert > -- > > Key: HUDI-718 > URL: https://issues.apache.org/jira/browse/HUDI-718 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: DeltaStreamer >Reporter: Alexander Filipchik >Assignee: lamber-ken >Priority: Major > Fix For: 0.6.0 > > Attachments: image-2020-03-21-16-49-28-905.png > > > Dataset was created using hudi 0.5 and now trying to migrate it to the latest > master. The table is written using SqlTransformer. Exception: > > Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge > old record into new file for key bla.bla from old file > gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet > to new file > gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423) > at > org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37) > at > org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be > cast to org.apache.avro.generic.GenericFixed > at > org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336) > at > org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275) > at > org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191) > at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) > at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299) > at > org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103) > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242) > ... 8 more -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-723) SqlTransformer's schema sometimes is not registered.
[ https://issues.apache.org/jira/browse/HUDI-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-723: --- Assignee: (was: lamber-ken) > SqlTransformer's schema sometimes is not registered. > - > > Key: HUDI-723 > URL: https://issues.apache.org/jira/browse/HUDI-723 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: DeltaStreamer >Reporter: Alexander Filipchik >Priority: Major > Fix For: 0.6.0 > > > If schema is inferred from RowBasedSchemaProvider when SQL transformer is > used it also needs to be registered. > > Current way only works if SchemaProvider has a valid target schema. If one > wants to use schema from SQL transformation, the result of > RowBasedSchemaProvider.getTargetSchema needs to be passed into something like: > {code:java} > private void setupWriteClient(SchemaProvider schemaProvider) { > LOG.info("Setting up Hoodie Write Client"); > registerAvroSchemas(schemaProvider); > HoodieWriteConfig hoodieCfg = getHoodieClientConfig(schemaProvider); > writeClient = new HoodieWriteClient<>(jssc, hoodieCfg, true); > onInitializingHoodieWriteClient.apply(writeClient); > } > {code} > The existing method will not work as it is checking for: > {code:java} > if ((null != schemaProvider) && (null == writeClient)) { > {code} > and writeClient is already configured. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean
[ https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken resolved HUDI-716. - Resolution: Fixed hi [~afilipchik] fixed at master branch now > Exception: Not an Avro data file when running HoodieCleanClient.runClean > > > Key: HUDI-716 > URL: https://issues.apache.org/jira/browse/HUDI-716 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: DeltaStreamer >Reporter: Alexander Filipchik >Assignee: lamber-ken >Priority: Major > Labels: pull-request-available > Fix For: 0.6.0 > > Attachments: image-2020-03-21-02-45-25-099.png, > image-2020-03-21-13-37-17-039.png > > Time Spent: 20m > Remaining Estimate: 0h > > Just upgraded to upstream master from 0.5 and seeing an issue at the end of > the delta sync run: > 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync > once. Shutting down20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error > running delta sync once. Shutting > downorg.apache.hudi.exception.HoodieIOException: Not an Avro data file at > org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144) > at > org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88) > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) > at > java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) > at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86) > at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843) > at > org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520) > at > org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168) > at > org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) > at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at > org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at > org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at > org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: > java.io.IOException: Not an Avro data file at > org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) at > org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147) > at > org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87) > at > org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141) > ... 
24 more > > It is attempting to read an old cleanup file (2 months old) and crashing > -- This message was sent by Atlassian Jira (v8.3.4#803005)
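For reference, the failing call is Avro's DataFileReader.openReader, which throws an IOException ("Not an Avro data file") when the file lacks the Avro magic header, as happens here for clean metadata written by an older release. Below is a minimal sketch of probing a file this way; the helper is illustrative, not the actual fix.
{code:java}
import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.FileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroFileCheck {
  // Returns true only when the file starts with the Avro container magic bytes.
  public static boolean isAvroDataFile(File file) {
    try (FileReader<GenericRecord> reader =
        DataFileReader.openReader(file, new GenericDatumReader<>())) {
      return true;
    } catch (IOException e) {
      return false; // "Not an Avro data file" lands here
    }
  }
}
{code}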
[jira] [Resolved] (HUDI-732) Generate site to content folder
[ https://issues.apache.org/jira/browse/HUDI-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken resolved HUDI-732. - Resolution: Fixed > Generate site to content folder > --- > > Key: HUDI-732 > URL: https://issues.apache.org/jira/browse/HUDI-732 > Project: Apache Hudi (incubating) > Issue Type: Task > Components: Docs >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Remove test-content && Generate site to content -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching
[ https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-686: Status: Open (was: New) > Implement BloomIndexV2 that does not depend on memory caching > - > > Key: HUDI-686 > URL: https://issues.apache.org/jira/browse/HUDI-686 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: Index, Performance >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 0.6.0 > > Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot > 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, > image-2020-03-19-10-17-43-048.png > > > The main goal here is to provide a much simpler index, without advanced > optimizations like auto-tuned parallelism/skew handling, but with a better > out-of-box experience for small workloads. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-742) Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
lamber-ken created HUDI-742: --- Summary: Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I Key: HUDI-742 URL: https://issues.apache.org/jira/browse/HUDI-742 Project: Apache Hudi (incubating) Issue Type: Bug Components: Spark Integration Reporter: lamber-ken *ISSUE* : https://github.com/apache/incubator-hudi/issues/1455 {code:java} at org.apache.hudi.client.HoodieWriteClient.upsert(HoodieWriteClient.java:193) at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:144) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:83) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) ... 
49 elided Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 44 in stage 11.0 failed 4 times, most recent failure: Lost task 44.3 in stage 11.0 (TID 975, ip-10-81-135-85.ec2.internal, executor 6): java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I at org.apache.hudi.index.bloom.BucketizedBloomCheckPartitioner.getPartition(BucketizedBloomCheckPartitioner.java:148) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2262) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2211) at
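On the root cause: Math.floorMod(long, int), descriptor (JI)I, only exists on JDK 9 and later, so classes compiled against a newer JDK (without a matching release target) throw NoSuchMethodError when run on a Java 8 JVM. A JDK-8-safe equivalent is sketched below; this illustrates the idea and is not necessarily the fix that was merged.
{code:java}
public class FloorModCompat {
  // Math.floorMod(long, long) has existed since JDK 8; for a positive int
  // modulus the result is always in [0, y), so the narrowing cast is safe.
  static int floorMod(long x, int y) {
    return (int) Math.floorMod(x, (long) y);
  }

  public static void main(String[] args) {
    System.out.println(floorMod(-7L, 3)); // prints 2
  }
}
{code}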
[jira] [Updated] (HUDI-742) Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
[ https://issues.apache.org/jira/browse/HUDI-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-742: Status: Open (was: New) > Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I > - > > Key: HUDI-742 > URL: https://issues.apache.org/jira/browse/HUDI-742 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Spark Integration >Reporter: lamber-ken >Priority: Major > > *ISSUE* : https://github.com/apache/incubator-hudi/issues/1455 > {code:java} > at org.apache.hudi.client.HoodieWriteClient.upsert(HoodieWriteClient.java:193) > at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206) > at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:144) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:83) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74) > at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) > ... 
49 elided > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 44 in stage 11.0 failed 4 times, most recent failure: Lost task 44.3 in > stage 11.0 (TID 975, ip-10-81-135-85.ec2.internal, executor 6): > java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I > at > org.apache.hudi.index.bloom.BucketizedBloomCheckPartitioner.getPartition(BucketizedBloomCheckPartitioner.java:148) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > at org.apache.spark.scheduler.Task.run(Task.scala:123) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966) > at >
[jira] [Updated] (HUDI-723) SqlTransformer's schema sometimes is not registered.
[ https://issues.apache.org/jira/browse/HUDI-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-723: Status: Open (was: New) > SqlTransformer's schema sometimes is not registered. > - > > Key: HUDI-723 > URL: https://issues.apache.org/jira/browse/HUDI-723 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: DeltaStreamer >Reporter: Alexander Filipchik >Assignee: lamber-ken >Priority: Major > Fix For: 0.6.0 > > > If schema is inferred from RowBasedSchemaProvider when SQL transformer is > used it also needs to be registered. > > Current way only works if SchemaProvider has a valid target schema. If one > wants to use schema from SQL transformation, the result of > RowBasedSchemaProvider.getTargetSchema needs to be passed into something like: > {code:java} > private void setupWriteClient(SchemaProvider schemaProvider) { > LOG.info("Setting up Hoodie Write Client"); > registerAvroSchemas(schemaProvider); > HoodieWriteConfig hoodieCfg = getHoodieClientConfig(schemaProvider); > writeClient = new HoodieWriteClient<>(jssc, hoodieCfg, true); > onInitializingHoodieWriteClient.apply(writeClient); > } > {code} > The existing method will not work as it is checking for: > {code:java} > if ((null != schemaProvider) && (null == writeClient)) { > {code} > and writeClient is already configured. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-723) SqlTransformer's schema sometimes is not registered.
[ https://issues.apache.org/jira/browse/HUDI-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-723: --- Assignee: lamber-ken > SqlTransformer's schema sometimes is not registered. > - > > Key: HUDI-723 > URL: https://issues.apache.org/jira/browse/HUDI-723 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: DeltaStreamer >Reporter: Alexander Filipchik >Assignee: lamber-ken >Priority: Major > Fix For: 0.6.0 > > > If schema is inferred from RowBasedSchemaProvider when SQL transformer is > used it also needs to be registered. > > Current way only works if SchemaProvider has a valid target schema. If one > wants to use schema from SQL transformation, the result of > RowBasedSchemaProvider.getTargetSchema needs to be passed into something like: > {code:java} > private void setupWriteClient(SchemaProvider schemaProvider) { > LOG.info("Setting up Hoodie Write Client"); > registerAvroSchemas(schemaProvider); > HoodieWriteConfig hoodieCfg = getHoodieClientConfig(schemaProvider); > writeClient = new HoodieWriteClient<>(jssc, hoodieCfg, true); > onInitializingHoodieWriteClient.apply(writeClient); > } > {code} > The existing method will not work as it is checking for: > {code:java} > if ((null != schemaProvider) && (null == writeClient)) { > {code} > and writeClient is already configured. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-718) java.lang.ClassCastException during upsert
[ https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-718: Status: Open (was: New) > java.lang.ClassCastException during upsert > -- > > Key: HUDI-718 > URL: https://issues.apache.org/jira/browse/HUDI-718 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: DeltaStreamer >Reporter: Alexander Filipchik >Assignee: lamber-ken >Priority: Major > Fix For: 0.6.0 > > Attachments: image-2020-03-21-16-49-28-905.png > > > Dataset was created using hudi 0.5 and now trying to migrate it to the latest > master. The table is written using SqlTransformer. Exception: > > Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge > old record into new file for key bla.bla from old file > gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet > to new file > gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423) > at > org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37) > at > org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be > cast to org.apache.avro.generic.GenericFixed > at > org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336) > at > org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275) > at > org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191) > at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) > at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299) > at > org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103) > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242) > ... 8 more -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-718) java.lang.ClassCastException during upsert
[ https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-718: --- Assignee: lamber-ken > java.lang.ClassCastException during upsert > -- > > Key: HUDI-718 > URL: https://issues.apache.org/jira/browse/HUDI-718 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: DeltaStreamer >Reporter: Alexander Filipchik >Assignee: lamber-ken >Priority: Major > Fix For: 0.6.0 > > Attachments: image-2020-03-21-16-49-28-905.png > > > Dataset was created using hudi 0.5 and now trying to migrate it to the latest > master. The table is written using SqlTransformer. Exception: > > Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge > old record into new file for key bla.bla from old file > gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet > to new file > gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423) > at > org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37) > at > org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be > cast to org.apache.avro.generic.GenericFixed > at > org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336) > at > org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275) > at > org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191) > at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) > at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299) > at > org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103) > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242) > ... 8 more -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-722) IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when writing parquet
[ https://issues.apache.org/jira/browse/HUDI-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069160#comment-17069160 ] lamber-ken commented on HUDI-722: - Sure. Hi [~afilipchik], if you have time, can you share more context about it, e.g. demo code? Thanks. > IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when > writing parquet > - > > Key: HUDI-722 > URL: https://issues.apache.org/jira/browse/HUDI-722 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Writer Core >Reporter: Alexander Filipchik >Priority: Major > Fix For: 0.6.0 > > > Some writes fail with java.lang.IndexOutOfBoundsException : Invalid array > range: X to X inside MessageColumnIORecordConsumer.addBinary call. > Specifically: getColumnWriter().write(value, r[currentLevel], > currentColumnIO.getDefinitionLevel()); > fails as size of r is the same as current level. What can be causing it? > > It gets executed via: ParquetWriter.write(IndexedRecord) Library version: > 1.10.1 Avro is a very complex object (~2.5k columns, highly nested, arrays of > unions present). > But what is surprising is that it fails to write top level field: > PrimitiveColumnIO _hoodie_commit_time r:0 d:1 [_hoodie_commit_time] which is > the first top level field in Avro: {"_hoodie_commit_time": "20200317215711", > "_hoodie_commit_seqno": "20200317215711_0_650", -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-722) IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when writing parquet
[ https://issues.apache.org/jira/browse/HUDI-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-722: --- Assignee: lamber-ken > IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when > writing parquet > - > > Key: HUDI-722 > URL: https://issues.apache.org/jira/browse/HUDI-722 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Writer Core >Reporter: Alexander Filipchik >Assignee: lamber-ken >Priority: Major > Fix For: 0.6.0 > > > Some writes fail with java.lang.IndexOutOfBoundsException : Invalid array > range: X to X inside MessageColumnIORecordConsumer.addBinary call. > Specifically: getColumnWriter().write(value, r[currentLevel], > currentColumnIO.getDefinitionLevel()); > fails as size of r is the same as current level. What can be causing it? > > It gets executed via: ParquetWriter.write(IndexedRecord) Library version: > 1.10.1 Avro is a very complex object (~2.5k columns, highly nested, arrays of > unions present). > But what is surprising is that it fails to write top level field: > PrimitiveColumnIO _hoodie_commit_time r:0 d:1 [_hoodie_commit_time] which is > the first top level field in Avro: {"_hoodie_commit_time": "20200317215711", > "_hoodie_commit_seqno": "20200317215711_0_650", -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-722) IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when writing parquet
[ https://issues.apache.org/jira/browse/HUDI-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-722: Status: Open (was: New) > IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when > writing parquet > - > > Key: HUDI-722 > URL: https://issues.apache.org/jira/browse/HUDI-722 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Writer Core >Reporter: Alexander Filipchik >Priority: Major > Fix For: 0.6.0 > > > Some writes fail with java.lang.IndexOutOfBoundsException : Invalid array > range: X to X inside MessageColumnIORecordConsumer.addBinary call. > Specifically: getColumnWriter().write(value, r[currentLevel], > currentColumnIO.getDefinitionLevel()); > fails as size of r is the same as current level. What can be causing it? > > It gets executed via: ParquetWriter.write(IndexedRecord) Library version: > 1.10.1 Avro is a very complex object (~2.5k columns, highly nested, arrays of > unions present). > But what is surprising is that it fails to write top level field: > PrimitiveColumnIO _hoodie_commit_time r:0 d:1 [_hoodie_commit_time] which is > the first top level field in Avro: {"_hoodie_commit_time": "20200317215711", > "_hoodie_commit_seqno": "20200317215711_0_650", -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-734) Fix error: cannot create directory ‘test-content’: File exists
[ https://issues.apache.org/jira/browse/HUDI-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken resolved HUDI-734. - Resolution: Fixed > Fix error: cannot create directory ‘test-content’: File exists > -- > > Key: HUDI-734 > URL: https://issues.apache.org/jira/browse/HUDI-734 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Docs >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Fix error: cannot create directory ‘test-content’: File exists -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-734) Fix error: cannot create directory ‘test-content’: File exists
[ https://issues.apache.org/jira/browse/HUDI-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-734: Status: Open (was: New) > Fix error: cannot create directory ‘test-content’: File exists > -- > > Key: HUDI-734 > URL: https://issues.apache.org/jira/browse/HUDI-734 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Docs >Reporter: lamber-ken >Priority: Major > > Fix error: cannot create directory ‘test-content’: File exists -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-734) Fix error: cannot create directory ‘test-content’: File exists
[ https://issues.apache.org/jira/browse/HUDI-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-734: --- Assignee: lamber-ken > Fix error: cannot create directory ‘test-content’: File exists > -- > > Key: HUDI-734 > URL: https://issues.apache.org/jira/browse/HUDI-734 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Docs >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > > Fix error: cannot create directory ‘test-content’: File exists -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-734) Fix error: cannot create directory ‘test-content’: File exists
lamber-ken created HUDI-734: --- Summary: Fix error: cannot create directory ‘test-content’: File exists Key: HUDI-734 URL: https://issues.apache.org/jira/browse/HUDI-734 Project: Apache Hudi (incubating) Issue Type: Bug Components: Docs Reporter: lamber-ken Fix error: cannot create directory ‘test-content’: File exists -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-732) Generate site to content folder
[ https://issues.apache.org/jira/browse/HUDI-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-732: Status: Open (was: New) > Generate site to content folder > --- > > Key: HUDI-732 > URL: https://issues.apache.org/jira/browse/HUDI-732 > Project: Apache Hudi (incubating) > Issue Type: Task > Components: Docs >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > > Remove test-content && Generate site to content -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-732) Generate site to content folder
lamber-ken created HUDI-732: --- Summary: Generate site to content folder Key: HUDI-732 URL: https://issues.apache.org/jira/browse/HUDI-732 Project: Apache Hudi (incubating) Issue Type: Task Components: Docs Reporter: lamber-ken Remove test-content && Generate site to content -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-732) Generate site to content folder
[ https://issues.apache.org/jira/browse/HUDI-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-732: --- Assignee: lamber-ken > Generate site to content folder > --- > > Key: HUDI-732 > URL: https://issues.apache.org/jira/browse/HUDI-732 > Project: Apache Hudi (incubating) > Issue Type: Task > Components: Docs >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > > Remove test-content && Generate site to content -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching
[ https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065030#comment-17065030 ] lamber-ken edited comment on HUDI-686 at 3/24/20, 5:41 AM: --- Right, this is a nice design. Some thoughts:
 * if the input data is large, the number of partitions needs to be increased, and "candidates" contains all the data for each partition
 * increasing the partitions will cause duplicate loading of the same partition (e.g. populateFileIDs() && populateRangeAndBloomFilters())

[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]
{code:java}
@Override
public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD, JavaSparkContext jsc, HoodieTable<T> hoodieTable) {
  return recordRDD.sortBy((record) -> String.format("%s-%s", record.getPartitionPath(), record.getRecordKey()), true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyRangeBloomChecker(itr, hoodieTable))
      .flatMap(List::iterator)
      .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
      .filter(Option::isPresent)
      .map(Option::get);
}
{code}
{code:java}
private void initIfNeeded(String partitionPath) throws IOException {
  if (!Objects.equals(partitionPath, currentPartitionPath)) {
    cleanup();
    this.currentPartitionPath = partitionPath;
    populateFileIDs();
    populateRangeAndBloomFilters();
  }
}
{code}
was (Author: lamber-ken): Right, this is a nice design. Some thoughts:
 * if the input data is large, the number of partitions needs to be increased, and "candidates" contains all the partition data
 * increasing the partitions will cause duplicate loading of the same partition (e.g. populateFileIDs() && populateRangeAndBloomFilters())

[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]
{code:java}
@Override
public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD, JavaSparkContext jsc, HoodieTable<T> hoodieTable) {
  return recordRDD.sortBy((record) -> String.format("%s-%s", record.getPartitionPath(), record.getRecordKey()), true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyRangeBloomChecker(itr, hoodieTable))
      .flatMap(List::iterator)
      .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
      .filter(Option::isPresent)
      .map(Option::get);
}
{code}
{code:java}
private void initIfNeeded(String partitionPath) throws IOException {
  if (!Objects.equals(partitionPath, currentPartitionPath)) {
    cleanup();
    this.currentPartitionPath = partitionPath;
    populateFileIDs();
    populateRangeAndBloomFilters();
  }
}
{code}
> Implement BloomIndexV2 that does not depend on memory caching > - > > Key: HUDI-686 > URL: https://issues.apache.org/jira/browse/HUDI-686 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: Index, Performance >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 0.6.0 > > Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot > 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, > image-2020-03-19-10-17-43-048.png > > > The main goal here is to provide a much simpler index, without advanced > optimizations like auto-tuned parallelism/skew handling, but with a better > out-of-box experience for small workloads. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (HUDI-504) Restructuring and auto-generation of docs
[ https://issues.apache.org/jira/browse/HUDI-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken closed HUDI-504. --- Resolution: Fixed > Restructuring and auto-generation of docs > - > > Key: HUDI-504 > URL: https://issues.apache.org/jira/browse/HUDI-504 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: Docs >Reporter: Ethan Guo >Assignee: lamber-ken >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > RFC-10: Restructuring and auto-generation of docs > [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+10+%3A+Restructuring+and+auto-generation+of+docs] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (HUDI-646) Re-enable TestUpdateSchemaEvolution after triaging weird CI issue
[ https://issues.apache.org/jira/browse/HUDI-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken closed HUDI-646. --- Resolution: Fixed > Re-enable TestUpdateSchemaEvolution after triaging weird CI issue > - > > Key: HUDI-646 > URL: https://issues.apache.org/jira/browse/HUDI-646 > Project: Apache Hudi (incubating) > Issue Type: Test > Components: Testing >Reporter: Vinoth Chandar >Assignee: lamber-ken >Priority: Major > Labels: pull-request-available > Fix For: 0.6.0 > > Time Spent: 20m > Remaining Estimate: 0h > > https://github.com/apache/incubator-hudi/pull/1346/commits/5b20891619380a66e2a62c9e57fb28c4f5ed948b > undo this > {code} > Job aborted due to stage failure: Task 7 in stage 1.0 failed 1 times, most > recent failure: Lost task 7.0 in stage 1.0 (TID 15, localhost, executor > driver): org.apache.parquet.io.ParquetDecodingException: Can not read value > at 0 in block -1 in file > file:/tmp/junit3406952253616234024/2016/01/31/f1-0_7-0-7_100.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251) > at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132) > at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136) > at > org.apache.hudi.common.util.ParquetUtils.readAvroRecords(ParquetUtils.java:190) > at > org.apache.hudi.client.TestUpdateSchemaEvolution.lambda$testSchemaEvolutionOnUpdate$dfb2f24e$1(TestUpdateSchemaEvolution.java:123) > at > org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1334) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1334) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:123) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.UnsupportedOperationException: Byte-buffer read > unsupported by input stream > at > 
org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:146) > at > org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:143) > at > org.apache.parquet.hadoop.util.H2SeekableInputStream$H2Reader.read(H2SeekableInputStream.java:81) > at > org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:90) > at > org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:75) > at > org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222) > ... 29 more > {code} > Only happens on travis. Locally succeeded over 5000 times individually.. And > the
[jira] [Commented] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching
[ https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065030#comment-17065030 ] lamber-ken commented on HUDI-686: - Right, this is a nice design; some thoughts:
* if the input data is large, we need to increase the number of partitions, since "candidates" holds all of a partition's data
* if we increase the number of partitions, it causes duplicate loading of the same partition (e.g. populateFileIDs() and populateRangeAndBloomFilters())

[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]

{code:java}
@Override
public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD,
    JavaSparkContext jsc, HoodieTable<T> hoodieTable) {
  return recordRDD
      .sortBy((record) -> String.format("%s-%s", record.getPartitionPath(), record.getRecordKey()),
          true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyRangeBloomChecker(itr, hoodieTable))
      .flatMap(List::iterator)
      .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
      .filter(Option::isPresent)
      .map(Option::get);
}
{code}

{code:java}
private void initIfNeeded(String partitionPath) throws IOException {
  if (!Objects.equals(partitionPath, currentPartitionPath)) {
    cleanup();
    this.currentPartitionPath = partitionPath;
    populateFileIDs();
    populateRangeAndBloomFilters();
  }
}
{code}

> Implement BloomIndexV2 that does not depend on memory caching > - > > Key: HUDI-686 > URL: https://issues.apache.org/jira/browse/HUDI-686 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: Index, Performance >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 0.6.0 > > Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot > 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, > image-2020-03-19-10-17-43-048.png > > > The main goal here is to provide a much simpler index, without advanced > optimizations like auto-tuned parallelism/skew handling, but with a better > out-of-the-box experience for small workloads. -- This message was sent by Atlassian Jira (v8.3.4#803005)
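On the duplicate-loading concern above: because the record RDD is range-partitioned by partition path and key, two Spark partitions can both begin inside the same Hudi partition path, and each would run populateFileIDs() and populateRangeAndBloomFilters() again. A minimal sketch of one way to bound that cost with a small LRU cache keyed by partition path; PartitionMetadataCache and PartitionMetadata are hypothetical names, not part of the linked branch:

{code:java}
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionMetadataCache {

  // Hypothetical holder for per-partition file ids, key ranges, bloom filters.
  static class PartitionMetadata { }

  private final int maxEntries;
  private final Map<String, PartitionMetadata> cache;

  PartitionMetadataCache(int maxEntries) {
    this.maxEntries = maxEntries;
    // Access-ordered LinkedHashMap evicting the least recently used entry.
    this.cache = new LinkedHashMap<String, PartitionMetadata>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, PartitionMetadata> eldest) {
        return size() > PartitionMetadataCache.this.maxEntries;
      }
    };
  }

  // One load per partition path while the entry stays cached, instead of one
  // load per Spark partition that happens to touch the path.
  synchronized PartitionMetadata get(String partitionPath) throws IOException {
    PartitionMetadata meta = cache.get(partitionPath);
    if (meta == null) {
      meta = load(partitionPath);
      cache.put(partitionPath, meta);
    }
    return meta;
  }

  private PartitionMetadata load(String partitionPath) throws IOException {
    // Would call populateFileIDs() / populateRangeAndBloomFilters() here.
    return new PartitionMetadata();
  }
}
{code}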
[jira] [Assigned] (HUDI-730) Fix the ci error log message
[ https://issues.apache.org/jira/browse/HUDI-730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-730: --- Assignee: lamber-ken > Fix the ci error log message > > > Key: HUDI-730 > URL: https://issues.apache.org/jira/browse/HUDI-730 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Testing >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > > {code:java} > [ERROR] 2020-03-21 17:30:24,613 > org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Got error > running preferred function. Trying secondary > java.lang.RuntimeException > at > org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:168) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) > at org.junit.runners.ParentRunner.run(ParentRunner.java:309) > at > org.mockito.internal.runners.JUnit45AndHigherRunnerImpl.run(JUnit45AndHigherRunnerImpl.java:37) > at org.mockito.runners.MockitoJUnitRunner.run(MockitoJUnitRunner.java:62) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:367) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:274) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:161) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:290) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:242) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:121) > [WARN ] 2020-03-21 17:30:24,628 > org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Routing > request to secondary file-system view > [WARN ] 2020-03-21 17:30:24,630 > org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Routing > request to secondary file-system view > [ERROR] 2020-03-21 17:30:24,638 > org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Got error > running preferred function. Trying secondary > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-730) Fix the ci error log message
lamber-ken created HUDI-730: --- Summary: Fix the ci error log message Key: HUDI-730 URL: https://issues.apache.org/jira/browse/HUDI-730 Project: Apache Hudi (incubating) Issue Type: Bug Components: Testing Reporter: lamber-ken

{code:java}
[ERROR] 2020-03-21 17:30:24,613 org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Got error running preferred function. Trying secondary
java.lang.RuntimeException
  at org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:168)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
  at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
  at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
  at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
  at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
  at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
  at org.mockito.internal.runners.JUnit45AndHigherRunnerImpl.run(JUnit45AndHigherRunnerImpl.java:37)
  at org.mockito.runners.MockitoJUnitRunner.run(MockitoJUnitRunner.java:62)
  at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:367)
  at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:274)
  at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
  at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:161)
  at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:290)
  at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:242)
  at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:121)
[WARN ] 2020-03-21 17:30:24,628 org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Routing request to secondary file-system view
[WARN ] 2020-03-21 17:30:24,630 org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Routing request to secondary file-system view
[ERROR] 2020-03-21 17:30:24,638 org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Got error running preferred function. Trying secondary
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-730) Fix the ci error log message
[ https://issues.apache.org/jira/browse/HUDI-730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-730: Status: Open (was: New) > Fix the ci error log message > > > Key: HUDI-730 > URL: https://issues.apache.org/jira/browse/HUDI-730 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Testing >Reporter: lamber-ken >Priority: Major > > {code:java} > [ERROR] 2020-03-21 17:30:24,613 > org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Got error > running preferred function. Trying secondary > java.lang.RuntimeException > at > org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:168) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) > at org.junit.runners.ParentRunner.run(ParentRunner.java:309) > at > org.mockito.internal.runners.JUnit45AndHigherRunnerImpl.run(JUnit45AndHigherRunnerImpl.java:37) > at org.mockito.runners.MockitoJUnitRunner.run(MockitoJUnitRunner.java:62) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:367) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:274) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:161) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:290) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:242) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:121) > [WARN ] 2020-03-21 17:30:24,628 > org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Routing > request to secondary file-system view > [WARN ] 2020-03-21 17:30:24,630 > org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Routing > request to secondary file-system view > [ERROR] 2020-03-21 17:30:24,638 > org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Got error > running preferred function. Trying secondary > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
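The quoted messages come from a try-the-preferred-view, fall-back-to-secondary pattern: the failure of the preferred (remote) call is logged at ERROR with its stack trace before the request is routed to the secondary (local) view. A minimal sketch of that pattern, with execute as a hypothetical helper rather than the actual PriorityBasedFileSystemView internals; it shows why a test that deliberately fails the preferred call still prints a full ERROR trace, which is the noisy CI output this issue targets:

{code:java}
import java.util.function.Supplier;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PreferredWithFallback {

  private static final Logger LOG = LoggerFactory.getLogger(PreferredWithFallback.class);

  // Try the preferred function; on failure, log and route to the secondary.
  // When a test injects a RuntimeException into the preferred path, this
  // prints the ERROR-plus-stack-trace lines seen in the CI log above even
  // though the failure is fully expected.
  static <T> T execute(Supplier<T> preferred, Supplier<T> secondary) {
    try {
      return preferred.get();
    } catch (RuntimeException e) {
      LOG.error("Got error running preferred function. Trying secondary", e);
      LOG.warn("Routing request to secondary file-system view");
      return secondary.get();
    }
  }
}
{code}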
[jira] [Updated] (HUDI-729) Replace JavaSparkContext/SQLContext with SparkSession
[ https://issues.apache.org/jira/browse/HUDI-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-729: Status: In Progress (was: Open) > Replace JavaSparkContext/SQLContext with SparkSession > - > > Key: HUDI-729 > URL: https://issues.apache.org/jira/browse/HUDI-729 > Project: Apache Hudi (incubating) > Issue Type: Sub-task > Components: Code Cleanup >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > > Replace JavaSparkContext/SQLContext with SparkSession. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-729) Replace JavaSparkContext/SQLContext with SparkSession
[ https://issues.apache.org/jira/browse/HUDI-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-729: Status: Open (was: New) > Replace JavaSparkContext/SQLContext with SparkSession > - > > Key: HUDI-729 > URL: https://issues.apache.org/jira/browse/HUDI-729 > Project: Apache Hudi (incubating) > Issue Type: Sub-task > Components: Code Cleanup >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > > Replace JavaSparkContext/SQLContext with SparkSession. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-729) Replace JavaSparkContext/SQLContext with SparkSession
lamber-ken created HUDI-729: --- Summary: Replace JavaSparkContext/SQLContext with SparkSession Key: HUDI-729 URL: https://issues.apache.org/jira/browse/HUDI-729 Project: Apache Hudi (incubating) Issue Type: Sub-task Components: Code Cleanup Reporter: lamber-ken Replace JavaSparkContext/SQLContext with SparkSession. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-729) Replace JavaSparkContext/SQLContext with SparkSession
[ https://issues.apache.org/jira/browse/HUDI-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-729: --- Assignee: lamber-ken > Replace JavaSparkContext/SQLContext with SparkSession > - > > Key: HUDI-729 > URL: https://issues.apache.org/jira/browse/HUDI-729 > Project: Apache Hudi (incubating) > Issue Type: Sub-task > Components: Code Cleanup >Reporter: lamber-ken >Assignee: lamber-ken >Priority: Major > > Replace JavaSparkContext/SQLContext with SparkSession. -- This message was sent by Atlassian Jira (v8.3.4#803005)
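A minimal sketch of the replacement described above, assuming only the stock Spark 2.x API: one SparkSession supplies the JavaSparkContext and SQLContext that older call sites still expect, so those entry points no longer need to be constructed separately.

{code:java}
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class SparkSessionMigration {
  public static void main(String[] args) {
    // One SparkSession replaces separately constructed JavaSparkContext/SQLContext.
    SparkSession spark = SparkSession.builder()
        .appName("hudi-example")
        .master("local[2]")
        .getOrCreate();

    // Legacy call sites can still derive the old entry points from the session:
    JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
    System.out.println(jsc.appName());       // same underlying SparkContext
    System.out.println(spark.sqlContext());  // SQLContext kept for compatibility

    spark.stop();
  }
}
{code}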
[jira] [Assigned] (HUDI-401) Remove unnecessary use of spark in savepoint timeline
[ https://issues.apache.org/jira/browse/HUDI-401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken reassigned HUDI-401: --- Assignee: hong dongdong > Remove unnecessary use of spark in savepoint timeline > - > > Key: HUDI-401 > URL: https://issues.apache.org/jira/browse/HUDI-401 > Project: Apache Hudi (incubating) > Issue Type: Sub-task > Components: CLI, Writer Core >Reporter: hong dongdong >Assignee: hong dongdong >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Currently, a JavaSparkContext is initialized when a savepoint is created, but it is not > necessary. The JavaSparkContext's only job is to provide the Hadoop configuration, but it > needs time and resources to initialize. > So we can use the Hadoop configuration directly instead of the JavaSparkContext. -- This message was sent by Atlassian Jira (v8.3.4#803005)
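A minimal sketch of the change described above: pass a Hadoop Configuration straight into the savepoint code path instead of constructing a JavaSparkContext just to call hadoopConfiguration(). The method name createSavepoint is hypothetical, for illustration only.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SavepointWithoutSpark {

  // Before (conceptually): a JavaSparkContext was created only so the code
  // could call jsc.hadoopConfiguration(). Accepting the Configuration
  // directly avoids the time and resources spent starting Spark.
  static void createSavepoint(String basePath, Configuration hadoopConf) throws IOException {
    FileSystem fs = FileSystem.get(new Path(basePath).toUri(), hadoopConf);
    // ... inspect and update the timeline under basePath/.hoodie via fs ...
    fs.exists(new Path(basePath, ".hoodie"));
  }
}
{code}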
[jira] [Commented] (HUDI-719) Exception during clean phase: Found org.apache.hudi.avro.model.HoodieCleanMetadata, expecting org.apache.hudi.avro.model.HoodieCleanerPlan
[ https://issues.apache.org/jira/browse/HUDI-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063816#comment-17063816 ] lamber-ken commented on HUDI-719: - [~afilipchik] Thanks for reporting these issues while upgrading from 0.5.0 to the master branch. If you are interested in fixing them, you can ask [~vinoth] to give you contributor permission. :) > Exception during clean phase: Found > org.apache.hudi.avro.model.HoodieCleanMetadata, expecting > org.apache.hudi.avro.model.HoodieCleanerPlan > -- > > Key: HUDI-719 > URL: https://issues.apache.org/jira/browse/HUDI-719 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: DeltaStreamer >Reporter: Alexander Filipchik >Priority: Major > Fix For: 0.6.0 > > > Dataset is written using 0.5, moving to the latest master: > > Exception in thread "main" org.apache.avro.AvroTypeException: Found > org.apache.hudi.avro.model.HoodieCleanMetadata, expecting > org.apache.hudi.avro.model.HoodieCleanerPlan, missing required field policy > at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292) > at org.apache.avro.io.parsing.Parser.advance(Parser.java:88) > at > org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130) > at > org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215) > at > org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145) > at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233) > at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220) > at > org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:149) > at > org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87) > at > org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141) > at > org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88) > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) > at > java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) > at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86) > at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843) > at > org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520) > at > org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168) > at > org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:397) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at >
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) > at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian Jira (v8.3.4#803005)
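The root mismatch above: a .clean file written by 0.5.0 stores HoodieCleanMetadata, while the new reader asks Avro to resolve it as HoodieCleanerPlan, which fails on the missing required field policy. A minimal sketch of one backward-compatible approach, trying the new class and falling back to the old one; this is an illustration of the idea, not the fix Hudi actually shipped:

{code:java}
import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.avro.AvroRuntimeException;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificRecordBase;

public class VersionTolerantAvroRead {

  // Deserialize a single-record Avro container file as the given class.
  static <T extends SpecificRecordBase> T read(byte[] bytes, Class<T> clazz) throws IOException {
    try (DataFileStream<T> stream = new DataFileStream<>(
        new ByteArrayInputStream(bytes), new SpecificDatumReader<>(clazz))) {
      if (!stream.hasNext()) {
        throw new IOException("Empty metadata file");
      }
      return stream.next();
    }
  }

  // Try the new schema first (e.g. HoodieCleanerPlan); if schema resolution
  // fails, retry with the legacy class (e.g. HoodieCleanMetadata from 0.5.0).
  static SpecificRecordBase readCompat(byte[] bytes,
      Class<? extends SpecificRecordBase> newClazz,
      Class<? extends SpecificRecordBase> oldClazz) throws IOException {
    try {
      return read(bytes, newClazz);
    } catch (AvroRuntimeException e) {
      return read(bytes, oldClazz);
    }
  }
}
{code}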
[jira] [Commented] (HUDI-718) java.lang.ClassCastException during upsert
[ https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063813#comment-17063813 ] lamber-ken commented on HUDI-718: - Hi [~afilipchik], the master version uses spark-2.4.4, which depends on avro-1.8.2. Here is a PR [1] which fixes a similar issue; if you are interested, you can have a try. [1] [https://github.com/apache/incubator-hudi/pull/1339] !image-2020-03-21-16-49-28-905.png|width=790,height=631! > java.lang.ClassCastException during upsert > -- > > Key: HUDI-718 > URL: https://issues.apache.org/jira/browse/HUDI-718 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: DeltaStreamer >Reporter: Alexander Filipchik >Priority: Major > Fix For: 0.6.0 > > Attachments: image-2020-03-21-16-49-28-905.png > > > Dataset was created using hudi 0.5 and now trying to migrate it to the latest > master. The table is written using SqlTransformer. Exception: > > Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge > old record into new file for key bla.bla from old file > gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet > to new file > gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423) > at > org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37) > at > org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be > cast to org.apache.avro.generic.GenericFixed > at > org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336) > at > org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275) > at > org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191) > at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) > at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299) > at > org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103) > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242) > ... 8 more -- This message was sent by Atlassian Jira (v8.3.4#803005)
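For context on the ClassCastException above: under Avro 1.8.x, a decimal logical type backed by an Avro fixed must carry a GenericFixed value; if an upstream transform leaves the field as a plain string, parquet-avro's cast fails exactly as in the trace. A minimal sketch of the correct encoding (the schema name and sizes here are hypothetical, and this illustrates the failure mode rather than Hudi's eventual fix):

{code:java}
import java.math.BigDecimal;

import org.apache.avro.Conversions;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericFixed;

public class DecimalFixedEncoding {

  public static void main(String[] args) {
    // A fixed(16) field annotated as decimal(38, 10).
    Schema fixedSchema = SchemaBuilder.fixed("amount").size(16);
    LogicalTypes.decimal(38, 10).addToSchema(fixedSchema);

    // Correct: encode the BigDecimal into the fixed representation. Writing
    // this GenericFixed through parquet-avro succeeds.
    GenericFixed good = new Conversions.DecimalConversion()
        .toFixed(new BigDecimal("12.3456789000"), fixedSchema, fixedSchema.getLogicalType());
    System.out.println(good.bytes().length); // 16

    // Incorrect: putting the raw string "12.3456789000" into the field leaves
    // an org.apache.avro.util.Utf8 that parquet-avro later tries to cast to
    // GenericFixed -- the exact ClassCastException in the stack trace above.
  }
}
{code}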
[jira] [Updated] (HUDI-718) java.lang.ClassCastException during upsert
[ https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lamber-ken updated HUDI-718: Attachment: image-2020-03-21-16-49-28-905.png > java.lang.ClassCastException during upsert > -- > > Key: HUDI-718 > URL: https://issues.apache.org/jira/browse/HUDI-718 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: DeltaStreamer >Reporter: Alexander Filipchik >Priority: Major > Fix For: 0.6.0 > > Attachments: image-2020-03-21-16-49-28-905.png > > > Dataset was created using hudi 0.5 and now trying to migrate it to the latest > master. The table is written using SqlTransformer. Exception: > > Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge > old record into new file for key bla.bla from old file > gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet > to new file > gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433) > at > org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423) > at > org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37) > at > org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be > cast to org.apache.avro.generic.GenericFixed > at > org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336) > at > org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275) > at > org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191) > at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) > at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299) > at > org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103) > at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242) > ... 8 more -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean
[ https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063775#comment-17063775 ] lamber-ken commented on HUDI-716: - The life cycle of *.clean files: (based on hudi-0.5.0) [https://github.com/apache/incubator-hudi/blob/release-0.5.0|https://github.com/apache/incubator-hudi/blob/release-0.5.0/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java] !image-2020-03-21-13-37-17-039.png! > Exception: Not an Avro data file when running HoodieCleanClient.runClean > > > Key: HUDI-716 > URL: https://issues.apache.org/jira/browse/HUDI-716 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: DeltaStreamer >Reporter: Alexander Filipchik >Assignee: lamber-ken >Priority: Major > Fix For: 0.6.0 > > Attachments: image-2020-03-21-02-45-25-099.png, > image-2020-03-21-13-37-17-039.png > > > Just upgraded to upstream master from 0.5 and seeing an issue at the end of > the delta sync run: > 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync > once. Shutting down > 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error > running delta sync once. Shutting down > org.apache.hudi.exception.HoodieIOException: Not an Avro data file at > org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144) > at > org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88) > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) > at > java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) > at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86) > at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843) > at > org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520) > at > org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168) > at > org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) > at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at > org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at > org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at > org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: > java.io.IOException: Not an Avro data file at > org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) at >
org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147) > at > org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87) > at > org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141) > ... 24 more > > It is attempting to read an old cleanup file (2 months old) and crashing. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
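The crash above is Avro's container-file reader rejecting a file that lacks the Avro magic header, for example a zero-length or legacy-format .clean file. A minimal sketch of a defensive probe (looksLikeAvroDataFile is a hypothetical helper) that would let a caller skip or migrate such files instead of failing:

{code:java}
import java.util.Arrays;

import org.apache.avro.file.DataFileConstants;

public class AvroMagicCheck {

  // Avro container files start with the 4-byte magic {'O', 'b', 'j', 1}.
  // DataFileReader.openReader throws "Not an Avro data file" when it is
  // missing, which is the IOException wrapped in the trace above.
  static boolean looksLikeAvroDataFile(byte[] content) {
    int magicLen = DataFileConstants.MAGIC.length;
    return content.length >= magicLen
        && Arrays.equals(Arrays.copyOfRange(content, 0, magicLen), DataFileConstants.MAGIC);
  }
}
{code}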