[
https://issues.apache.org/jira/browse/HUDI-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713390#comment-17713390
]
StarBoy1005 edited comment on HUDI-4459 at 4/18/23 6:32 AM:
------------------------------------------------------------
Hi! I met a problem. I am using Flink 1.14.5 and Hudi 0.13.0 to read a CSV file from
HDFS and sink it into a Hudi COW table. In both streaming mode and batch mode, if I
use bulk_insert the job can't finish; the instant always stays in the inflight state.
This is my COW table DDL:
create table web_returns_cow (
rid bigint PRIMARY KEY NOT ENFORCED,
wr_returned_date_sk bigint,
wr_returned_time_sk bigint,
wr_item_sk bigint,
wr_refunded_customer_sk bigint,
wr_refunded_cdemo_sk bigint,
wr_refunded_hdemo_sk bigint,
wr_refunded_addr_sk bigint,
wr_returning_customer_sk bigint,
wr_returning_cdemo_sk bigint,
wr_returning_hdemo_sk bigint,
wr_returning_addr_sk bigint,
wr_web_page_sk bigint,
wr_reason_sk bigint,
wr_order_number bigint,
wr_return_quantity int,
wr_return_amt float,
wr_return_tax float,
wr_return_amt_inc_tax float,
wr_fee float,
wr_return_ship_cost float,
wr_refunded_cash float,
wr_reversed_charge float,
wr_account_credit float,
wr_net_loss float
)
PARTITIONED BY (`wr_returned_date_sk`)
WITH (
'connector'='hudi',
'path'='/tmp/data_gen/web_returns_cow',
'table.type'='COPY_ON_WRITE',
'read.start-commit'='earliest',
'read.streaming.enabled'='false',
'changelog.enabled'='true',
'write.precombine'='false',
'write.precombine.field'='no_precombine',
'write.operation'='bulk_insert',
'read.tasks'='5',
'write.tasks'='10',
'index.type'='BUCKET',
'metadata.enabled'='false',
'hoodie.bucket.index.hash.field'='rid',
'hoodie.bucket.index.num.buckets'='10',
'index.global.enabled'='false'
);
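The CSV source table and the insert statement are not shown in the comment; below is a
minimal sketch of the flow being described, assuming a filesystem CSV source whose name,
path, and format options are illustrative rather than taken from the actual job:
-- Hypothetical CSV source over the HDFS file (table name, path, and format options are assumptions)
create table web_returns_csv (
rid bigint,
wr_returned_date_sk bigint,
wr_returned_time_sk bigint,
wr_item_sk bigint,
wr_refunded_customer_sk bigint,
wr_refunded_cdemo_sk bigint,
wr_refunded_hdemo_sk bigint,
wr_refunded_addr_sk bigint,
wr_returning_customer_sk bigint,
wr_returning_cdemo_sk bigint,
wr_returning_hdemo_sk bigint,
wr_returning_addr_sk bigint,
wr_web_page_sk bigint,
wr_reason_sk bigint,
wr_order_number bigint,
wr_return_quantity int,
wr_return_amt float,
wr_return_tax float,
wr_return_amt_inc_tax float,
wr_fee float,
wr_return_ship_cost float,
wr_refunded_cash float,
wr_reversed_charge float,
wr_account_credit float,
wr_net_loss float
)
WITH (
'connector'='filesystem',
'path'='hdfs:///tmp/data_gen/web_returns.csv',
'format'='csv'
);
-- Bulk insert into the Hudi COW table defined above; with 'write.operation'='bulk_insert'
-- this is the step whose instant reportedly stays in the inflight state
insert into web_returns_cow select * from web_returns_csv;
-- Batch read back to verify the data once the commit completes
select count(*) from web_returns_cow;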
> Corrupt parquet file created when syncing huge table with 4000+ fields, using
> hudi cow table with bulk_insert type
> -----------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-4459
> URL: https://issues.apache.org/jira/browse/HUDI-4459
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Leo zhang
> Assignee: Rajesh Mahindra
> Priority: Major
> Attachments: statements.sql, table.ddl
>
>
> I am trying to sync a huge table with 4000+ fields into Hudi, using a COW table
> with the bulk_insert operation type.
> The job finishes without any exception, but when I try to read data from the
> table I get an empty result. The parquet file is corrupted and can't be read
> correctly.
> I tried to trace the problem and found it is caused by the SortOperator.
> After a record is serialized in the sorter, all the fields get out of order and
> are deserialized into one field. The wrong record is then written into the
> parquet file, which makes the file unreadable.
> Here are a few steps to reproduce the bug in the Flink sql-client:
> 1. Execute the table DDL (provided in the table.ddl file in the attachments).
> 2. Execute the insert statement (provided in the statements.sql file in the
> attachments).
> 3. Execute a select statement to query the Hudi table (provided in the
> statements.sql file in the attachments).
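Since the description pins the corruption on the SortOperator in the bulk_insert write
path, a minimal diagnostic sketch is to rerun the insert with the sorter turned off and
compare the read result. The option key 'write.bulk_insert.sort_input' is taken from
Hudi's Flink options as I understand them and should be verified against the release in
use; the table and source names reuse the hypothetical web_returns example above.
-- Diagnostic rerun with the bulk_insert sorter disabled; the option key is an
-- assumption and should be checked against the Hudi version in use
alter table web_returns_cow set ('write.bulk_insert.sort_input'='false');
insert into web_returns_cow select * from web_returns_csv;
-- If the SortOperator is indeed the cause, this read should no longer come back empty
select count(*) from web_returns_cow;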
--
This message was sent by Atlassian Jira
(v8.20.10#820010)