hudi-bot opened a new issue, #15296:
URL: https://github.com/apache/hudi/issues/15296

   I am trying to sync a huge table with 4000+ fields into Hudi, using a COW 
table with the bulk_insert operation type.
   
   The job finishes without any exception, but when I try to read data 
from the table, I get an empty result. The Parquet file is corrupted and can't be 
read correctly.
   
   I traced the problem and found it is caused by SortOperator. 
After the record is serialized in the sorter, all the fields get disordered and are 
deserialized into a single field. The wrong record is then written into the 
Parquet file, making the file unreadable.
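   Since the symptom is an unreadable Parquet file, a quick structural check can help confirm the corruption before digging into the sorter. The sketch below is a hedged diagnostic, not Hudi tooling; it relies only on the fact that a well-formed Parquet file begins and ends with the 4-byte magic `PAR1`:
   
   ```python
   # Diagnostic sketch: check the Parquet magic bytes ("PAR1" at both the start
   # and the end of the file). A file failing this check is structurally broken;
   # passing it does not guarantee the row contents are valid.
   def looks_like_parquet(path: str) -> bool:
       with open(path, "rb") as f:
           head = f.read(4)
           f.seek(-4, 2)  # position 4 bytes before end of file
           tail = f.read(4)
       return head == b"PAR1" and tail == b"PAR1"
   ```
   
   Running this over the files under the table's base path separates "file is truncated or garbage" from "file is structurally well-formed but the rows inside are wrong".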
   
   Here are a few steps to reproduce the bug in the Flink sql-client:
   
   1. Execute the table DDL (provided in the table.ddl file in the attachments).
   
   2. Execute the insert statement (provided in the statements.sql file in the 
attachments).
   
   3. Execute a select statement to query the Hudi table (provided in the 
statements.sql file in the attachments).
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-4459
   - Type: Bug
   - Attachment(s):
     - statements.sql (Leo zhang, 25/Jul/22 06:36): https://issues.apache.org/jira/secure/attachment/13047162/statements.sql
     - table.ddl (Leo zhang, 25/Jul/22 06:35): https://issues.apache.org/jira/secure/attachment/13047163/table.ddl
   
   
   ---
   
   
   ## Comments
   
   26/Jul/22 00:07, rmahindra: [~danny0405] can you help assign this ticket?
   
   ---
   
   23/Aug/22 08:44, danny0405: Thanks for taking up this issue, assigned to you 
[~rmahindra] :)
   
   ---
   
   18/Apr/23 05:07, StarBoy1005: Hi! I met a problem. I use Flink 1.14.5 and Hudi 
0.13.0 to read a CSV file from HDFS and sink it to a Hudi COW table. In both streaming 
mode and batch mode, if I use bulk_insert, the job can't finish; the 
instant always stays in the inflight state.
   This is my COW table DDL:
   
   ```sql
   create table web_returns_cow (
      rid bigint PRIMARY KEY NOT ENFORCED,
      wr_returned_date_sk bigint,
      wr_returned_time_sk bigint,
      wr_item_sk bigint,
      wr_refunded_customer_sk bigint,
      wr_refunded_cdemo_sk bigint,
      wr_refunded_hdemo_sk bigint,
      wr_refunded_addr_sk bigint,
      wr_returning_customer_sk bigint,
      wr_returning_cdemo_sk bigint,
      wr_returning_hdemo_sk bigint,
      wr_returning_addr_sk bigint,
      wr_web_page_sk bigint,
      wr_reason_sk bigint,
      wr_order_number bigint,
      wr_return_quantity int,
      wr_return_amt float,
      wr_return_tax float,
      wr_return_amt_inc_tax float,
      wr_fee float,
      wr_return_ship_cost float,
      wr_refunded_cash float,
      wr_reversed_charge float,
      wr_account_credit float,
      wr_net_loss float
   )
   PARTITIONED BY (`wr_returned_date_sk`)
   WITH (
      'connector' = 'hudi',
      'path' = '/tmp/data_gen/web_returns_cow',
      'table.type' = 'COPY_ON_WRITE',
      'read.start-commit' = 'earliest',
      'read.streaming.enabled' = 'false',
      'changelog.enabled' = 'true',
      'write.precombine' = 'false',
      'write.precombine.field' = 'no_precombine',
      'write.operation' = 'bulk_insert',
      'read.tasks' = '5',
      'write.tasks' = '10',
      'index.type' = 'BUCKET',
      'metadata.enabled' = 'false',
      'hoodie.bucket.index.hash.field' = 'rid',
      'hoodie.bucket.index.num.buckets' = '10',
      'index.global.enabled' = 'false'
   );
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
