[ 
https://issues.apache.org/jira/browse/IMPALA-10900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413271#comment-17413271
 ] 

ASF subversion and git services commented on IMPALA-10900:
----------------------------------------------------------

Commit 6b4693ddbfd8111344dc88443d76bd3b034f361e in impala's branch 
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=6b4693d ]

IMPALA-10900: Add Iceberg tests that write many files

In earlier versions of Impala we had a bug that affected
insertions to Iceberg tables. When Impala wrote multiple
files during a single INSERT statement it could crash, or
even worse, it could silently omit data files from the
Iceberg metadata.

The current master doesn't have this bug, but we don't
really have tests for this case.

This patch adds tests that write many files during inserts
to an Iceberg table. Both non-partitioned and partitioned
Iceberg tables are tested.

We achieve writing lots of files by setting 'parquet_file_size'
to 8 megabytes.
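
The setup can be sketched as an Impala SQL session; 'parquet_file_size' comes
from the patch description, but the table and source names below are
illustrative assumptions, not taken from the actual tests:

{code:java}
-- Cap the Parquet file size at 8 MB so a single INSERT emits many files.
set parquet_file_size=8m;

-- Hypothetical Iceberg target table (names made up for illustration).
create table ice_many_files (id bigint, payload string) stored as iceberg;

-- With the small file size cap this INSERT writes multiple data files,
-- all of which must show up in the Iceberg metadata.
insert into ice_many_files select id, payload from some_large_source;
{code}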

Testing:
 * added e2e test that writes many data files
 * added exhaustive e2e test that writes even more data files

Change-Id: Ia2dbc2c5f9574153842af308a61f9d91994d067b
Reviewed-on: http://gerrit.cloudera.org:8080/17831
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Some parquet files are missing from Iceberg avro metadata
> ---------------------------------------------------------
>
>                 Key: IMPALA-10900
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10900
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.0.0
>            Reporter: Riza Suminto
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>         Attachments: 9e044ebf-f549-4c99-9952-3b8c0a70cafb-m0.avro, 
> profile_ec48089182b28ba9_b2910c2d00000000_iceberg_ctas
>
>
> I tried to load the tpcds_3000_iceberg database based on an existing parquet 
> database. The source database is tpcds_3000_parquet (3 TB scale) and the 
> cluster has 10 nodes.
> After loading table catalog_sales, I found that the row count of 
> tpcds_3000_iceberg.catalog_sales is lower than the row count of 
> tpcds_3000_parquet.catalog_sales. Further debugging reveals that the CTAS 
> query actually finishes writing the parquet files, but only one parquet file 
> per partition gets written into the Iceberg avro metadata.
> For example, inspecting partition 2451120 shows that there are 2 parquet 
> files:
> {code:java}
>  ~  sudo -u hdfs hdfs dfs -ls 
> /warehouse/tablespace/external/hive/tpcds_3000_iceberg.db/catalog_sales/data/cs_sold_date_sk=2451120/
>  Found 2 items
> -rw-rw----+ 3 impala hive 264837186 2021-08-28 20:05 
> /warehouse/tablespace/external/hive/tpcds_3000_iceberg.db/catalog_sales/data/cs_sold_date_sk=2451120/ec48089182b28ba9-b2910c2d00000011_1004901849_data.0.parq
> -rw-rw----+ 3 impala hive 80608775 2021-08-28 20:05 
> /warehouse/tablespace/external/hive/tpcds_3000_iceberg.db/catalog_sales/data/cs_sold_date_sk=2451120/ec48089182b28ba9-b2910c2d00000011_1004901849_data.1.parq{code}
>  However, the avro manifest file only has 
> ec48089182b28ba9-b2910c2d00000011_1004901849_data.0.parq in it:
> {code:java}
>  ~  avro-tools tojson /tmp/9e044ebf-f549-4c99-9952-3b8c0a70cafb-m0.avro| 
> grep "cs_sold_date_sk=2451120"
> [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable
> {"status":1,"snapshot_id":{"long":403055188554782479},"data_file":{"file_path":"hdfs://ve1315.halxg.cloudera.com:8020/warehouse/tablespace/external/hive/tpcds_3000_iceberg.db/catalog_sales/data/cs_sold_date_sk=2451120/ec48089182b28ba9-b2910c2d00000011_1004901849_data.0.parq","file_format":"PARQUET","partition":{"cs_sold_date_sk":{"long":2451120}},"record_count":3616116,"file_size_in_bytes":264837186,"block_size_in_bytes":67108864,"column_sizes":null,"value_counts":null,"null_value_counts":null,"nan_value_counts":null,"lower_bounds":null,"upper_bounds":null,"key_metadata":null,"split_offsets":null}}
> {code}
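> The same discrepancy is also visible from inside Impala, assuming SHOW FILES 
> reports only the data files tracked by the Iceberg metadata:
> {code:java}
> -- With the bug present this lists only data.0.parq for the partition,
> -- while 'hdfs dfs -ls' above shows both data.0.parq and data.1.parq.
> show files in tpcds_3000_iceberg.catalog_sales;
> {code}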
> The CTAS query that I used for debugging is the following:
> {code:java}
> create table catalog_sales
> partitioned by (cs_sold_date_sk)
> stored as iceberg
> as select
>   cs_sold_time_sk,
>   cs_ship_date_sk,
>   cs_bill_customer_sk,
>   cs_bill_cdemo_sk,
>   cs_bill_hdemo_sk,
>   cs_bill_addr_sk,
>   cs_ship_customer_sk,
>   cs_ship_cdemo_sk,
>   cs_ship_hdemo_sk,
>   cs_ship_addr_sk,
>   cs_call_center_sk,
>   cs_catalog_page_sk,
>   cs_ship_mode_sk,
>   cs_warehouse_sk,
>   cs_item_sk,
>   cs_promo_sk,
>   cs_order_number,
>   cs_quantity,
>   cs_wholesale_cost,
>   cs_list_price,
>   cs_sales_price,
>   cs_ext_discount_amt,
>   cs_ext_sales_price,
>   cs_ext_wholesale_cost,
>   cs_ext_list_price,
>   cs_ext_tax,
>   cs_coupon_amt,
>   cs_ext_ship_cost,
>   cs_net_paid,
>   cs_net_paid_inc_tax,
>   cs_net_paid_inc_ship,
>   cs_net_paid_inc_ship_tax,
>   cs_net_profit,
>   cs_sold_date_sk
> from tpcds_3000_parquet.catalog_sales where cs_sold_date_sk > 2451100 and 
> cs_sold_date_sk <= 2451200;
> {code}
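> The mismatch can be confirmed by comparing row counts directly; with the bug 
> present the Iceberg count is lower because some data files are missing from 
> the metadata:
> {code:java}
> select count(*) from tpcds_3000_parquet.catalog_sales
> where cs_sold_date_sk > 2451100 and cs_sold_date_sk <= 2451200;
> select count(*) from tpcds_3000_iceberg.catalog_sales;
> {code}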



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
