[
https://issues.apache.org/jira/browse/IMPALA-10900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zoltán Borók-Nagy resolved IMPALA-10900.
----------------------------------------
Fix Version/s: Impala 4.1.0
Resolution: Fixed
> Some parquet files are missing from Iceberg avro metadata
> ---------------------------------------------------------
>
> Key: IMPALA-10900
> URL: https://issues.apache.org/jira/browse/IMPALA-10900
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.0.0
> Reporter: Riza Suminto
> Assignee: Zoltán Borók-Nagy
> Priority: Major
> Labels: impala-iceberg
> Fix For: Impala 4.1.0
>
> Attachments: 9e044ebf-f549-4c99-9952-3b8c0a70cafb-m0.avro,
> profile_ec48089182b28ba9_b2910c2d00000000_iceberg_ctas
>
>
> I tried to load the tpcds_3000_iceberg database based on an existing parquet
> database. The source database is tpcds_3000_parquet (3TB scale) and the
> cluster has 10 nodes.
> After loading table catalog_sales, I found that the row count of
> tpcds_3000_iceberg.catalog_sales is less than the row count of
> tpcds_3000_parquet.catalog_sales. Further debugging reveals that the CTAS
> query actually finishes writing the parquet files, but only one parquet file
> per partition gets written into the Iceberg avro metadata.
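> To make the discrepancy visible, the two tables can be compared directly (a minimal sketch, assuming impala-shell is on the PATH and connected to the cluster; the second count comes out lower because of the missing files):
> {code:java}
> # row count in the source parquet table, restricted to the CTAS predicate
> impala-shell -q "select count(*) from tpcds_3000_parquet.catalog_sales where cs_sold_date_sk > 2451100 and cs_sold_date_sk <= 2451200"
> # row count in the new Iceberg table; lower than the above when files are missing
> impala-shell -q "select count(*) from tpcds_3000_iceberg.catalog_sales"
> {code}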
> For example, inspecting partition 2451120 shows that there are 2 parquet
> files:
> {code:java}
> ~ sudo -u hdfs hdfs dfs -ls /warehouse/tablespace/external/hive/tpcds_3000_iceberg.db/catalog_sales/data/cs_sold_date_sk=2451120/
> Found 2 items
> -rw-rw----+ 3 impala hive 264837186 2021-08-28 20:05 /warehouse/tablespace/external/hive/tpcds_3000_iceberg.db/catalog_sales/data/cs_sold_date_sk=2451120/ec48089182b28ba9-b2910c2d00000011_1004901849_data.0.parq
> -rw-rw----+ 3 impala hive 80608775 2021-08-28 20:05 /warehouse/tablespace/external/hive/tpcds_3000_iceberg.db/catalog_sales/data/cs_sold_date_sk=2451120/ec48089182b28ba9-b2910c2d00000011_1004901849_data.1.parq
> {code}
> However, the avro manifest file only has
> ec48089182b28ba9-b2910c2d00000011_1004901849_data.0.parq in it:
> {code:java}
> ~ avro-tools tojson /tmp/9e044ebf-f549-4c99-9952-3b8c0a70cafb-m0.avro | grep "cs_sold_date_sk=2451120"
> [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> {"status":1,"snapshot_id":{"long":403055188554782479},"data_file":{"file_path":"hdfs://ve1315.halxg.cloudera.com:8020/warehouse/tablespace/external/hive/tpcds_3000_iceberg.db/catalog_sales/data/cs_sold_date_sk=2451120/ec48089182b28ba9-b2910c2d00000011_1004901849_data.0.parq","file_format":"PARQUET","partition":{"cs_sold_date_sk":{"long":2451120}},"record_count":3616116,"file_size_in_bytes":264837186,"block_size_in_bytes":67108864,"column_sizes":null,"value_counts":null,"null_value_counts":null,"nan_value_counts":null,"lower_bounds":null,"upper_bounds":null,"key_metadata":null,"split_offsets":null}}
> {code}
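> To check a partition without eyeballing the JSON, the manifest entries can simply be counted and compared against the HDFS listing (a sketch using the same avro-tools dump as above; the bug makes this print 1 even though HDFS holds 2 files):
> {code:java}
> # number of data_file entries recorded in the manifest for this partition
> avro-tools tojson /tmp/9e044ebf-f549-4c99-9952-3b8c0a70cafb-m0.avro \
>   | grep -c "cs_sold_date_sk=2451120"
> {code}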
> The CTAS query that I used for debugging is the following:
> {code:java}
> create table catalog_sales
> partitioned by (cs_sold_date_sk)
> stored as iceberg
> as select
> cs_sold_time_sk,
> cs_ship_date_sk,
> cs_bill_customer_sk,
> cs_bill_cdemo_sk,
> cs_bill_hdemo_sk,
> cs_bill_addr_sk,
> cs_ship_customer_sk,
> cs_ship_cdemo_sk,
> cs_ship_hdemo_sk,
> cs_ship_addr_sk,
> cs_call_center_sk,
> cs_catalog_page_sk,
> cs_ship_mode_sk,
> cs_warehouse_sk,
> cs_item_sk,
> cs_promo_sk,
> cs_order_number,
> cs_quantity,
> cs_wholesale_cost,
> cs_list_price,
> cs_sales_price,
> cs_ext_discount_amt,
> cs_ext_sales_price,
> cs_ext_wholesale_cost,
> cs_ext_list_price,
> cs_ext_tax,
> cs_coupon_amt,
> cs_ext_ship_cost,
> cs_net_paid,
> cs_net_paid_inc_tax,
> cs_net_paid_inc_ship,
> cs_net_paid_inc_ship_tax,
> cs_net_profit,
> cs_sold_date_sk
> from tpcds_3000_parquet.catalog_sales where cs_sold_date_sk > 2451100 and
> cs_sold_date_sk <= 2451200;
> {code}
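> For a table-wide check after the CTAS, the file set that Iceberg knows about can be compared with what is actually on HDFS (a sketch; it assumes SHOW FILES on an Iceberg table reflects the current snapshot metadata):
> {code:java}
> # files according to the Iceberg snapshot
> impala-shell -q "show files in tpcds_3000_iceberg.catalog_sales"
> # files actually present in the table's data directory
> sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/external/hive/tpcds_3000_iceberg.db/catalog_sales/data/
> {code}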