Joe McDonnell created IMPALA-9777:
-------------------------------------
Summary: Use Impala to load the text version of tpcds.store_sales
Key: IMPALA-9777
URL: https://issues.apache.org/jira/browse/IMPALA-9777
Project: IMPALA
Issue Type: Improvement
Components: Infrastructure
Affects Versions: Impala 4.0
Reporter: Joe McDonnell
Assignee: Joe McDonnell
Currently, dataload for the Impala development environment uses Hive to
populate tpcds.store_sales. We use several insert statements that select from
tpcds.stores_sales_unpartitioned, which is loaded from text files. The inserts
have this form:
{noformat}
insert overwrite table {table_name} partition(ss_sold_date_sk)
select ss_sold_time_sk,
ss_item_sk,
ss_customer_sk,
ss_cdemo_sk,
ss_hdemo_sk,
ss_addr_sk,
ss_store_sk,
ss_promo_sk,
ss_ticket_number,
ss_quantity,
ss_wholesale_cost,
ss_list_price,
ss_sales_price,
ss_ext_discount_amt,
ss_ext_sales_price,
ss_ext_wholesale_cost,
ss_ext_list_price,
ss_ext_tax,
ss_coupon_amt,
ss_net_paid,
ss_net_paid_inc_tax,
ss_net_profit,
ss_sold_date_sk
from store_sales_unpartitioned
WHERE ss_sold_date_sk < 2451272
distribute by ss_sold_date_sk;{noformat}
Since this is inserting into a partitioned table, it is creating a file per
partition. Each statement manipulates hundreds of partitions. The Hive
implementation of this insert opens several hundred files simultaneously (by my
measurement, ~450). HDFS reserves a whole block for each file (even though the
resulting files are not large), and if there isn't enough disk space for all of
the reservations, then these inserts can fail. This is a common problem on
development environments.
Impala uses clustered inserts where the input is sorted and files are written
one at a time (per backend). This limits the number of simultaneously open
files, eliminating the corresponding disk space reservation. Switching
populating tpcds.store_sales to use Impala would reduce the diskspace
requirement for an Impala developer environment.
This only applies to the text version of store_sales, which is created from
store_sales_unpartitioned. All other formats are created from the text version
of store_sales. Since the text store_sales is already partitioned in the same
way as the destination store_sales, Hive can be more efficient, processing a
small number of partitions at a time.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]