[
https://issues.apache.org/jira/browse/IMPALA-9777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117200#comment-17117200
]
Joe McDonnell commented on IMPALA-9777:
---------------------------------------
Uploaded python script that parses a namenode log and prints out the number of
outstanding files. This is where I get the estimate of 450 open files.
> Use Impala to load the text version of tpcds.store_sales
> --------------------------------------------------------
>
> Key: IMPALA-9777
> URL: https://issues.apache.org/jira/browse/IMPALA-9777
> Project: IMPALA
> Issue Type: Improvement
> Components: Infrastructure
> Affects Versions: Impala 4.0
> Reporter: Joe McDonnell
> Assignee: Joe McDonnell
> Priority: Major
> Attachments: namenodeparse.py
>
>
> Currently, dataload for the Impala development environment uses Hive to
> populate tpcds.store_sales. We use several insert statements that select from
> tpcds.stores_sales_unpartitioned, which is loaded from text files. The
> inserts have this form:
> {noformat}
> insert overwrite table {table_name} partition(ss_sold_date_sk)
> select ss_sold_time_sk,
> ss_item_sk,
> ss_customer_sk,
> ss_cdemo_sk,
> ss_hdemo_sk,
> ss_addr_sk,
> ss_store_sk,
> ss_promo_sk,
> ss_ticket_number,
> ss_quantity,
> ss_wholesale_cost,
> ss_list_price,
> ss_sales_price,
> ss_ext_discount_amt,
> ss_ext_sales_price,
> ss_ext_wholesale_cost,
> ss_ext_list_price,
> ss_ext_tax,
> ss_coupon_amt,
> ss_net_paid,
> ss_net_paid_inc_tax,
> ss_net_profit,
> ss_sold_date_sk
> from store_sales_unpartitioned
> WHERE ss_sold_date_sk < 2451272
> distribute by ss_sold_date_sk;{noformat}
> Since this is inserting into a partitioned table, it is creating a file per
> partition. Each statement manipulates hundreds of partitions. The Hive
> implementation of this insert opens several hundred files simultaneously (by
> my measurement, ~450). HDFS reserves a whole block for each file (even though
> the resulting files are not large), and if there isn't enough disk space for
> all of the reservations, then these inserts can fail. This is a common
> problem on development environments.
> Impala uses clustered inserts where the input is sorted and files are written
> one at a time (per backend). This limits the number of simultaneously open
> files, eliminating the corresponding disk space reservation. Switching
> populating tpcds.store_sales to use Impala would reduce the diskspace
> requirement for an Impala developer environment.
> This only applies to the text version of store_sales, which is created from
> store_sales_unpartitioned. All other formats are created from the text
> version of store_sales. Since the text store_sales is already partitioned in
> the same way as the destination store_sales, Hive can be more efficient,
> processing a small number of partitions at a time.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]