[
https://issues.apache.org/jira/browse/IMPALA-9777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118983#comment-17118983
]
Sahil Takiar commented on IMPALA-9777:
--------------------------------------
Setting {{hive.optimize.sort.dynamic.partition}} to true does get data load to
pass when EC is enabled, although it adds several minutes of latency to the
TPC-DS dataload. [~joemcdonnell] how exactly do I use the Python script to
estimate the number of open files?
I ran it against the two {{hdfs-namenode.log}} files available in the Jenkins
output after running core tests on Impala-on-EC and saw the following:
{code:bash}
./namenodeparse.py -f
impala-private-parameterized-DEBUG-7212/logs/cluster/cdh7-node-1/hadoop-hdfs/hdfs-namenode.log.1
Filename: 127.0.0.1:31004 Begins: 3821 Ends: 0
Filename: 127.0.0.1:31002 Begins: 3747 Ends: 0
Filename: 127.0.0.1:31003 Begins: 3644 Ends: 0
Filename: 127.0.0.1:31000 Begins: 3802 Ends: 0
Filename: 127.0.0.1:31001 Begins: 3823 Ends: 0
Increments: 18837, Decrements: 0, Max Outstanding: 18837
./namenodeparse.py -f
impala-private-parameterized-DEBUG-7212/logs/cluster/cdh7-node-1/hadoop-hdfs/hdfs-namenode.log
Filename: 127.0.0.1:31004 Begins: 559 Ends: 0
Filename: 127.0.0.1:31002 Begins: 556 Ends: 0
Filename: 127.0.0.1:31003 Begins: 560 Ends: 0
Filename: 127.0.0.1:31000 Begins: 529 Ends: 0
Filename: 127.0.0.1:31001 Begins: 533 Ends: 0
Increments: 2737, Decrements: 0, Max Outstanding: 2737
{code}
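For anyone without the attachment handy, the counting the script does can be approximated as follows. This is a minimal sketch, not the attached namenodeparse.py: the log patterns ({{allocate}}/{{completeFile}}) and the per-client grouping are assumptions about what the real script matches, so treat the regexes as placeholders.

```python
import re
from collections import Counter

def count_open_files(log_path):
    """Sketch of an open-file estimator for a namenode log: count block
    allocations (file opens) and completeFile events (closes) per client,
    tracking the running maximum of outstanding open files.
    The regexes below are hypothetical, not the real script's patterns."""
    begins = Counter()
    ends = Counter()
    outstanding = 0
    max_outstanding = 0
    increments = 0
    decrements = 0
    with open(log_path) as f:
        for line in f:
            # Hypothetical "file opened" marker: a block allocation line.
            m = re.search(r'allocate blk_\S+ .* for (\S+)', line)
            if m:
                begins[m.group(1)] += 1
                increments += 1
                outstanding += 1
                max_outstanding = max(max_outstanding, outstanding)
                continue
            # Hypothetical "file closed" marker: a completeFile line.
            m = re.search(r'completeFile: \S+ is closed by (\S+)', line)
            if m:
                ends[m.group(1)] += 1
                decrements += 1
                outstanding -= 1
    return begins, ends, increments, decrements, max_outstanding
```

Run against a namenode log, this yields the per-client Begins/Ends counts and the Increments/Decrements/Max Outstanding summary shown in the output above; zero Ends, as seen there, would mean no close events matched.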
> Use Impala to load the text version of tpcds.store_sales
> --------------------------------------------------------
>
> Key: IMPALA-9777
> URL: https://issues.apache.org/jira/browse/IMPALA-9777
> Project: IMPALA
> Issue Type: Improvement
> Components: Infrastructure
> Affects Versions: Impala 4.0
> Reporter: Joe McDonnell
> Assignee: Joe McDonnell
> Priority: Major
> Attachments: namenodeparse.py
>
>
> Currently, dataload for the Impala development environment uses Hive to
> populate tpcds.store_sales. We use several insert statements that select from
> tpcds.store_sales_unpartitioned, which is loaded from text files. The
> inserts have this form:
> {noformat}
> insert overwrite table {table_name} partition(ss_sold_date_sk)
> select ss_sold_time_sk,
> ss_item_sk,
> ss_customer_sk,
> ss_cdemo_sk,
> ss_hdemo_sk,
> ss_addr_sk,
> ss_store_sk,
> ss_promo_sk,
> ss_ticket_number,
> ss_quantity,
> ss_wholesale_cost,
> ss_list_price,
> ss_sales_price,
> ss_ext_discount_amt,
> ss_ext_sales_price,
> ss_ext_wholesale_cost,
> ss_ext_list_price,
> ss_ext_tax,
> ss_coupon_amt,
> ss_net_paid,
> ss_net_paid_inc_tax,
> ss_net_profit,
> ss_sold_date_sk
> from store_sales_unpartitioned
> WHERE ss_sold_date_sk < 2451272
> distribute by ss_sold_date_sk;{noformat}
> Since this is inserting into a partitioned table, it creates a file per
> partition, and each statement touches hundreds of partitions. The Hive
> implementation of this insert opens several hundred files simultaneously (by
> my measurement, ~450). HDFS reserves a whole block for each open file (even
> though the resulting files are not large), and if there isn't enough disk
> space for all of the reservations, these inserts can fail. This is a common
> problem in development environments.
> Impala uses clustered inserts, where the input is sorted and files are
> written one at a time (per backend). This limits the number of simultaneously
> open files, eliminating the corresponding disk space reservations. Switching
> the population of tpcds.store_sales to Impala would reduce the disk space
> requirement for an Impala development environment.
> This only applies to the text version of store_sales, which is created from
> store_sales_unpartitioned. All other formats are created from the text
> version of store_sales. Since the text store_sales is already partitioned in
> the same way as the destination store_sales, Hive can be more efficient,
> processing a small number of partitions at a time.
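The disk-space pressure described in the issue can be put in back-of-envelope numbers. The figures below are illustrative, not measured from a cluster: ~450 simultaneously open files is the measurement quoted above, 128 MB is HDFS's default block size, and a replication factor of 3 is an assumption.

```python
# Illustrative estimate of the space HDFS reserves while a Hive dynamic-
# partition insert runs: one full block is reserved per open file, across
# all replicas, regardless of how small the final file turns out to be.
open_files = 450        # measured simultaneously open files (from the issue)
block_size_mb = 128     # HDFS default block size (assumed here)
replication = 3         # assumed replication factor
reserved_gb = open_files * block_size_mb * replication / 1024
print(f"~{reserved_gb:.0f} GB reserved during the insert")
```

Even though the resulting partition files are tiny, the transient reservation is on the order of 170 GB under these assumptions, which explains why the inserts fail on developer machines that are short on disk.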
--
This message was sent by Atlassian Jira
(v8.3.4#803005)