[ 
https://issues.apache.org/jira/browse/IMPALA-9777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118983#comment-17118983
 ] 

Sahil Takiar commented on IMPALA-9777:
--------------------------------------

Setting {{hive.optimize.sort.dynamic.partition}} to true does get data load to 
pass when EC is enabled, although it adds several minutes of latency to the 
TPC-DS dataload. [~joemcdonnell] how exactly do I use the Python script to 
estimate the number of open files?

I ran it against the two {{hdfs-namenode.log}} files available in the Jenkins 
output after running core tests on Impala-on-EC and saw the following:
{code:java}
./namenodeparse.py -f 
impala-private-parameterized-DEBUG-7212/logs/cluster/cdh7-node-1/hadoop-hdfs/hdfs-namenode.log.1

Filename: 127.0.0.1:31004 Begins: 3821 Ends: 0
Filename: 127.0.0.1:31002 Begins: 3747 Ends: 0
Filename: 127.0.0.1:31003 Begins: 3644 Ends: 0
Filename: 127.0.0.1:31000 Begins: 3802 Ends: 0
Filename: 127.0.0.1:31001 Begins: 3823 Ends: 0
Increments: 18837, Decrements: 0, Max Outstanding: 18837

./namenodeparse.py -f 
impala-private-parameterized-DEBUG-7212/logs/cluster/cdh7-node-1/hadoop-hdfs/hdfs-namenode.log

Filename: 127.0.0.1:31004 Begins: 559 Ends: 0
Filename: 127.0.0.1:31002 Begins: 556 Ends: 0
Filename: 127.0.0.1:31003 Begins: 560 Ends: 0
Filename: 127.0.0.1:31000 Begins: 529 Ends: 0
Filename: 127.0.0.1:31001 Begins: 533 Ends: 0
Increments: 2737, Decrements: 0, Max Outstanding: 2737
{code}

> Use Impala to load the text version of tpcds.store_sales
> --------------------------------------------------------
>
>                 Key: IMPALA-9777
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9777
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Infrastructure
>    Affects Versions: Impala 4.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Major
>         Attachments: namenodeparse.py
>
>
> Currently, dataload for the Impala development environment uses Hive to 
> populate tpcds.store_sales. We use several insert statements that select from 
> tpcds.stores_sales_unpartitioned, which is loaded from text files. The 
> inserts have this form:
> {noformat}
> insert overwrite table {table_name} partition(ss_sold_date_sk)
> select ss_sold_time_sk,
>   ss_item_sk,
>   ss_customer_sk,
>   ss_cdemo_sk,
>   ss_hdemo_sk,
>   ss_addr_sk,
>   ss_store_sk,
>   ss_promo_sk,
>   ss_ticket_number,
>   ss_quantity,
>   ss_wholesale_cost,
>   ss_list_price,
>   ss_sales_price,
>   ss_ext_discount_amt,
>   ss_ext_sales_price,
>   ss_ext_wholesale_cost,
>   ss_ext_list_price,
>   ss_ext_tax,
>   ss_coupon_amt,
>   ss_net_paid,
>   ss_net_paid_inc_tax,
>   ss_net_profit,
>   ss_sold_date_sk
> from store_sales_unpartitioned
> WHERE ss_sold_date_sk < 2451272
> distribute by ss_sold_date_sk;{noformat}
> Since this is inserting into a partitioned table, it is creating a file per 
> partition. Each statement manipulates hundreds of partitions. The Hive 
> implementation of this insert opens several hundred files simultaneously (by 
> my measurement, ~450). HDFS reserves a whole block for each file (even though 
> the resulting files are not large), and if there isn't enough disk space for 
> all of the reservations, then these inserts can fail. This is a common 
> problem on development environments.
> Impala uses clustered inserts where the input is sorted and files are written 
> one at a time (per backend). This limits the number of simultaneously open 
> files, eliminating the corresponding disk space reservation. Switching 
> populating tpcds.store_sales to use Impala would reduce the diskspace 
> requirement for an Impala developer environment.
> This only applies to the text version of store_sales, which is created from 
> store_sales_unpartitioned. All other formats are created from the text 
> version of store_sales. Since the text store_sales is already partitioned in 
> the same way as the destination store_sales, Hive can be more efficient, 
> processing a small number of partitions at a time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to