[ https://issues.apache.org/jira/browse/HIVE-6872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969797#comment-13969797 ]
Prasanth J commented on HIVE-6872:
----------------------------------

[~rajesh.balamohan] Can you please post the patch on Review Board? Here is the link: https://reviews.apache.org/r/new/

> Explore options of optimizing FileSinkOperator-->getDynOutPaths()
> -----------------------------------------------------------------
>
>                 Key: HIVE-6872
>                 URL: https://issues.apache.org/jira/browse/HIVE-6872
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Critical
>         Attachments: HIVE-6782-v3.patch, HIVE-6782-v4.patch
>
>
> 1. Download hive-testbench from
> https://github.com/cartershanklin/hive-testbench
> 2. Generate data using "./tpcds-setup.sh 10 /user/hive/external partitioned"
> 3. Most of the data population for tables with "partition + bucket + sorted
> data" will run a lot slower, even with a scale factor of 10 on a 20-node
> cluster.
> The bottleneck seems to be in FileSinkOperator-->getDynOutPaths(), where it
> tries to close FSPath writers. Every call takes almost 150-200 ms.
> set hive.enforce.bucketing=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.max.dynamic.partitions.pernode=4096;
> With the above settings, one of the data loads (for the web_sales table) spent
> roughly 4096 * 150 ms, or about 614 seconds, just closing the writers
> sequentially.
> The purpose of this JIRA is to figure out options for optimizing the
> FileSinkOperator-->getDynOutPaths() code path. This will be beneficial
> especially in ETL-type workloads.

--
This message was sent by Atlassian JIRA
(v6.2#6252)