[ 
https://issues.apache.org/jira/browse/PHOENIX-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567110#comment-17567110
 ] 

Istvan Toth commented on PHOENIX-6698:
--------------------------------------

Let's circle back to a higher level.

Looking at
org.apache.phoenix.mapreduce.PhoenixInputFormat.generateSplits(QueryPlan,
Configuration), I cannot see anything there that should take a significant
amount of time.
The actual splitting by guideposts was already done when preparing the query
plan, and creating a PhoenixInputSplit is mostly just a POJO constructor call.

I suspect that the parallelization that you introduce here is only masking some
other inefficiency in the split generation, and we should fix that instead / as
well.

Can you provide some finer-grained profiling data on where exactly the
(unmodified) generateSplits() is spending ~2 seconds per region?
Ideally, something like a flame graph produced by async-profiler would be best.
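
If a flame graph is hard to capture in production, even coarse per-stage
timings around the steps of the per-scan loop would help narrow this down.
A minimal sketch, assuming a hypothetical StageTimer helper (it is not part of
Phoenix) and placeholder stage names that you would replace with the actual
steps of the loop:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/**
 * Hypothetical helper, not part of Phoenix: accumulates wall-clock time
 * and call counts per named stage so the slow step can be identified.
 */
final class StageTimer {
    private final Map<String, Long> totalNanos = new LinkedHashMap<>();
    private final Map<String, Integer> calls = new LinkedHashMap<>();

    /** Runs the given work, records its duration under the stage name, and returns its result. */
    <T> T time(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            totalNanos.merge(stage, elapsed, Long::sum);
            calls.merge(stage, 1, Integer::sum);
        }
    }

    long totalNanos(String stage) {
        return totalNanos.getOrDefault(stage, 0L);
    }

    int calls(String stage) {
        return calls.getOrDefault(stage, 0);
    }

    /** One line per stage: call count, total time and average time in milliseconds. */
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : totalNanos.entrySet()) {
            int n = calls.get(e.getKey());
            double totalMs = e.getValue() / 1_000_000.0;
            sb.append(String.format("%s: calls=%d total=%.1f ms avg=%.3f ms%n",
                    e.getKey(), n, totalMs, totalMs / n));
        }
        return sb.toString();
    }
}
```

Each step inside the loop would be wrapped as, for example,
timer.time("stage-name", () -> someStep(scan)), where someStep stands for
whatever call is being measured, and timer.report() would be logged once after
the loop. This only measures wall-clock time per stage, so it complements
rather than replaces a profiler.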

> hive-connector will take long time to generate splits for large phoenix 
> tables.
> -------------------------------------------------------------------------------
>
>                 Key: PHOENIX-6698
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-6698
>             Project: Phoenix
>          Issue Type: Improvement
>          Components: hive-connector
>    Affects Versions: 5.1.0
>            Reporter: jichen
>            Assignee: jichen
>            Priority: Minor
>             Fix For: connectors-6.0.0
>
>         Attachments: PHOENIX-6698.master.v1.patch
>
>
> In our production environment, the hive-phoenix connector takes nearly 30-40
> minutes to generate splits for a large Phoenix table that has more than 2048
> regions. This is because the 'generateSplits' function in class
> PhoenixInputFormat uses only one thread to generate the splits for each
> scan. My proposal is to use multiple threads to generate splits in parallel.
> The proposal has been validated in our production environment: by changing
> the code to generate splits in parallel with 24 threads, the time cost is
> reduced to 2 minutes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
