[
https://issues.apache.org/jira/browse/PHOENIX-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567110#comment-17567110
]
Istvan Toth commented on PHOENIX-6698:
--------------------------------------
Let's circle back to a higher level.
Looking at
org.apache.phoenix.mapreduce.PhoenixInputFormat.generateSplits(QueryPlan,
Configuration), I cannot see anything there that should take a significant
amount of time.
The actual splitting by guideposts was already done when preparing the query
plan, and creating a PhoenixInputSplit is mostly just a POJO constructor call.
I suspect that the parallelization that you introduce here is only masking some
other inefficiency in the split generation, and we should fix that instead / as
well.
Can you provide some finer-grained profiling data on where exactly the
(unmodified) generateSplits() is spending ~2 seconds per region?
Ideally, something like a flame graph produced by async-profiler would be best.
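If attaching a profiler to the production job is difficult, coarse per-phase
timings would already help narrow this down. A minimal sketch of such a helper
follows. PhaseTimer and the phase names are hypothetical and not part of
Phoenix, and the lambdas in main() are placeholders for whatever calls the loop
body in generateSplits() actually makes.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of Phoenix): accumulates wall-clock time per
 * named phase. Not thread-safe; intended for the unmodified, single-threaded
 * split loop.
 */
final class PhaseTimer {
    // Insertion-ordered so the report lists phases in the order first seen.
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    /** Runs the step, adds its elapsed time to the phase total, returns its result. */
    public <T> T time(String phase, Supplier<T> step) {
        long start = System.nanoTime();
        try {
            return step.get();
        } finally {
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Read-only view of the accumulated nanoseconds per phase. */
    public Map<String, Long> totals() {
        return Collections.unmodifiableMap(nanosByPhase);
    }

    /** One line per phase, total time in milliseconds. */
    public String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : nanosByPhase.entrySet()) {
            sb.append(String.format("%s: %.3f ms%n", e.getKey(), e.getValue() / 1_000_000.0));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        for (int i = 0; i < 3; i++) {
            final int n = i;
            // Placeholder steps; replace with the real per-scan calls.
            String location = timer.time("lookupLocation", () -> "host-" + n);
            timer.time("buildSplit", () -> location.length());
        }
        System.out.print(timer.report());
    }
}
```

Wrapping each call inside the per-scan loop this way and logging report() once
at the end should show which step accounts for the time, without changing the
behavior of the loop itself.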
> hive-connector will take long time to generate splits for large phoenix
> tables.
> -------------------------------------------------------------------------------
>
> Key: PHOENIX-6698
> URL: https://issues.apache.org/jira/browse/PHOENIX-6698
> Project: Phoenix
> Issue Type: Improvement
> Components: hive-connector
> Affects Versions: 5.1.0
> Reporter: jichen
> Assignee: jichen
> Priority: Minor
> Fix For: connectors-6.0.0
>
> Attachments: PHOENIX-6698.master.v1.patch
>
>
> In our production environment, the hive-phoenix connector takes nearly 30-40
> minutes to generate splits for a large Phoenix table that has more than 2048
> regions. This is because the 'generateSplits' function in class
> PhoenixInputFormat uses only one thread to generate the splits for each scan.
> My proposal is to use multiple threads to generate splits in parallel. The
> proposal has been validated in our production environment: by changing the
> code to generate splits in parallel with 24 threads, the time cost is reduced
> to 2 minutes.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)