[ 
https://issues.apache.org/jira/browse/PHOENIX-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533839#comment-17533839
 ] 

ASF GitHub Bot commented on PHOENIX-6698:
-----------------------------------------

jichen20210919 opened a new pull request, #79:
URL: https://github.com/apache/phoenix-connectors/pull/79

   This patch enables PhoenixInputFormat to generate splits in parallel. It 
introduces two parameters to control the degree of parallelism.
   1. 'hive.phoenix.split.parallel.threshold' controls whether splits are 
generated in parallel. Splits are generated serially in either of the 
following cases:
   (1) hive.phoenix.split.parallel.threshold < 0.
   (2) The number of scans in the query plan is less than the configured 
value.
   In all other cases, splits are generated in parallel.
   2. 'hive.phoenix.split.parallel.level' controls the number of worker 
threads used to generate the splits (2 * CPU cores by default).
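   The decision rule and the worker pool described above can be sketched 
roughly as follows. This is a minimal, self-contained illustration and not 
the code in the patch: the class name ParallelSplitSketch, its method names, 
and the generic scan/split types are hypothetical stand-ins, and the two 
int arguments are assumed to carry the values of the two parameters.
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.concurrent.Callable;
   import java.util.concurrent.ExecutionException;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.Future;
   import java.util.function.Function;

   // Hypothetical sketch; names and types are assumptions, not the patch's API.
   final class ParallelSplitSketch {

       // Serial when the threshold is negative, or when the plan has fewer
       // scans than the threshold; parallel otherwise.
       static boolean shouldRunInParallel(int threshold, int scanCount) {
           return threshold >= 0 && scanCount >= threshold;
       }

       // Default worker count described in the PR: 2 * CPU cores.
       static int defaultParallelLevel() {
           return 2 * Runtime.getRuntime().availableProcessors();
       }

       // Converts each scan to a split, serially or on a fixed-size pool.
       static <S, T> List<T> generateSplits(List<S> scans, Function<S, T> toSplit,
                                            int threshold, int parallelLevel) {
           List<T> splits = new ArrayList<>(scans.size());
           if (!shouldRunInParallel(threshold, scans.size())) {
               for (S scan : scans) {
                   splits.add(toSplit.apply(scan));
               }
               return splits;
           }
           ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, parallelLevel));
           try {
               List<Future<T>> futures = new ArrayList<>(scans.size());
               for (S scan : scans) {
                   Callable<T> task = () -> toSplit.apply(scan);
                   futures.add(pool.submit(task));
               }
               // Collecting in submission order keeps splits in scan order.
               for (Future<T> future : futures) {
                   splits.add(future.get());
               }
               return splits;
           } catch (InterruptedException e) {
               Thread.currentThread().interrupt();
               throw new IllegalStateException("interrupted while generating splits", e);
           } catch (ExecutionException e) {
               throw new IllegalStateException("split generation failed", e.getCause());
           } finally {
               pool.shutdown();
           }
       }

       public static void main(String[] args) {
           List<Integer> scans = new ArrayList<>();
           for (int i = 0; i < 128; i++) {
               scans.add(i);
           }
           // Threshold of 32 is an arbitrary example value.
           List<String> splits =
               generateSplits(scans, i -> "split-" + i, 32, defaultParallelLevel());
           System.out.println(splits.size() + " splits generated");
       }
   }
   ```
   In this sketch a threshold of 0 always selects the parallel path, since 
the scan count can never be less than 0. The results are collected in 
submission order, so the returned list matches the order of the scans 
regardless of which worker finishes first.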
   A unit test is included. It compares the time taken to generate splits 
for a Phoenix table with 128 regions, once serially and once in parallel.
   In the run below, the parallel method is roughly 5x faster than the 
serial method (12843 ms vs. 2728 ms), and the gain is expected to be larger 
for tables with more regions.
   ```
   SLF4J: Class path contains multiple SLF4J bindings.
   SLF4J: Found binding in 
[jar:file:/home/jichen/.m2/repository/org/apache/logging/log4j/log4j-slf4j-impl/2.10.0/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
   SLF4J: Found binding in 
[jar:file:/home/jichen/.m2/repository/org/slf4j/slf4j-log4j12/1.7.30/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
   SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
explanation.
   SLF4J: Actual binding is of type 
[org.apache.logging.slf4j.Log4jLoggerFactory]
   Formatting using clusterid: testClusterID
   generate testing table with 128 splits
   get split in serial requires:12843 ms
   get split in parallel requires:2728 ms
   ```
   In our production environment, we measured a table with 2048 regions: 
with the default configuration, split generation time dropped from nearly 
30 minutes to 2 minutes.




> hive-connector will take long time to generate splits for large phoenix 
> tables.
> -------------------------------------------------------------------------------
>
>                 Key: PHOENIX-6698
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-6698
>             Project: Phoenix
>          Issue Type: Improvement
>          Components: hive-connector
>    Affects Versions: 5.1.0
>            Reporter: jichen
>            Assignee: jichen
>            Priority: Minor
>             Fix For: connectors-6.0.0
>
>         Attachments: PHOENIX-6698.master.v1.patch
>
>
> In our production environment, the hive-phoenix connector takes nearly 
> 30-40 minutes to generate splits for a large Phoenix table that has more 
> than 2048 regions. This is because the 'generateSplits' function in class 
> PhoenixInputFormat uses a single thread to generate the splits for every 
> scan. My proposal is to use multiple threads to generate splits in 
> parallel. The proposal has been validated in our production environment: 
> after changing the code to generate splits in parallel with 24 threads, 
> the time cost is reduced to 2 minutes.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
