jichen20210919 opened a new pull request, #79: URL: https://github.com/apache/phoenix-connectors/pull/79
This patch enables PhoenixInputFormat to generate splits in parallel. It introduces two parameters to control the degree of parallelism:

1. hive.phoenix.split.parallel.threshold controls whether splits are generated in parallel. Splits are generated serially in either of the following cases:
   (1) hive.phoenix.split.parallel.threshold < 0;
   (2) the number of scans in the query plan is less than the configured value.
   In all other cases, splits are generated in parallel.
2. hive.phoenix.split.parallel.level controls the number of worker threads used to generate splits (2 * CPU cores by default).

A unit test is added that compares the time taken to generate splits for a Phoenix table with 128 regions. In the run shown below, the parallel method is about 4.7x faster than the serial method (12843 ms vs. 2728 ms), and the gain is expected to be larger for tables with more regions.

```
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/jichen/.m2/repository/org/apache/logging/log4j/log4j-slf4j-impl/2.10.0/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/jichen/.m2/repository/org/slf4j/slf4j-log4j12/1.7.30/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Formatting using clusterid: testClusterID
generate testing table with 128 splits
get split in serial requires:12843 ms
get split in parallel requires:2728 ms
```

In a production environment, we tested a table with 2048 regions. With the default configuration, the time to generate splits dropped from nearly 30 minutes to 2 minutes.
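The serial/parallel decision described above can be sketched roughly as follows. This is an illustrative sketch, not the code from the patch: the class `ParallelSplitSketch`, its method names, and the generic `toSplit` function are hypothetical stand-ins for the real scan-to-split conversion in PhoenixInputFormat, and `parallelLevel` is assumed to be positive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch of the behavior described in the PR text; not the patch itself.
class ParallelSplitSketch {

    // Serial when the threshold is negative or the plan has fewer scans than the threshold.
    static boolean shouldRunInParallel(int threshold, int scanCount) {
        return threshold >= 0 && scanCount >= threshold;
    }

    // Default stated in the PR: twice the number of CPU cores.
    static int defaultParallelLevel() {
        return 2 * Runtime.getRuntime().availableProcessors();
    }

    // Converts each scan to a split, serially or on a fixed-size thread pool.
    // The result order matches the input order in both modes.
    static <S, R> List<R> generateSplits(List<S> scans, Function<S, R> toSplit,
                                         int threshold, int parallelLevel) {
        List<R> splits = new ArrayList<>(scans.size());
        if (!shouldRunInParallel(threshold, scans.size())) {
            for (S scan : scans) {
                splits.add(toSplit.apply(scan));
            }
            return splits;
        }
        ExecutorService pool = Executors.newFixedThreadPool(parallelLevel);
        try {
            List<Callable<R>> tasks = new ArrayList<>(scans.size());
            for (S scan : scans) {
                tasks.add(() -> toSplit.apply(scan));
            }
            // invokeAll blocks until all tasks finish and returns futures in task order.
            for (Future<R> future : pool.invokeAll(tasks)) {
                splits.add(future.get());
            }
            return splits;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while generating splits", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("split generation failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Under these assumptions, a negative threshold always forces the serial path, and a threshold of 0 always selects the parallel path. In a Hive session the two properties would presumably be set with the usual `SET name=value;` syntax, for example `SET hive.phoenix.split.parallel.threshold=32;` (the value 32 is only an illustration).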
