[
https://issues.apache.org/jira/browse/PHOENIX-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537926#comment-17537926
]
ASF GitHub Bot commented on PHOENIX-6694:
-----------------------------------------
stoty commented on code in PR #80:
URL: https://github.com/apache/phoenix-connectors/pull/80#discussion_r874344921
##########
phoenix-spark-base/src/main/java/org/apache/phoenix/spark/datasource/v2/reader/PhoenixInputPartitionReader.java:
##########
@@ -94,6 +99,10 @@ private QueryPlan getQueryPlan() throws SQLException {
}
try (Connection conn = DriverManager.getConnection(
JDBC_PROTOCOL + JDBC_PROTOCOL_SEPARATOR + zkUrl,
overridingProps)) {
+ PTable pTable = PTable.parseFrom(options.getTableBytes());
+ org.apache.phoenix.schema.PTable table = PTableImpl.createFromProto(pTable);
+ PhoenixConnection phoenixConnection = conn.unwrap(PhoenixConnection.class);
+ phoenixConnection.addTable(table, System.currentTimeMillis());
Review Comment:
Interesting point about the timestamp.
The point of this patch is to avoid hammering the system tables with a huge
number of parallel requests.
I think that if we have executor starvation, then the jobs will not be
started immediately, so the system catalog (syscat) load is not really a problem.
Can you think of a case where the jobs are delayed long enough for this to
matter, but enough of them still start up synchronously for that to be a
problem? (I don't know enough about Spark to tell.)
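
To make the load argument concrete, here is a minimal, self-contained sketch in plain Java. It does not use Phoenix or Spark: `CatalogServer`, `SharedMetadataSketch`, and their methods are hypothetical names invented for illustration, and the "table bytes" are a dummy payload rather than a real serialized PTable. It only counts how many metadata lookups reach the catalog when every worker resolves the table itself versus when the driver resolves it once and hands the bytes to the workers.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the region server that hosts the system catalog.
class CatalogServer {
    private final AtomicInteger lookups = new AtomicInteger();

    // Each call models one metadata request against the catalog.
    byte[] fetchTableBytes(String tableName) {
        lookups.incrementAndGet();
        return ("schema-of-" + tableName).getBytes(StandardCharsets.UTF_8);
    }

    int lookupCount() {
        return lookups.get();
    }
}

class SharedMetadataSketch {
    // Every worker resolves the table itself: one catalog lookup per worker.
    static int lookupsWhenEachWorkerFetches(int workers) throws Exception {
        CatalogServer catalog = new CatalogServer();
        runWorkers(workers, () -> catalog.fetchTableBytes("T").length);
        return catalog.lookupCount();
    }

    // The driver resolves the table once; workers only read the shared bytes.
    static int lookupsWhenDriverShares(int workers) throws Exception {
        CatalogServer catalog = new CatalogServer();
        byte[] shared = catalog.fetchTableBytes("T");
        runWorkers(workers, () -> shared.length);
        return catalog.lookupCount();
    }

    // Runs the same task once per simulated worker on a small thread pool.
    private static void runWorkers(int workers, Callable<Integer> task) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < workers; i++) {
                futures.add(pool.submit(task));
            }
            for (Future<Integer> f : futures) {
                f.get(); // wait for completion and surface any failure
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("per-worker fetch: " + lookupsWhenEachWorkerFetches(1000));
        System.out.println("driver shares:    " + lookupsWhenDriverShares(1000));
    }
}
```

In this toy model the lookup count grows linearly with the number of workers in the first case and stays at one in the second. It says nothing about the timestamp question: bytes that are shared once can be stale by the time a delayed worker finally uses them.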
> Avoid unnecessary calls of fetching table meta data to region servers holding
> the system tables in batch oriented jobs in spark or hive otherwise those RS
> become hotspot
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: PHOENIX-6694
> URL: https://issues.apache.org/jira/browse/PHOENIX-6694
> Project: Phoenix
> Issue Type: Task
> Components: hive-connector, spark-connector
> Reporter: Rajeshbabu Chintaguntla
> Assignee: Rajeshbabu Chintaguntla
> Priority: Major
>
> Currently we prepare the query plan in both the data source and the partition
> readers, which creates a new connection in each worker and at job
> initialisation. Each connection unnecessarily touches the system catalog
> table, the system stats table, and meta. When a job has millions of
> parallel workers, they hotspot the region servers holding meta, the system
> catalog, and the system stats table. Sharing the same query plan
> between the workers avoids the hotspot.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)