[
https://issues.apache.org/jira/browse/PHOENIX-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541721#comment-17541721
]
ASF GitHub Bot commented on PHOENIX-6694:
-----------------------------------------
joshelser commented on code in PR #80:
URL: https://github.com/apache/phoenix-connectors/pull/80#discussion_r880977239
##########
phoenix-spark-base/src/main/java/org/apache/phoenix/spark/datasource/v2/reader/PhoenixInputPartitionReader.java:
##########
@@ -94,6 +99,10 @@ private QueryPlan getQueryPlan() throws SQLException {
}
try (Connection conn = DriverManager.getConnection(
JDBC_PROTOCOL + JDBC_PROTOCOL_SEPARATOR + zkUrl,
overridingProps)) {
+ PTable pTable = PTable.parseFrom(options.getTableBytes());
+ org.apache.phoenix.schema.PTable table = PTableImpl.createFromProto(pTable);
+ PhoenixConnection phoenixConnection = conn.unwrap(PhoenixConnection.class);
+ phoenixConnection.addTable(table, System.currentTimeMillis());
Review Comment:
> Can you think of a case where the jobs are delayed enough for this to
matter, but enough of them start up synchronously for the generated load to be
a problem (I don't know enough about Spark to tell)?
This used to be something that would happen in MapReduce with lots of
mappers (where the state of the world might change between job submission and
the time the last mappers actually got scheduled and ran). Honestly, I don't
think it's something we can really optimize anyway (and running `alter table`
commands multiple times a day is not the "common" path).
> the jobs will not be started immediately, and the syscat load is not
really a problem.
That's a good point too. I didn't think about that side.
> Avoid unnecessary calls of fetching table meta data to region servers holding
> the system tables in batch oriented jobs in spark or hive otherwise those RS
> become hotspot
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: PHOENIX-6694
> URL: https://issues.apache.org/jira/browse/PHOENIX-6694
> Project: Phoenix
> Issue Type: Task
> Components: hive-connector, spark-connector
> Reporter: Rajeshbabu Chintaguntla
> Assignee: Rajeshbabu Chintaguntla
> Priority: Major
>
> Currently we prepare the query plan in both the data source and the partition
> readers, which creates a new connection in each worker and at job
> initialisation. Each of these connections unnecessarily touches the system
> catalog table, the system stats table, and meta. Jobs with millions of
> parallel workers therefore hotspot the region servers holding the meta,
> system catalog, and system stats tables. Sharing the same query plan
> between the workers avoids the hotspot.
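
To make the idea in the description concrete, here is a minimal, self-contained sketch of the pattern: fetch the table metadata once on the driver, ship the serialized bytes to each worker, and pre-populate the worker's local cache so it never goes back to the catalog. This is not the Phoenix implementation. `SharedMetadataSketch`, `WorkerConnection`, `fetchTableBytesFromCatalog` and `runJob` are hypothetical names, and the "catalog" is just a counter standing in for SYSTEM.CATALOG lookups.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedMetadataSketch {
    /** Hypothetical stand-in for SYSTEM.CATALOG: counts how often it is queried. */
    static final AtomicInteger catalogLookups = new AtomicInteger();

    /** Simulates the expensive metadata fetch and returns a serialized table. */
    static byte[] fetchTableBytesFromCatalog(String tableName) {
        catalogLookups.incrementAndGet();
        return ("schema-of:" + tableName).getBytes(StandardCharsets.UTF_8);
    }

    /** Hypothetical worker-side connection with a local metadata cache. */
    static final class WorkerConnection {
        private final Map<String, String> metadataCache = new HashMap<>();

        /** Pre-populates the cache from bytes shipped by the driver. */
        void addTable(String tableName, byte[] tableBytes) {
            metadataCache.put(tableName, new String(tableBytes, StandardCharsets.UTF_8));
        }

        /** Uses the cached metadata if present, otherwise hits the catalog. */
        String resolve(String tableName) {
            String cached = metadataCache.get(tableName);
            if (cached != null) {
                return cached;
            }
            return new String(fetchTableBytesFromCatalog(tableName), StandardCharsets.UTF_8);
        }
    }

    /** Runs a simulated job and returns the number of catalog lookups it caused. */
    static int runJob(String tableName, int workers, boolean shareMetadata) {
        catalogLookups.set(0);
        // With sharing, the driver fetches the metadata exactly once.
        byte[] shared = shareMetadata ? fetchTableBytesFromCatalog(tableName) : null;
        for (int i = 0; i < workers; i++) {
            WorkerConnection conn = new WorkerConnection();
            if (shared != null) {
                conn.addTable(tableName, shared);
            }
            conn.resolve(tableName);
        }
        return catalogLookups.get();
    }

    public static void main(String[] args) {
        System.out.println("without sharing: " + runJob("T1", 1000, false));
        System.out.println("with sharing: " + runJob("T1", 1000, true));
    }
}
```

The trade-off discussed in the review applies to the sketch as well: a worker that starts long after job submission resolves against the shipped snapshot, so a schema change made in between is not visible to it.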