[ https://issues.apache.org/jira/browse/PHOENIX-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537780#comment-17537780 ]

ASF GitHub Bot commented on PHOENIX-6694:
-----------------------------------------

joshelser commented on code in PR #80:
URL: https://github.com/apache/phoenix-connectors/pull/80#discussion_r874129143


##########
phoenix-spark-base/src/main/java/org/apache/phoenix/spark/datasource/v2/reader/PhoenixInputPartitionReader.java:
##########
@@ -94,6 +99,10 @@ private QueryPlan getQueryPlan() throws SQLException {
         }
         try (Connection conn = DriverManager.getConnection(
                 JDBC_PROTOCOL + JDBC_PROTOCOL_SEPARATOR + zkUrl, overridingProps)) {
+            PTable pTable = PTable.parseFrom(options.getTableBytes());
+            org.apache.phoenix.schema.PTable table = PTableImpl.createFromProto(pTable);
+            PhoenixConnection phoenixConnection = conn.unwrap(PhoenixConnection.class);
+            phoenixConnection.addTable(table, System.currentTimeMillis());

Review Comment:
   Please add a log message that we are adding a cached version of the PTable here (rather than resolving the real table).
   
   On the off chance that there is some problem (from when we read the table until when we ran this code), it would be good to remind anyone looking at the log that we're doing this trick.



##########
phoenix-spark-base/src/main/java/org/apache/phoenix/spark/datasource/v2/reader/PhoenixInputPartitionReader.java:
##########
@@ -81,7 +86,7 @@ Properties getOverriddenPropsFromOptions() {
         return options.getOverriddenProps();
     }
 
-    private QueryPlan getQueryPlan() throws SQLException {
+    private QueryPlan getQueryPlan() throws SQLException, InvalidProtocolBufferException {

Review Comment:
   I think you could catch the `InvalidProtocolBufferException` and throw it as a RuntimeException in `getQueryPlan()`, rather than doing it in `initialize()`.
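   
   A sketch of that pattern, assuming the protobuf parsing moves into a small helper called from `getQueryPlan()`; `parseTable` is a hypothetical name, not from the patch, and the usual protobuf and Phoenix imports are assumed:
   
   ```java
   // Hypothetical helper: wrap the checked protobuf exception in an unchecked
   // one, so getQueryPlan() can keep its "throws SQLException" signature.
   private org.apache.phoenix.schema.PTable parseTable(byte[] tableBytes) {
       try {
           return PTableImpl.createFromProto(PTableProtos.PTable.parseFrom(tableBytes));
       } catch (InvalidProtocolBufferException e) {
           throw new RuntimeException("Failed to deserialize the cached PTable", e);
       }
   }
   ```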



##########
phoenix-spark-base/src/main/java/org/apache/phoenix/spark/datasource/v2/reader/PhoenixInputPartitionReader.java:
##########
@@ -94,6 +99,10 @@ private QueryPlan getQueryPlan() throws SQLException {
         }
         try (Connection conn = DriverManager.getConnection(
                 JDBC_PROTOCOL + JDBC_PROTOCOL_SEPARATOR + zkUrl, overridingProps)) {
+            PTable pTable = PTable.parseFrom(options.getTableBytes());
+            org.apache.phoenix.schema.PTable table = PTableImpl.createFromProto(pTable);
+            PhoenixConnection phoenixConnection = conn.unwrap(PhoenixConnection.class);
+            phoenixConnection.addTable(table, System.currentTimeMillis());

Review Comment:
   Also, do we want to record the time at which the PTable was cached and use that rather than `System.currentTimeMillis()`?
   
   My thinking is probably "no". If we use the original time we read the PTable and a long time elapsed (really big job, starvation on the Executor side), it could be hour(s) until this code is actually executed. If `UPDATE_CACHE_FREQUENCY` isn't increased to a large value, it might cause the Phoenix client to re-fetch the PTable? (just guessing)
   
   Have you thought about this case already?
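   
   For illustration, the two timestamp choices side by side; `getTableResolvedTimestamp()` is a hypothetical accessor, not something the options currently expose:
   
   ```java
   // Rejected option: reuse the driver-side resolution time. If hours pass
   // before the executor runs, the cached entry may immediately look stale
   // under UPDATE_CACHE_FREQUENCY and trigger a re-fetch from the server.
   // phoenixConnection.addTable(table, options.getTableResolvedTimestamp());
   
   // The patch's option: treat the PTable as freshly resolved on the executor.
   phoenixConnection.addTable(table, System.currentTimeMillis());
   ```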





> Avoid unnecessary table metadata fetches to the region servers holding the
> system tables in batch-oriented Spark or Hive jobs; otherwise those RS
> become hotspots
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-6694
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-6694
>             Project: Phoenix
>          Issue Type: Task
>          Components: hive-connector, spark-connector
>            Reporter: Rajeshbabu Chintaguntla
>            Assignee: Rajeshbabu Chintaguntla
>            Priority: Major
>
> Currently we prepare the query plan in both the data source and the partition
> readers, which creates a new connection in each worker and at job
> initialisation, and unnecessarily touches the system catalog table, the
> system stats table, and meta. Jobs with millions of parallel workers
> therefore hotspot the region servers holding meta, the system catalog, and
> the system stats table. Sharing the same query plan between the workers
> avoids this hotspot.
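
A hedged sketch of the overall approach under review: resolve the table once on the driver, ship the serialized PTable alongside the read options, and seed each executor connection's metadata cache from those bytes. The class and method names below are illustrative, not the patch's:

```java
import java.sql.Connection;

import org.apache.phoenix.coprocessor.generated.PTableProtos;
import org.apache.phoenix.jdbc.PhoenixConnection;
import org.apache.phoenix.schema.PTable;
import org.apache.phoenix.schema.PTableImpl;

public final class PTableShippingSketch {

    // Driver side: serialize the PTable that was already resolved during
    // planning, so workers never have to resolve it themselves.
    public static byte[] serialize(PTable table) {
        return PTableImpl.toProto(table).toByteArray();
    }

    // Executor side: rebuild the PTable and add it to the connection's cache,
    // so query planning skips SYSTEM.CATALOG / SYSTEM.STATS round trips.
    public static void seedCache(Connection conn, byte[] tableBytes) throws Exception {
        PTable table = PTableImpl.createFromProto(PTableProtos.PTable.parseFrom(tableBytes));
        conn.unwrap(PhoenixConnection.class).addTable(table, System.currentTimeMillis());
    }
}
```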



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
