HeartSaVioR commented on a change in pull request #34502:
URL: https://github.com/apache/spark/pull/34502#discussion_r752576654



##########
File path: docs/structured-streaming-programming-guide.md
##########
@@ -1956,8 +1956,21 @@ Here are the configs regarding to RocksDB instance of 
the state store provider:
     <td>Whether we resets all ticker and histogram stats for RocksDB on 
load.</td>
     <td>True</td>
   </tr>
+  <tr>
+    <td>spark.sql.streaming.stateStore.rocksdb.trackTotalNumberOfRows</td>
+    <td>Whether we track the total number of rows in state store. Please refer 
the details in <a href="#performance-aspect-considerations">Performance-aspect 
considerations</a>.</td>
+    <td>True</td>
+  </tr>
 </table>
 
+##### Performance-aspect considerations
+
+1. For write-heavy workloads, you may want to disable the track of total 
number of rows.

Review comment:
       `1.` is not a typo. I just wanted to reserve space for items we will 
eventually add. I'm not a RocksDB expert, so I don't have the insight to write 
tuning guidance, but RocksDB itself seems to provide lots of things to tune, so 
more items may come up later.
   
   I agree that "write-heavy workloads" is unclear; basically it means a high 
volume of updates (writes/deletes) against the state store. This cannot be 
inferred from the input volume alone, since it depends on the operator and the 
window: if the input produces lots of state keys in a streaming aggregation, it 
will issue lots of writes against the state store. If the input is huge but 
binds to only a few windows, it will issue only a few writes against the state 
store.
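   
   To illustrate the point about keys and windows, here is a toy model in plain 
Python. It is not Spark code: `count_state_writes` and both sample batches are 
hypothetical, and it assumes one state store write per distinct (key, window) 
pair touched in a micro-batch, which is a simplification.

```python
def count_state_writes(events, window_seconds):
    """Count distinct (key, window) pairs touched by one micro-batch.

    Simplified assumption: a streaming aggregation issues one state store
    write per distinct state key, however many input rows map to that key.
    """
    touched = set()
    for key, event_time in events:
        # Tumbling window: round the event time down to the window start.
        window_start = event_time - (event_time % window_seconds)
        touched.add((key, window_start))
    return len(touched)


# Hypothetical batch A: 10,000 rows, each row has its own key.
many_keys = [(f"user-{i}", i % 60) for i in range(10_000)]
# Hypothetical batch B: 10,000 rows, one key, spread over two 60-second windows.
few_windows = [("sensor-1", i % 120) for i in range(10_000)]

writes_many = count_state_writes(many_keys, window_seconds=60)
writes_few = count_state_writes(few_windows, window_seconds=60)
print(writes_many, writes_few)
```

   Both batches have the same number of input rows, but the first touches far 
more state keys than the second, so input volume alone says little about the 
write load on the state store.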
   
   We can probably leverage the state metrics "rows to update" and "rows to 
delete". They represent the number of updates. Technically, this change doesn't 
introduce a performance regression in any workload, so it isn't limited to 
write-heavy workloads. We are making a trade-off on observability, so it's up 
to end users to choose between performance and observability.
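   
   As a rough sketch of how a user could check those metrics, the snippet below 
sums them from a streaming progress payload. It assumes the two metrics surface 
as `numRowsUpdated` and `numRowsRemoved` under `stateOperators` in the progress 
JSON (for example what `query.lastProgress` returns in PySpark). The sample 
payload and its values are made up, and `state_update_volume` is a hypothetical 
helper.

```python
import json

# Hypothetical, trimmed-down progress payload; the numbers are invented.
SAMPLE_PROGRESS = json.loads("""
{
  "batchId": 12,
  "numInputRows": 500000,
  "stateOperators": [
    {"numRowsTotal": 120000, "numRowsUpdated": 95000, "numRowsRemoved": 30000},
    {"numRowsTotal": 800, "numRowsUpdated": 40, "numRowsRemoved": 0}
  ]
}
""")


def state_update_volume(progress):
    """Sum updated and removed state rows across all stateful operators."""
    updated = 0
    removed = 0
    for op in progress.get("stateOperators", []):
        # Missing fields count as zero (assumed absent on some versions).
        updated += op.get("numRowsUpdated", 0)
        removed += op.get("numRowsRemoved", 0)
    return {"updated": updated, "removed": removed}


volume = state_update_volume(SAMPLE_PROGRESS)
print(volume)
```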
   
   It looks like it'd be better to remove the phrase "For write-heavy 
workloads" and simply say "to gain additional performance on state store", with 
a hint that disabling the tracking will be more effective when the state 
metrics "rows to update" and "rows to delete" are high.
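   
   For reference, a minimal sketch of turning the tracking off. The config key 
is the one added in this diff. The provider class setting is my assumption 
about how the RocksDB state store is enabled in the job, and `rocksdb_conf` and 
`to_submit_args` are hypothetical helpers that only build the settings. The 
same key/value pairs could be passed through `SparkSession.builder.config(key, 
value)`.

```python
# Config key taken from the diff under review.
TRACK_ROWS_KEY = "spark.sql.streaming.stateStore.rocksdb.trackTotalNumberOfRows"
# Assumption: the job selects the RocksDB provider through this setting.
PROVIDER_KEY = "spark.sql.streaming.stateStore.providerClass"
ROCKSDB_PROVIDER = (
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider"
)


def rocksdb_conf(track_total_rows):
    """Build the settings as strings, the way Spark configs are passed."""
    return {
        PROVIDER_KEY: ROCKSDB_PROVIDER,
        TRACK_ROWS_KEY: str(track_total_rows).lower(),
    }


def to_submit_args(conf):
    """Render the settings as `--conf key=value` pairs for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


conf = rocksdb_conf(track_total_rows=False)
submit_args = to_submit_args(conf)
print(" ".join(submit_args))
```

   With the tracking disabled, the total row count is no longer available as a 
metric, which is the observability side of the trade-off.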
   
   Thanks for the input!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


