Re: [PR] PHOENIX-7485: Add metrics for time taken during mutation plan creation and execution [phoenix]

via GitHub Mon, 24 Feb 2025 01:12:46 -0800


sanjeet006py commented on PR #2061:
URL: https://github.com/apache/phoenix/pull/2061#issuecomment-2677810081


   > 1 ms was important for a huge time taken? How huge was the huge time?
   
   Extra 1ms was spent in every `executeUpdate()` call and a user was trying to 
insert 2M rows so, it ended up taking ~26 mins while if we remove that 1ms from 
`executeUpdate()` then 2M rows can be inserted within 24-25 secs. And, the 
intention was to ultimately insert 40M rows which would shown regression in 
hrs. The extra 1ms was being taken to get table metadata from SYSCAT.
   
   > Did the 1 ms include any metadata RPCs, and if so, should the metric be 
captured specifically for calls to SYSTEM.CATALOG or meta or something similar? 
In this way, we need not spend cycles measuring local work that is expected to 
be fast.
   
   The extra 1ms did come from time spent in call to SYSCAT to get table 
metadata and we can add a metric to specifically tell us if call to SYSCAT RS 
was done or not. But in my opinion a regression can be introduced in these 
paths at any time. This time the regression was extra metadata call but next 
time it can be something else. So, I think we can capture the time taken by 
stages to ensure that each stage is behaving as expected. 
   
   > Is this metric/log only going to be useful in the cases where we send 
RPCs, or do we think that we really are spending a lot of time locally in 
planning, without any network calls?
   
   I believe this metric can help us highlight any future regressions 
introduced in the path of mutation planning as this time the first task was to 
confirm that a regression has been introduced. Currently, because of lack of 
metrics for mutation planning stages we can't even tell if a regression has 
been introduced.
   
   > Any JVM pause for GC or otherwise could likely last longer than 0.5 
milliseconds, so the log message, if that's the choice, shouldn't be misleading 
that it is some kind of error or egregious performance scenario.
   
   Yeah, I plan to log a generic warning message that mutation planning took x 
time compared to some expected y threshold and I also plan to make that 
threshold configurable on client side via a config.
   
   > I suppose we do want to know metrics around how many RPCs are required to 
serve the request, especially when those include additional RPCs like system 
RPCs, which may not always be required. But those are more difficult to 
instrument, and so we are choosing to instrument mutation planning, only 
because it's a top-level "span" and we don't need to plumb the instrumentation 
all the way down through the code?
   
   Actually, I didn't thought of adding a metric to measure if RPC call is 
being done to SYSCAT to get metadata as my aim was to have a way to tell in 
future that regression has been introduced in mutation planning stage as its a 
very hot method call. Having such a metric can help quickly root cause. So, 
once we know that a regression has been introduced in mutation planning then we 
can check such metric to quickly root cause. WDYT?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@phoenix.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] PHOENIX-7485: Add metrics for time taken during mutation plan creation and execution [phoenix]

Reply via email to