[jira] [Updated] (PHOENIX-7002) Insufficient logging in phoenix client when server throws StaleRegionBoundaryCacheException.

Rushabh Shah (Jira) Mon, 24 Jul 2023 14:23:13 -0700


     [ 
https://issues.apache.org/jira/browse/PHOENIX-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rushabh Shah updated PHOENIX-7002:
----------------------------------
    Description: 
Saw an incident in a production cluster where a Phoenix range scan query 
returned results outside the range provided by the customer. hbck repair runs 
were in progress while the query was running. At the start of the query there 
were region holes in the table (we have no way to confirm this), and while the 
query was still running we ran an hbck repair operation that caused region 
overlaps (this is confirmed, since the overlaps persisted after the query). 
Unfortunately, no exceptions, errors, or stack traces were logged on either 
the client or the server side.
After a query runs, we log its execution time and the number of exceptions 
encountered in a single log line. That line shows this query encountered 
[StaleRegionBoundaryCacheException|https://github.com/apache/phoenix/blob/4.16/phoenix-core/src/main/java/org/apache/phoenix/monitoring/MetricType.java#L57]
 4 times.

BaseResultIterators contains logic that adjusts the start and end keys of the 
scan range. See 
[here|https://github.com/apache/phoenix/blob/4.16/phoenix-core/src/main/java/org/apache/phoenix/iterate/BaseResultIterators.java#L688-L730]
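
To make the key-range adjustment concrete, here is a rough sketch of clamping 
a scan range to a region's boundaries and checking whether a returned row lies 
inside the requested range. This is not the Phoenix code linked above; the 
class and method names (ScanRangeClamp, maxStart, minStop, inRange) are 
hypothetical. It assumes the HBase convention that an empty key means 
"unbounded" and that keys compare as unsigned bytes.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Phoenix or HBase. Requires Java 9+ for
// Arrays.compareUnsigned(byte[], byte[]).
final class ScanRangeClamp {

    /** Larger of two start keys; an empty start key sorts before everything. */
    public static byte[] maxStart(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b) >= 0 ? a : b;
    }

    /** Smaller of two stop keys; an empty stop key means "no upper bound". */
    public static byte[] minStop(byte[] a, byte[] b) {
        if (a.length == 0) {
            return b;
        }
        if (b.length == 0) {
            return a;
        }
        return Arrays.compareUnsigned(a, b) <= 0 ? a : b;
    }

    /** True if row is in [start, stop), treating an empty stop as unbounded. */
    public static boolean inRange(byte[] row, byte[] start, byte[] stop) {
        boolean afterStart = Arrays.compareUnsigned(row, start) >= 0;
        boolean beforeStop = stop.length == 0
                || Arrays.compareUnsigned(row, stop) < 0;
        return afterStart && beforeStop;
    }
}
```

A check like inRange on the client side would at least flag a row that falls 
outside the range the customer asked for.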

Without knowing the state of meta or which exceptions were encountered, it is 
very difficult to debug why this happened.

At the very least, we should log all such exceptions on the Phoenix client 
side.
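
As a rough illustration of the kind of client-side logging meant here, the 
sketch below logs every stale-boundary failure with the attempt number and the 
scan keys before retrying. It is not Phoenix code: StaleBoundaryException is a 
hypothetical stand-in for StaleRegionBoundaryCacheException, and 
runWithLogging, its parameters, and the use of java.util.logging are 
assumptions made for the example.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch, not part of Phoenix.
final class StaleBoundaryRetryLogging {
    private static final Logger LOG =
            Logger.getLogger(StaleBoundaryRetryLogging.class.getName());

    /** Stand-in for Phoenix's StaleRegionBoundaryCacheException. */
    public static class StaleBoundaryException extends Exception {
        public StaleBoundaryException(String message) {
            super(message);
        }
    }

    /**
     * Runs the scan and retries on stale-boundary failures, up to maxRetries
     * retries. Every failure is logged with its stack trace and recorded in
     * seen, so none of them is silently swallowed.
     */
    public static <T> T runWithLogging(Callable<T> scan, String startKey,
            String stopKey, int maxRetries, List<Exception> seen)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return scan.call();
            } catch (StaleBoundaryException e) {
                seen.add(e);
                LOG.log(Level.WARNING, String.format(
                        "Stale region boundary on attempt %d for scan [%s, %s)",
                        attempt, startKey, stopKey), e);
                if (attempt > maxRetries) {
                    throw e; // retries exhausted, surface the last failure
                }
            }
        }
    }
}
```

With something like this in place, each of the 4 occurrences in the incident 
would have left a log line with the scan boundaries in effect at that attempt.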

  was:
Saw an incident in a production cluster where a Phoenix range scan query 
returned results outside the range provided by the customer. hbck repair runs 
were in progress while the query was running. At the start of the query there 
were region holes in the table (we have no way to confirm this), and while the 
query was still running we ran an hbck repair operation that caused region 
overlaps (this is confirmed, since the overlaps persisted after the query). 
Unfortunately, no exceptions, errors, or stack traces were logged on either 
the client or the server side.
After a query runs, we log its execution time and the number of exceptions 
encountered in a single log line. That line shows this query encountered 
[StaleRegionBoundaryCacheException|https://github.com/apache/phoenix/blob/4.16/phoenix-core/src/main/java/org/apache/phoenix/monitoring/MetricType.java#L57].

BaseResultIterators contains logic that adjusts the start and end keys of the 
scan range. See 
[here|https://github.com/apache/phoenix/blob/4.16/phoenix-core/src/main/java/org/apache/phoenix/iterate/BaseResultIterators.java#L688-L730]

Without knowing the state of meta or which exceptions were encountered, it is 
very difficult to debug why this happened.

At the very least, we should log all such exceptions on the Phoenix client 
side.


> Insufficient logging in phoenix client when server throws 
> StaleRegionBoundaryCacheException.
> --------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-7002
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7002
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
