fresh-borzoni commented on PR #3295:
URL: https://github.com/apache/fluss/pull/3295#issuecomment-4533118529

   I still don't like this RocksDB snapshot pinning, tbh. It feels like it's 
going to be a problem for wide partitioned tables with many buckets and 
concurrent users.
   
   Basically it's an operational burden, in two ways:
     1. Hard to figure out why scans fail. A single scan over >200 buckets per 
tablet server hits TOO_MANY_SCANNERS, the client retries 3× in under 700 ms, 
can't recover (real scans don't drain in under a second), and dies. The error 
doesn't say "per-server cap", so operators see a generic exception and have to 
dig through server logs. Two concurrent scans interfering with each other is 
worse: nobody expects read traffic to throttle other read traffic.
     2. Writes degrade because memtables can't be flushed. Each held iterator 
forces the leader to keep memtables in memory until the scan ends. Memory 
pressure builds, compaction falls behind, write-path tail latency goes up. 
Operators see "P99 writes got slow during a batch read" with no causal trail in 
the logs from symptom back to the scan.
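
To make the first failure mode concrete, here is a toy model of a per-server scanner cap with a short client retry budget. It is only a sketch, not the Fluss implementation: `ScannerLimiter`, `run_scan`, and the 100/200/300 ms backoff schedule are hypothetical names and values picked to match the "3 retries in under 700 ms" behaviour described above, and it assumes one held scanner per bucket.

```python
from dataclasses import dataclass


@dataclass
class ScannerLimiter:
    """Toy per-server cap on open scanners (hypothetical, not Fluss code)."""
    max_per_server: int = 200
    open_scanners: int = 0

    def try_open(self) -> bool:
        # Reject once the server-wide cap is reached.
        if self.open_scanners >= self.max_per_server:
            return False
        self.open_scanners += 1
        return True


def run_scan(limiter, buckets_on_server, backoffs_ms=(100, 200, 300)):
    """Open one scanner per bucket and hold it; retry briefly on rejection.

    Returns (status, buckets_opened, simulated_wait_ms). Nothing is released
    during the retry window, mirroring scans that take longer than the
    retry budget to drain.
    """
    waited_ms = 0
    for bucket in range(buckets_on_server):
        retries = iter(backoffs_ms)
        while not limiter.try_open():
            delay = next(retries, None)
            if delay is None:
                return "TOO_MANY_SCANNERS", bucket, waited_ms
            waited_ms += delay  # simulated wait, no real sleep
    return "OK", buckets_on_server, waited_ms


if __name__ == "__main__":
    # One scan over 201 buckets on a single server exhausts the cap by itself.
    print(run_scan(ScannerLimiter(), 201))

    # Two scans of 150 buckets each share one server: the second one fails.
    shared = ScannerLimiter()
    print(run_scan(shared, 150))
    print(run_scan(shared, 150))
```

In this model the second scan fails only because of the first one, which is the read-throttles-read interference described in point 1.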
   
   And bumping kv.scanner.max-per-server past 200 doesn't really help: it just 
trades TOO_MANY_SCANNERS for memory pressure (several GB of pinning per server 
for tables with many partitions and many buckets).
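
A back-of-the-envelope sketch of that trade-off, assuming pinned memory grows roughly linearly with the number of held scanners. The 16 MiB pinned per scanner is a made-up illustrative figure, not a measured Fluss or RocksDB number, and `pinned_bytes` is a hypothetical helper.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def pinned_bytes(scanners: int, pinned_per_scanner_bytes: int) -> int:
    """Upper bound on pinned memory if every held scanner pins its own memtable data."""
    return scanners * pinned_per_scanner_bytes


if __name__ == "__main__":
    # Hypothetical per-scanner footprint; replace with a measured value.
    per_scanner = 16 * MIB
    for cap in (200, 400, 800):
        total = pinned_bytes(cap, per_scanner)
        print(f"cap={cap}: up to {total / GIB:.2f} GiB pinned per server")
```

Under that assumption, raising the cap raises the worst-case pinned memory proportionally, so the limit just moves from scanner count to heap.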

