liutang123 opened a new issue, #16356:
URL: https://github.com/apache/doris/issues/16356

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Description
   
   Currently, queries, compaction, balance, and consistency checks can all fail 
because of disk errors.
   There are two types of disk failure:
   1. The whole disk is broken.
      BE can detect this case. However, because tablet reports are delayed, many 
queries may still be sent to this BE and fail.
   2. The disk has bad sectors or bad tracks.
      FE cannot detect this case. Queries on the bad replicas will fail, and 
this case may not be repaired automatically.
   
   ### Solution
   
   ## Sense IO errors
   1. Sense IO errors during queries, and record the number of IO errors in BE.
   2. Record the number of IO errors during compaction.
   3. Sense IO errors during clone and select another BE as the source. Record the 
number of IO errors in BE.
   4. Record the number of IO errors during consistency checks.
   5. Run checksums periodically in BE.
   6. When IO errors occur multiple times on a replica, mark the replica as bad 
and report it to FE master.
   7. When BE finds that a disk is completely broken, it drops all replicas on 
that disk and reports to FE master. We can raise the priority of disk reports so 
that they are not held back by the report queue size limit.
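
The counting and reporting steps above can be sketched roughly as follows. This is only an illustration, not Doris BE code: `IoErrorTracker`, `record`, `drain_report`, and the `max_errors` threshold are all hypothetical names and values, and the sketch assumes that errors are counted per replica.

```python
from collections import defaultdict


class IoErrorTracker:
    """Hypothetical per-replica IO error counter (not the Doris BE API)."""

    def __init__(self, max_errors=3):
        # max_errors is an assumed threshold; the issue does not give a value.
        self.max_errors = max_errors
        self.counts = defaultdict(int)     # replica_id -> number of IO errors
        self.by_source = defaultdict(int)  # "query", "compaction", ... -> errors
        self.bad = set()                   # replicas already marked bad
        self.pending_report = []           # replicas not yet reported to FE

    def record(self, replica_id, source):
        """Record one IO error; return True if the replica just became bad."""
        self.by_source[source] += 1
        if replica_id in self.bad:
            return False
        self.counts[replica_id] += 1
        if self.counts[replica_id] >= self.max_errors:
            self.bad.add(replica_id)
            self.pending_report.append(replica_id)
            return True
        return False

    def drain_report(self):
        """Return the replicas to report to FE master and clear the queue."""
        report, self.pending_report = self.pending_report, []
        return report
```

Every path that touches the disk (query, compaction, clone, consistency check, periodic checksum) would call `record`, and the report thread would call `drain_report` when it builds the next report.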
   
   ## Query fault tolerance
   1. When FE sees IO errors during a query, it adds the BE to a blacklist in 
the coordinator and retries the query without this BE.
   2. Optimization (TODO): add a new `Greylist` class or a new BE state. When FE 
master finds that the number of disks on a BE has decreased, it marks the BE as 
unhealthy and tries not to use it for queries. When the tablet diff (FE - BE) 
falls below a limit, it marks the BE as normal again.
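
The retry in step 1 might look roughly like the sketch below. It is not the FE coordinator code: `IoError`, `run_with_blacklist`, the `execute` callable, and the `max_retries` default are hypothetical and exist only to show the control flow.

```python
class IoError(Exception):
    """Hypothetical error raised when a BE reports an IO error for a query."""

    def __init__(self, backend):
        super().__init__(f"io error on {backend}")
        self.backend = backend


def run_with_blacklist(backends, execute, max_retries=3):
    """Run execute(candidates), excluding every BE that raised IoError.

    `execute` is a hypothetical callable that takes the list of usable
    backends and either returns a result or raises IoError.
    Returns (result, blacklist).
    """
    blacklist = set()
    for _ in range(max_retries + 1):
        candidates = [b for b in backends if b not in blacklist]
        if not candidates:
            raise RuntimeError("no healthy backend left")
        try:
            return execute(candidates), blacklist
        except IoError as err:
            # Exclude the failing BE and plan the query again without it.
            blacklist.add(err.backend)
    raise RuntimeError("query failed after retries")
```

The blacklist here lives only for the duration of one query, which matches the per-coordinator scope described in step 1. A `Greylist` as in step 2 would need state that outlives a single query.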
   
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

