[ 
https://issues.apache.org/jira/browse/HBASE-18786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199403#comment-16199403
 ] 

Andrew Purtell edited comment on HBASE-18786 at 10/10/17 9:28 PM:
------------------------------------------------------------------

This patch removed the silent rescan of store files that happened when some 
glitch led to a FileNotFoundException (FNFE) during scanning. If the cause was 
a temporary store file accounting failure, then after one server aborts and 
another picks up the region, the new server will not see another FNFE, so this 
is not a cascading failure condition. 

If there is a permanent condition which could lead to a cascade, like hard loss 
of files or directories in HDFS, leading to FNFEs and a cascading failure 
situation, then I don't see how rescanning would help, and anyway we should 
handle it differently. Previously we would have silently opened the region with 
missing files (?). That would be bad. Aborting would be bad too in that case. 
Rather than aborting we should fail the region open. This should be handled 
with a new JIRA.
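
To make the distinction above concrete, here is a minimal sketch of the policy 
being argued for: an FNFE during a scan aborts the server, while files found 
missing at region open fail the open instead. Every name in it (FnfePolicy, 
Phase, Action, decide, verifyStoreFiles) is hypothetical and invented for 
illustration. None of this is HBase code or the contents of the patch.

```java
import java.io.FileNotFoundException;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; these are not HBase classes or APIs.
final class FnfePolicy {
    /** Where the FileNotFoundException surfaced. */
    enum Phase { REGION_OPEN, SCAN }

    /** What the server should do in response. */
    enum Action { FAIL_REGION_OPEN, ABORT_SERVER }

    /**
     * Missing files seen at open time fail the open, so the region is never
     * served with holes and the server itself stays up. An FNFE during a scan
     * aborts the server rather than triggering a silent rescan.
     */
    static Action decide(Phase phase) {
        return phase == Phase.REGION_OPEN
            ? Action.FAIL_REGION_OPEN
            : Action.ABORT_SERVER;
    }

    /** Assumed open-time check: every expected store file must be present. */
    static void verifyStoreFiles(List<String> expected, Set<String> present)
            throws FileNotFoundException {
        for (String file : expected) {
            if (!present.contains(file)) {
                throw new FileNotFoundException("missing store file: " + file);
            }
        }
    }
}
```

Under these assumptions, a permanent loss of files is caught by the open-time 
check on whichever server tries to open the region, so it shows up as a failed 
open rather than as a chain of server aborts.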


was (Author: apurtell):
What this patch removed was a silent rescan of storefiles if there is some kind 
of glitch leading to a FNFE during scanning.  If there was a temporary store 
file accounting failure then after one server aborts and another picks it up, 
the new server will not see another FNFE. 

If there is a permanent condition, like loss of files or directories in HDFS, 
leading to FNFEs and a cascading failure situation, then I don't see how 
rescanning would help, and anyway we should handle it differently. Previously 
we would have silently opened the region with missing files (?). That would be 
bad. Aborting would be bad too in that case. Rather than aborting we should 
fail the region open. This should be handled with a new JIRA.

> FileNotFoundException should not be silently handled for primary region 
> replicas
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-18786
>                 URL: https://issues.apache.org/jira/browse/HBASE-18786
>             Project: HBase
>          Issue Type: Sub-task
>          Components: regionserver, Scanners
>            Reporter: Ashu Pachauri
>            Assignee: Andrew Purtell
>             Fix For: 2.0.0, 3.0.0, 1.4.0, 1.5.0
>
>         Attachments: HBASE-18786-branch-1.3.patch, 
> HBASE-18786-branch-1.patch, HBASE-18786-branch-1.patch, HBASE-18786.patch, 
> HBASE-18786.patch
>
>
> This is a follow up for HBASE-18186.
> FileNotFoundException while scanning from a primary region replica can be 
> indicative of a more severe problem. Handling them silently can cause many 
> underlying issues to go undetected. We should either
> 1. Hard fail the regionserver if there is a FNFE on a primary region replica, 
> OR
> 2. Report these exceptions as some region / server level metric so that these 
> can be proactively investigated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
