[ https://issues.apache.org/jira/browse/HBASE-21183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Robertson updated HBASE-21183:
----------------------------------
    Description: 
On a nightly batch job which prepares 100s of well-balanced HFiles at around 2GB each, we see sporadic failures in a bulk load.

I'm unable to paste the logs here (different network), but they show e.g. the following on a failing day:
{code:java}
Trying to load hfile... /my/input/path/...
Attempt to bulk load region containing ... failed. This is recoverable and will be retried
Attempt to bulk load region containing ... failed. This is recoverable and will be retried
Attempt to bulk load region containing ... failed. This is recoverable and will be retried
Split occurred while grouping HFiles, retry attempt 1 with 3 files remaining to group or split
Trying to load hfile...
IOException during splitting
java.io.FileNotFoundException: File does not exist: /my/input/path/...
{code}
The exception gets thrown from [this line|https://github.com/apache/hbase/blob/branch-1.2/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java#L685].

I should note that this is a secure cluster (CDH 5.12.x).

I've tried to go through the code and don't spot an obvious race condition. I don't spot any changes related to this in the later 1.x versions, so I presume it exists in 1.5 as well.

I'm yet to get access to the NameNode audit logs from when this occurs, to trace through the rename() calls around these particular files.

I don't see timeouts like HBASE-4030

  was:
On a nightly batch job which prepares 100s of well-balanced HFiles at around 2GB each, we see sporadic failures in a bulk load.

I'm unable to paste the logs here (different network), but they show e.g. the following on a failing day:
{code}
Trying to load hfile... /my/input/path/...
Attempt to bulk load region containing ... failed. This is recoverable and will be retried
Attempt to bulk load region containing ... failed. This is recoverable and will be retried
Attempt to bulk load region containing ... failed. This is recoverable and will be retried
Split occurred while grouping HFiles, retry attempt 1 with 3 files remaining to group or split
Trying to load hfile...
IOException during splitting
java.io.FileNotFoundException: File does not exist: /my/input/path/...
{code}
The exception gets thrown from [this line|https://github.com/apache/hbase/blob/branch-1.2/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java#L685].

I should note that this is a secure cluster (CDH 5.12.x).

I've tried to go through the code and don't spot an obvious race condition. I don't spot any changes related to this in the later 1.x versions, so I presume it exists in 1.5 as well.

I'm yet to get access to the NameNode audit logs from when this occurs, to trace through the rename() calls around these particular files.


> loadincrementalHFiles sometimes throws FileNotFoundException on retry
> ---------------------------------------------------------------------
>
>                 Key: HBASE-21183
>                 URL: https://issues.apache.org/jira/browse/HBASE-21183
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.2.0
>            Reporter: Tim Robertson
>            Priority: Major
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
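The suspected failure mode above (a retry re-reading a staging path whose HFile has already been moved by an earlier, apparently-failed attempt) can be illustrated outside HBase. The following is a minimal, hypothetical sketch using plain Java NIO, not the actual LoadIncrementalHFiles code path; the class and path names are illustrative only. It assumes the server-side rename() of the first attempt succeeded even though the client treated the attempt as recoverable and retried:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical sketch: the first bulk-load attempt renames the HFile into the
// region directory, then the client retries and re-opens the original staging
// path, which no longer exists -- analogous to the reported
// java.io.FileNotFoundException during the retry's split/group step.
public class BulkLoadRetrySketch {
    public static void main(String[] args) throws IOException {
        Path staging = Files.createTempDirectory("staging");
        Path region = Files.createTempDirectory("region");
        Path hfile = Files.createFile(staging.resolve("hfile1"));

        // First attempt: server side moves the HFile into place via rename()...
        Files.move(hfile, region.resolve("hfile1"));
        // ...but the client believes the attempt failed and schedules a retry.

        boolean retrySawMissingFile = false;
        try {
            // Retry: re-open the original staging path, which rename() already
            // emptied. NoSuchFileException is NIO's FileNotFoundException analog.
            Files.newInputStream(hfile).close();
        } catch (NoSuchFileException e) {
            retrySawMissingFile = true;
        }
        System.out.println("retry saw missing file: " + retrySawMissingFile);
    }
}
```

If the race works this way, the NameNode audit log mentioned above should show a successful rename() of the staging file by the first attempt shortly before the retry's failed open.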