[ 
https://issues.apache.org/jira/browse/HBASE-24420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuchang updated HBASE-24420:
----------------------------
    Description: 
In https://issues.apache.org/jira/browse/HBASE-14541, the retry logic changed 
from a configurable retry count (set by the configuration item 
hbase.bulkload.retries.number) to the logic below, in order to handle region 
splits that happen during a bulk load:

```
int maxRetries = getConf().getInt("hbase.bulkload.retries.number", 10);
maxRetries = Math.max(maxRetries, startEndKeys.getFirst().length + 1);
if (maxRetries != 0 && count >= maxRetries) {
 throw new IOException("Retry attempted " + count +
 " times without completing, bailing out");
}
```
This change caused another problem in our cluster.
Our table has 2000 regions, and our bulk load failed because of a 
configuration issue. The failure was caused by a client misconfiguration that 
made the RegionServer fail to load the HFile. The failure is not thrown to the 
client as an exception; instead it is reported through the `loaded` field of 
`BulkLoadHFileResponse`:
```
message BulkLoadHFileResponse {
 required bool loaded = 1;
}
```
After this failure, the bulk load fell into a runaway retry loop, and after 
the retry count reached about 200, our HDFS crashed with an OOM.

During this time, no region splits ever happened on the HBase table.

I think the patch in HBASE-14541 did not handle the case where retrying can 
never succeed. In that case (and I think many causes can lead to it), the 
meaningless retry attempts become a disaster, and the retry limit cannot be 
configured lower, because we cannot change the number of regions in our table.
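
To make the arithmetic concrete, here is a minimal sketch of the limit 
computation from the snippet above. The class and method names are 
hypothetical, and the sketch assumes that startEndKeys.getFirst().length 
equals the number of regions in the table. It only reproduces the Math.max 
step and does not call any HBase API.

```java
public class BulkLoadRetryCap {
    // Mirrors the Math.max line quoted above: the configured value acts
    // only as a lower bound, never as an upper bound.
    public static int effectiveMaxRetries(int configuredRetries, int regionCount) {
        return Math.max(configuredRetries, regionCount + 1);
    }

    public static void main(String[] args) {
        int regionCount = 2000; // region count of the table in this report
        // Try several values of hbase.bulkload.retries.number.
        for (int configured : new int[] {0, 1, 10}) {
            System.out.println("hbase.bulkload.retries.number=" + configured
                + " -> effective limit " + effectiveMaxRetries(configured, regionCount));
        }
    }
}
```

For a table with 2000 regions, every configured value in this loop yields an 
effective limit of 2001, so lowering hbase.bulkload.retries.number cannot stop 
the retries earlier.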

  was:
In https://issues.apache.org/jira/browse/HBASE-14541, the retry logic changed 
from a configurable retry count (set by the configuration item 
hbase.bulkload.retries.number) to the logic below, in order to handle region 
splits that happen during a bulk load:
{code:java}
int maxRetries = getConf().getInt("hbase.bulkload.retries.number", 10);
maxRetries = Math.max(maxRetries, startEndKeys.getFirst().length + 1);
if (maxRetries != 0 && count >= maxRetries) {
  throw new IOException("Retry attempted " + count +
    " times without completing, bailing out");
}
{code}
This change caused another problem in our cluster.

Our table has 2000 regions, and our bulk load failed because of a 
configuration issue (unrelated to this case, so ignore the failure reason). 
The bulk load then fell into a runaway retry loop, and after the retry count 
reached about 200, our HDFS crashed with an OOM.

During this time, no region splits ever happened on the HBase table.

I think the patch in HBASE-14541 did not handle the case where retrying can 
never succeed. In that case (and I think many causes can lead to it), the 
meaningless retry attempts become a disaster, and the retry limit cannot be 
configured lower, because we cannot change the number of regions in our table.

 


> BulkLoad Will Fall Into Unbelievable Retry Attempt in Some case
> ---------------------------------------------------------------
>
>                 Key: HBASE-24420
>                 URL: https://issues.apache.org/jira/browse/HBASE-24420
>             Project: HBase
>          Issue Type: Bug
>            Reporter: wuchang
>            Priority: Major
>
> In https://issues.apache.org/jira/browse/HBASE-14541, the retry logic changed 
> from a configurable retry count (set by the configuration item 
> hbase.bulkload.retries.number) to the logic below, in order to handle region 
> splits that happen during a bulk load:
> ```
> int maxRetries = getConf().getInt("hbase.bulkload.retries.number", 10);
> maxRetries = Math.max(maxRetries, startEndKeys.getFirst().length + 1);
> if (maxRetries != 0 && count >= maxRetries) {
>  throw new IOException("Retry attempted " + count +
>  " times without completing, bailing out");
> }
> ```
> This change caused another problem in our cluster.
> Our table has 2000 regions, and our bulk load failed because of a 
> configuration issue. The failure was caused by a client misconfiguration 
> that made the RegionServer fail to load the HFile. The failure is not thrown 
> to the client as an exception; instead it is reported through the `loaded` 
> field of `BulkLoadHFileResponse`:
> ```
> message BulkLoadHFileResponse {
>  required bool loaded = 1;
> }
> ```
> After this failure, the bulk load fell into a runaway retry loop, and after 
> the retry count reached about 200, our HDFS crashed with an OOM.
> During this time, no region splits ever happened on the HBase table.
> I think the patch in HBASE-14541 did not handle the case where retrying can 
> never succeed. In that case (and I think many causes can lead to it), the 
> meaningless retry attempts become a disaster, and the retry limit cannot be 
> configured lower, because we cannot change the number of regions in our 
> table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
