[ https://issues.apache.org/jira/browse/HBASE-13329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380350#comment-14380350 ]

Ruben Aguiar commented on HBASE-13329:
--------------------------------------

It seems this is not an error caused by the data itself, but is probably related 
to the metrics/tags in OpenTSDB. After changing the script to send random data 
between 1 and 10, and even upping the region server itself (I ran into some 
out-of-heap-space errors), the error still persists, probably because I'm still 
writing to the same metric. Note, however, that this time I've noticed some 
further behaviours. First of all, the flush of the region kept reporting as 
running without ever timing out or finishing. While this was happening, a file in 
/hbase/data/default/tsdb/317da7fabf9ea9b15de80377bb792cd8/.tmp kept increasing 
in size until I killed the region server. On a stop-hbase command, however, the 
logs repeatedly showed this error:

2015-03-25 17:22:55,731 INFO  [regionserver//10.2.0.73:16020] regionserver.HRegionServer: Received CLOSE for the region: 317da7fabf9ea9b15de80377bb792cd8, which we are already trying to CLOSE, but not completed yet
2015-03-25 17:22:55,731 WARN  [regionserver//10.2.0.73:16020] regionserver.HRegionServer: Failed to close tsdb,,1427300108453.317da7fabf9ea9b15de80377bb792cd8. - ignoring and continuing
org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: The region 317da7fabf9ea9b15de80377bb792cd8 was already closing. New CLOSE request is ignored.
        at org.apache.hadoop.hbase.regionserver.HRegionServer.closeRegion(HRegionServer.java:2635)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.closeRegionIgnoreErrors(HRegionServer.java:2561)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.closeUserRegions(HRegionServer.java:2219)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:869)
        at java.lang.Thread.run(Thread.java:745)
 
I then issued a kill command on the region server and restarted the cluster. It 
behaved normally, and even deleted the huge tmp file previously generated, until 
after a while (2-3 minutes) this error appeared in the region server logs:

2015-03-25 17:31:04,752 ERROR [regionserver//10.2.0.73:16020-shortCompactions-1427304663929] regionserver.CompactSplitThread: Compaction failed Request = regionName=tsdb,,1427300108453.317da7fabf9ea9b15de80377bb792cd8., storeName=t, fileCount=6, fileSize=228.1 M (107.9 M, 11.9 M, 30.8 M, 25.0 M, 44.2 M, 8.5 M), priority=4, time=9704521299920
java.lang.ArrayIndexOutOfBoundsException: -12190
        at org.apache.hadoop.hbase.CellComparator.getMinimumMidpointArray(CellComparator.java:478)
        at org.apache.hadoop.hbase.CellComparator.getMidpoint(CellComparator.java:448)
        at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.finishBlock(HFileWriterV2.java:165)
        at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.checkBlockBoundary(HFileWriterV2.java:146)
        at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:263)
        at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
        at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:932)
        at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:254)
        at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:102)
        at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:110)
        at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1167)
        at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1610)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:511)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

This made the region server crash after a while (no writes were occurring at 
the time):

2015-03-25 17:35:42,379 WARN  [RS_CLOSE_REGION-10.2.0.73:16020-2] regionserver.DefaultMemStore: Snapshot called again without clearing previous. Doing nothing. Another ongoing flush or did we fail last attempt?
2015-03-25 17:35:42,379 WARN  [B.defaultRpcServer.handler=9,queue=0,port=16020] ipc.RpcServer: (responseTooSlow): {"processingtimems":23059,"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)","client":"10.2.0.74:38861","starttimems":1427304919320,"queuetimems":0,"class":"HRegionServer","responsesize":95,"method":"Scan"}
2015-03-25 17:35:42,701 FATAL [RS_CLOSE_REGION-10.2.0.73:16020-2] regionserver.HRegionServer: ABORTING region server 10.2.0.73,16020,1427304618843: Unrecoverable exception while closing region tsdb,,1427300108453.317da7fabf9ea9b15de80377bb792cd8., still finishing close
org.apache.hadoop.hbase.DroppedSnapshotException: region: tsdb,,1427300108453.317da7fabf9ea9b15de80377bb792cd8.
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1999)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1770)
        at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1266)
        at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1198)
        at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException

I'm still very new to the whole HBase environment, but it seems to me the 
problem is the same as the one I reported previously, only this time it happened 
during a compaction. The first time, when the server appeared unresponsive, the 
same error probably occurred while flushing, and the flush kept retrying 
(probably generating and growing that tmp file). After the kill, the error 
happened during a compaction. If I can help in any way to debug this problem, 
please let me know. Any help is much appreciated.
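
For what it's worth, here is a minimal Java sketch of the overflow I suspect (my 
own illustration under that assumption, not the actual CellComparator code): a 
short loop index that keeps being incremented while two long, identical byte 
arrays match wraps past Short.MAX_VALUE into negative values, and the next array 
access then fails with a negative index, much like the -32743 and -12190 in the 
traces above.

public class ShortIndexOverflowDemo {

    // Hypothetical stand-in for the midpoint search: walk two byte arrays while
    // their contents are equal, using a short as the loop index (what the report
    // suspects CellComparator.getMinimumMidpointArray effectively does in 1.0.1).
    static int commonPrefixLength(byte[] left, byte[] right) {
        short idx = 0;
        // After idx reaches 32767 (Short.MAX_VALUE), idx++ wraps to -32768, so
        // the next left[idx] access throws ArrayIndexOutOfBoundsException with a
        // negative index.
        while (idx < left.length && idx < right.length && left[idx] == right[idx]) {
            idx++;
        }
        return idx;
    }

    public static void main(String[] args) {
        // Two identical arrays longer than 32767 bytes, mimicking a run of cells
        // that all carry exactly the same bytes.
        byte[] a = new byte[40000];
        byte[] b = new byte[40000];
        System.out.println(commonPrefixLength(a, b)); // throws AIOOBE: -32768
    }
}

If that reading of the code is right, declaring the index as an int (as suggested 
in the issue description) should be enough to avoid the wrap.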


> Memstore flush fails if data has always the same value, breaking the region
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-13329
>                 URL: https://issues.apache.org/jira/browse/HBASE-13329
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 1.0.1
>         Environment: linux-debian-jessie
> ec2 - t2.micro instances
>            Reporter: Ruben Aguiar
>            Priority: Critical
>             Fix For: 2.0.0, 1.1.0
>
>
> While trying to benchmark my opentsdb cluster, I've created a script that 
> always sends the same value to hbase (in this case 1). After a few minutes, 
> the whole region server crashes and the region itself becomes impossible to 
> open again (cannot assign or unassign). After some investigation, what I saw 
> in the logs is that when a Memstore flush is called on a large region (128mb) 
> the process errors out, killing the region server. On restart, replaying the 
> edits generates the same error, making the region unavailable. I tried to 
> manually unassign, assign or close_region; that didn't work because the code 
> that reads/replays the edits crashes.
> From my investigation this seems to be an overflow issue. The logs show that 
> the function getMinimumMidpointArray tried to access index -32743 of an 
> array, extremely close to the minimum short value in Java. Looking at the 
> source code, it seems a short is used as the index and is incremented for as 
> long as the two vectors are equal, probably making it overflow on large 
> vectors with identical data. Changing it to an int should solve the problem.
> Here follow the hadoop logs from when the region server went down. Any help 
> is appreciated. If you need any other information, please do tell me:
> 2015-03-24 18:00:56,187 INFO  [regionserver//10.2.0.73:16020.logRoller] wal.FSHLog: Rolled WAL /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220018516 with entries=143, filesize=134.70 MB; new WAL /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220056140
> 2015-03-24 18:00:56,188 INFO  [regionserver//10.2.0.73:16020.logRoller] wal.FSHLog: Archiving hdfs://10.2.0.74:8020/hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427219987709 to hdfs://10.2.0.74:8020/hbase/oldWALs/10.2.0.73%2C16020%2C1427216382590.default.1427219987709
> 2015-03-24 18:04:35,722 INFO  [MemStoreFlusher.0] regionserver.HRegion: Started memstore flush for tsdb,,1427133969325.52bc1994da0fea97563a4a656a58bec2., current region memstore size 128.04 MB
> 2015-03-24 18:04:36,154 FATAL [MemStoreFlusher.0] regionserver.HRegionServer: ABORTING region server 10.2.0.73,16020,1427216382590: Replay of WAL required. Forcing server shutdown
> org.apache.hadoop.hbase.DroppedSnapshotException: region: tsdb,,1427133969325.52bc1994da0fea97563a4a656a58bec2.
>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1999)
>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1770)
>       at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1702)
>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:445)
>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:407)
>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:69)
>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:225)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -32743
>       at org.apache.hadoop.hbase.CellComparator.getMinimumMidpointArray(CellComparator.java:478)
>       at org.apache.hadoop.hbase.CellComparator.getMidpoint(CellComparator.java:448)
>       at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.finishBlock(HFileWriterV2.java:165)
>       at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.checkBlockBoundary(HFileWriterV2.java:146)
>       at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:263)
>       at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
>       at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:932)
>       at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:121)
>       at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:71)
>       at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:879)
>       at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2128)
>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1953)
>       ... 7 more
> 2015-03-24 18:04:36,156 FATAL [MemStoreFlusher.0] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]


