[ 
https://issues.apache.org/jira/browse/HBASE-22539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856766#comment-16856766
 ] 

Wellington Chevreuil commented on HBASE-22539:
----------------------------------------------

Thanks everyone for all the insights. Addressing comments below:

[~openinx]
{quote}It's strange that ByteBufferUtils.copyFromBufferToArray would made the 
difference.
{quote}
My suspicion here is that *ByteBufferUtils.copyFromBufferToArray* would rely on 
*Unsafe.copyMemory*.
{quote}Still got the corruption after applied HBASE-22496 in your clusters, you 
mean ? In theory, HBASE-22496 was a risk for corruption..
{quote}
We had not tested this change yet, but as far I can see, it would just produce 
additional unnecessary iterative copies if the total amount of bytes to be 
copied is greater than *UnsafeAccess.UNSAFE_COPY_THRESHOLD* (which is currently 
hardcoded at 1MB). In the context of how *ByteBufferWriterOutputStream.write* 
calls *ByteBufferUtils.copyFromBufferToArray*, there would only be 1 iteration 
anyway, as the total bytes to be copied is capped by 
*ByteBufferWriterOutputStream.DEFAULT_BUFFER_SIZE* (4k).

 

[~ram_krish]
{quote}Have you verified that all the calls that come to the 
ByteBufferWriterStream#write() has len which is always less than the buffSize?
 Because if there is something wrong there - then the sanity code that you have 
where you directly read from the ByteBuffer 'b' to testBuf will work fine but 
not the other one.
{quote}
I guess you mean [this 
copy|https://github.com/wchevreuil/hbase/blob/HBASE-22539/hbase-common/src/main/java/org/apache/hadoop/hbase/io/ByteBufferWriterOutputStream.java#L96]
 would mess up the byte[], as I'm (mistakenly) passing 0 as source offset. So 
_this.bufSize_ is set by *ByteBufferWriterOutputStream.DEFAULT_BUFFER_SIZE* 
(4096). In the error mentioned on the original jira description, the given 
array length was 2098.

 

[~Apache9]
  
{quote}And in fact, a DBB.get is also an Unsafe.copyMemory, you can see the 
implementation in DirectByteBuffer.get, it just calls Bits.copyToArray, which 
is almost the same with our ByteBufferUtils.copyFromBufferToArray...
{quote}

 Well pointed! Maybe the _DBB.get_ call succeeded by chance, in this occasion? 
I'm not aware of the internals from *Unsafe.copyMemory* implementation, sounds 
unlikely for such method not being thread safe, but could it be due to race 
conditions? I noticed some compactions running around that time, and making 
calls to *Unsafe.copyMemory*. 

Also, should we just use _DBB.get_, instead of _UnsafeAccess.unsafeCopy_?

 

> Potential WAL corruption due to Unsafe.copyMemory usage when DBB are in place
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-22539
>                 URL: https://issues.apache.org/jira/browse/HBASE-22539
>             Project: HBase
>          Issue Type: Bug
>          Components: rpc, wal
>    Affects Versions: 2.1.1
>            Reporter: Wellington Chevreuil
>            Priority: Blocker
>
> Summary
> We had been chasing a WAL corruption issue reported on one of our customers 
> deployments running release 2.1.1 (CDH 6.1.0). After providing a custom 
> modified jar with the extra sanity checks implemented by HBASE-21401 applied 
> on some code points, plus additional debugging messages, we believe it is 
> related to DirectByteBuffer usage, and Unsafe copy from offheap memory to 
> on-heap array triggered 
> [here|https://github.com/apache/hbase/blob/branch-2.1/hbase-common/src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java#L1157],
>  such as when writing into a non ByteBufferWriter type, as done 
> [here|https://github.com/apache/hbase/blob/branch-2.1/hbase-common/src/main/java/org/apache/hadoop/hbase/io/ByteBufferWriterOutputStream.java#L84].
> More details on the following comment.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to