[
https://issues.apache.org/jira/browse/HBASE-22539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856766#comment-16856766
]
Wellington Chevreuil commented on HBASE-22539:
----------------------------------------------
Thanks everyone for all the insights. Addressing comments below:
[~openinx]
{quote}It's strange that ByteBufferUtils.copyFromBufferToArray would made the
difference.
{quote}
My suspicion here is that *ByteBufferUtils.copyFromBufferToArray* would rely on
*Unsafe.copyMemory*.
{quote}Still got the corruption after applied HBASE-22496 in your clusters, you
mean ? In theory, HBASE-22496 was a risk for corruption..
{quote}
We had not tested this change yet, but as far I can see, it would just produce
additional unnecessary iterative copies if the total amount of bytes to be
copied is greater than *UnsafeAccess.UNSAFE_COPY_THRESHOLD* (which is currently
hardcoded at 1MB). In the context of how *ByteBufferWriterOutputStream.write*
calls *ByteBufferUtils.copyFromBufferToArray*, there would only be 1 iteration
anyway, as the total bytes to be copied is capped by
*ByteBufferWriterOutputStream.DEFAULT_BUFFER_SIZE* (4k).
[~ram_krish]
{quote}Have you verified that all the calls that come to the
ByteBufferWriterStream#write() has len which is always less than the buffSize?
Because if there is something wrong there - then the sanity code that you have
where you directly read from the ByteBuffer 'b' to testBuf will work fine but
not the other one.
{quote}
I guess you mean [this
copy|https://github.com/wchevreuil/hbase/blob/HBASE-22539/hbase-common/src/main/java/org/apache/hadoop/hbase/io/ByteBufferWriterOutputStream.java#L96]
would mess up the byte[], as I'm (mistakenly) passing 0 as source offset. So
_this.bufSize_ is set by *ByteBufferWriterOutputStream.DEFAULT_BUFFER_SIZE*
(4096). In the error mentioned on the original jira description, the given
array length was 2098.
[~Apache9]
{quote}And in fact, a DBB.get is also an Unsafe.copyMemory, you can see the
implementation in DirectByteBuffer.get, it just calls Bits.copyToArray, which
is almost the same with our ByteBufferUtils.copyFromBufferToArray...
{quote}
Well pointed! Maybe the _DBB.get_ call succeeded by chance, in this occasion?
I'm not aware of the internals from *Unsafe.copyMemory* implementation, sounds
unlikely for such method not being thread safe, but could it be due to race
conditions? I noticed some compactions running around that time, and making
calls to *Unsafe.copyMemory*.
Also, should we just use _DBB.get_, instead of _UnsafeAccess.unsafeCopy_?
> Potential WAL corruption due to Unsafe.copyMemory usage when DBB are in place
> -----------------------------------------------------------------------------
>
> Key: HBASE-22539
> URL: https://issues.apache.org/jira/browse/HBASE-22539
> Project: HBase
> Issue Type: Bug
> Components: rpc, wal
> Affects Versions: 2.1.1
> Reporter: Wellington Chevreuil
> Priority: Blocker
>
> Summary
> We had been chasing a WAL corruption issue reported on one of our customers
> deployments running release 2.1.1 (CDH 6.1.0). After providing a custom
> modified jar with the extra sanity checks implemented by HBASE-21401 applied
> on some code points, plus additional debugging messages, we believe it is
> related to DirectByteBuffer usage, and Unsafe copy from offheap memory to
> on-heap array triggered
> [here|https://github.com/apache/hbase/blob/branch-2.1/hbase-common/src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java#L1157],
> such as when writing into a non ByteBufferWriter type, as done
> [here|https://github.com/apache/hbase/blob/branch-2.1/hbase-common/src/main/java/org/apache/hadoop/hbase/io/ByteBufferWriterOutputStream.java#L84].
> More details on the following comment.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)