[
https://issues.apache.org/jira/browse/HDFS-6865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111085#comment-14111085
]
Todd Lipcon commented on HDFS-6865:
-----------------------------------
Thanks for doing the diligence on the performance tests. Looks like this will
be a good speedup across the board. A few comments:
- In the FSOutputSummer constructor, aren't checksumSize and maxChunkSize now
redundant with the DataChecksum object that's passed in? {{checksumSize}}
should be the same as {{sum.getChecksumSize()}} and {{maxChunkSize}} should be
the same as {{sum.getBytesPerChecksum()}}, no?
- Similarly, in the FSOutputSummer class, it seems like the member variables of
the same names are redundant with the {{sum}} member variable.
- Can you mark {{sum}} as {{final}} in FSOutputSummer?
- Shouldn't {{BUFFER_NUM_CHUNKS}} be a multiple of 3, since we calculate three
chunks' worth in parallel in the native code? (worth a comment explaining the
choice, too)
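To illustrate all three points together, the class could look roughly like this
(just a sketch, not the patch; the field layout and the constant value of 9 are
my assumptions):
{code}
abstract public class FSOutputSummer extends OutputStream {
  // multiple of 3 so the native code always gets three chunks to checksum in parallel
  private static final int BUFFER_NUM_CHUNKS = 9;

  private final DataChecksum sum;   // single source of truth for the sizes
  private final byte[] buf;         // data buffer, BUFFER_NUM_CHUNKS chunks long
  private final byte[] checksum;    // checksum buffer, one checksum per chunk

  protected FSOutputSummer(DataChecksum sum) {
    this.sum = sum;
    this.buf = new byte[sum.getBytesPerChecksum() * BUFFER_NUM_CHUNKS];
    this.checksum = new byte[sum.getChecksumSize() * BUFFER_NUM_CHUNKS];
  }
}
{code}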
----
{code}
private int write1(byte b[], int off, int len) throws IOException {
if(count==0 && len>=buf.length) {
// local buffer is empty and user data has one chunk
// checksum and output data
{code}
This comment is no longer accurate, right? The condition is now that the caller
has provided data at least as long as our internal buffer.
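i.e. something along the lines of:
{code}
if (count == 0 && len >= buf.length) {
  // local buffer is empty and the caller has supplied at least a full
  // buffer's worth of data: checksum and write it out directly
{code}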
----
- {{writeChecksumChunk}} should probably be renamed to {{writeChecksumChunks}}
and its javadoc updated.
- It's a little weird that you loop over {{writeChunk}} and pass a single chunk
per call, even though you actually have data ready for multiple chunks and the
API itself seems perfectly suited to passing all of the chunks at once. Did you
want to leave this as a later potential optimization?
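For illustration only (this assumes {{writeChunk}} is happy with a multi-chunk
span and that the checksums for those chunks are contiguous in {{checksum}};
I haven't verified either), the loop could collapse to something like:
{code}
// hypothetical: hand every complete chunk to writeChunk in a single call
int numChunks = len / maxChunkSize;
writeChunk(b, off, numChunks * maxChunkSize, checksum, 0,
    numChunks * checksumSize);
{code}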
----
{code}
writeChunk(b, off + i, Math.min(maxChunkSize, len - i), checksum,
i / maxChunkSize * checksumSize, checksumSize);
{code}
This code might be a little easier to read if you pulled out a couple of local
variables:
{code}
int rem = Math.min(maxChunkSize, len - i);
int ckOffset = i / maxChunkSize * checksumSize;
writeChunk(b, off + i, rem, checksum, ckOffset, checksumSize);
{code}
----
{code}
/* Forces any buffered output bytes to be checksumed and written out to
* the underlying output stream. If keep is true, then the state of
* this object remains intact.
{code}
This comment is now inaccurate. If {{keep}} is true, then it retains only the
last partial chunk's worth of buffered data.
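Maybe reword it to something like (wording is just a suggestion):
{code}
/* Forces any buffered output bytes to be checksummed and written out to
 * the underlying output stream. If keep is true, only the last partial
 * chunk worth of buffered data is retained in the buffer.
 */
{code}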
----
- The {{setNumChunksToBuffer}} static thing is kind of sketchy. What if,
instead, you implemented {{flush()}} in FSOutputSummer such that it always
flushes all completed chunks (but not any trailing partial chunk)? Then you
could make those tests call {{flush()}} before {{checkFile()}} and not have to
break any abstractions.
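A rough sketch of that idea (not from the patch; assumes the
{{buf}}/{{count}}/{{maxChunkSize}} fields and the {{writeChecksumChunks}}
helper discussed above):
{code}
@Override
public synchronized void flush() throws IOException {
  // write out every complete chunk; keep any trailing partial chunk buffered
  int completeLen = (count / maxChunkSize) * maxChunkSize;
  if (completeLen > 0) {
    writeChecksumChunks(buf, 0, completeLen);
    System.arraycopy(buf, completeLen, buf, 0, count - completeLen);
    count -= completeLen;
  }
}
{code}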
> Byte array native checksumming on client side (HDFS changes)
> ------------------------------------------------------------
>
> Key: HDFS-6865
> URL: https://issues.apache.org/jira/browse/HDFS-6865
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs-client, performance
> Reporter: James Thomas
> Assignee: James Thomas
> Attachments: HDFS-6865.2.patch, HDFS-6865.3.patch, HDFS-6865.4.patch,
> HDFS-6865.5.patch, HDFS-6865.patch
>
>
> Refactor FSOutputSummer to buffer data and use the native checksum
> calculation functionality introduced in HADOOP-10975.
--
This message was sent by Atlassian JIRA
(v6.2#6252)