[
https://issues.apache.org/jira/browse/HDFS-6865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111085#comment-14111085
]
Todd Lipcon commented on HDFS-6865:
-----------------------------------
Thanks for doing the diligence on the performance tests. Looks like this will
be a good speedup across the board. A few comments:
- In the FSOutputSummer constructor, aren't checksumSize and maxChunkSize now
redundant with the DataChecksum object that's passed in? {{checksumSize}}
should be the same as {{sum.getChecksumSize()}} and {{maxChunkSize}} should be
the same as {{sum.getBytesPerChecksum()}}, no?
- Similarly, in the FSOutputSummer class, it seems like the member variables of
the same names are redundant with the {{sum}} member variable.
- Can you mark {{sum}} as {{final}} in FSOutputSummer?
- Shouldn't {{BUFFER_NUM_CHUNKS}} be a multiple of 3, since we calculate three
chunks' worth in parallel in the native code? (worth a comment explaining the
choice, too)
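To illustrate all three points together, the class could look roughly like this
(just a sketch, not the patch; the field layout and the constant value of 9 are
my assumptions):
{code}
abstract public class FSOutputSummer extends OutputStream {
  // multiple of 3 so the native code always gets three chunks to checksum in parallel
  private static final int BUFFER_NUM_CHUNKS = 9;

  private final DataChecksum sum;   // single source of truth for the sizes
  private final byte[] buf;         // data buffer, BUFFER_NUM_CHUNKS chunks long
  private final byte[] checksum;    // checksum buffer, one checksum per chunk

  protected FSOutputSummer(DataChecksum sum) {
    this.sum = sum;
    this.buf = new byte[sum.getBytesPerChecksum() * BUFFER_NUM_CHUNKS];
    this.checksum = new byte[sum.getChecksumSize() * BUFFER_NUM_CHUNKS];
  }
}
{code}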
----
{code}
private int write1(byte b[], int off, int len) throws IOException {
if(count==0 && len>=buf.length) {
// local buffer is empty and user data has one chunk
// checksum and output data
{code}
This comment is no longer accurate, right? The condition is now that the caller
has provided data at least as long as our internal buffer.
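i.e. something along the lines of:
{code}
if (count == 0 && len >= buf.length) {
  // local buffer is empty and the caller has supplied at least a full
  // buffer's worth of data: checksum and write it out directly
{code}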
----
- {{writeChecksumChunk}} should probably be renamed to {{writeChecksumChunks}}
and its javadoc updated.
- It's a little weird that you loop over {{writeChunk}} and pass a single chunk
per call, even though you actually have data ready for multiple chunks and the
API itself seems perfectly suited to passing all of the chunks at once. Did you
want to leave this as a later potential optimization?
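For illustration only (this assumes {{writeChunk}} is happy with a multi-chunk
span and that the checksums for those chunks are contiguous in {{checksum}};
I haven't verified either), the loop could collapse to something like:
{code}
// hypothetical: hand every complete chunk to writeChunk in a single call
int numChunks = len / maxChunkSize;
writeChunk(b, off, numChunks * maxChunkSize, checksum, 0,
    numChunks * checksumSize);
{code}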
----
{code}
writeChunk(b, off + i, Math.min(maxChunkSize, len - i), checksum,
i / maxChunkSize * checksumSize, checksumSize);
{code}
This code might be a little easier to read if you pulled out a couple of local
variables:
{code}
int rem = Math.min(maxChunkSize, len - i);
int ckOffset = i / maxChunkSize * checksumSize;
writeChunk(b, off + i, rem, checksum, ckOffset, checksumSize);
{code}
----
{code}
/* Forces any buffered output bytes to be checksumed and written out to
* the underlying output stream. If keep is true, then the state of
* this object remains intact.
{code}
This comment is now inaccurate. If {{keep}} is true, then it retains only the
last partial chunk's worth of buffered data.
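Maybe reword it to something like (wording is just a suggestion):
{code}
/* Forces any buffered output bytes to be checksummed and written out to
 * the underlying output stream. If keep is true, only the last partial
 * chunk worth of buffered data is retained in the buffer.
 */
{code}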
----
- The {{setNumChunksToBuffer}} static thing is kind of sketchy. What if,
instead, you implemented {{flush()}} in FSOutputSummer such that it always
flushes all completed chunks (but not any trailing partial chunk)? Then you
could make those tests call {{flush()}} before {{checkFile()}} and not have to
break any abstractions.
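A rough sketch of that idea (not from the patch; assumes the
{{buf}}/{{count}}/{{maxChunkSize}} fields and the {{writeChecksumChunks}}
helper discussed above):
{code}
@Override
public synchronized void flush() throws IOException {
  // write out every complete chunk; keep any trailing partial chunk buffered
  int completeLen = (count / maxChunkSize) * maxChunkSize;
  if (completeLen > 0) {
    writeChecksumChunks(buf, 0, completeLen);
    System.arraycopy(buf, completeLen, buf, 0, count - completeLen);
    count -= completeLen;
  }
}
{code}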
> Byte array native checksumming on client side (HDFS changes)
> ------------------------------------------------------------
>
> Key: HDFS-6865
> URL: https://issues.apache.org/jira/browse/HDFS-6865
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs-client, performance
> Reporter: James Thomas
> Assignee: James Thomas
> Attachments: HDFS-6865.2.patch, HDFS-6865.3.patch, HDFS-6865.4.patch,
> HDFS-6865.5.patch, HDFS-6865.patch
>
>
> Refactor FSOutputSummer to buffer data and use the native checksum
> calculation functionality introduced in HADOOP-10975.
--
This message was sent by Atlassian JIRA
(v6.2#6252)