[ 
https://issues.apache.org/jira/browse/HADOOP-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476235
 ] 

Doug Cutting commented on HADOOP-928:
-------------------------------------

> the reason that I set the inner buffer very small is to by-pass the inner 
> buffer and hence avoid one more data copy

Yes, that makes sense, thanks for clarifying.  But unless I missed something, 
in ChecksumFileSystem#create(Path, int bufferSize), the inner and outer buffers 
are both bufferSize.

Also, a competing concern is that data should not sit in buffers too long 
before it is checksummed.  Since we use many long-lived multi-megabyte buffers 
when sorting, this is a real concern.  So another strategy might be to use a 
small outer buffer and a large inner buffer, and accept the cost of the extra 
copy as negligible (or at least warranted).  That way data would be checksummed 
sooner, and memory corruption in the client could be more reliably detected.  
That was the strategy I assumed when I suggested using large inner buffers and 
small outer buffers.  It's probably worth benchmarking this at some point, 
although I'd rather not hold up this issue any longer.
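
The trade-off between the two layerings can be sketched with plain java.io 
streams.  This is only an illustration, not the code in the patch: 
BufferOrderSketch, ChecksummingStream and pendingBeforeFlush are hypothetical 
names, an in-memory sink stands in for the raw file, and CRC32 stands in for 
whatever checksum the file system uses.  It reports how many written bytes are 
still waiting, unchecksummed, in the outer buffer.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.CRC32;

public class BufferOrderSketch {

    /** Hypothetical helper: checksums and counts bytes as they pass through. */
    static class ChecksummingStream extends FilterOutputStream {
        final CRC32 crc = new CRC32();
        long bytesSummed = 0;

        ChecksummingStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(int b) throws IOException {
            crc.update(b);
            bytesSummed++;
            out.write(b);
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            crc.update(b, off, len);
            bytesSummed += len;
            out.write(b, off, len);
        }
    }

    /**
     * Writes data through: outer buffer -> checksum -> inner buffer -> sink.
     * Returns the number of bytes written by the caller that had not yet
     * reached the checksum stage just before the final flush.
     */
    static long pendingBeforeFlush(int outerSize, int innerSize,
                                   int chunks, int chunkSize) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ChecksummingStream sum =
            new ChecksummingStream(new BufferedOutputStream(sink, innerSize));
        OutputStream app = new BufferedOutputStream(sum, outerSize);

        byte[] chunk = new byte[chunkSize];
        for (int i = 0; i < chunks; i++) {
            app.write(chunk);
        }
        long pending = (long) chunks * chunkSize - sum.bytesSummed;
        app.close();  // flushes the outer buffer, then the inner one
        return pending;
    }

    public static void main(String[] args) throws IOException {
        // Small outer, large inner: data is checksummed soon after it is written.
        System.out.println(pendingBeforeFlush(512, 1 << 20, 40, 100));
        // Large outer, small inner: data waits in the outer buffer unchecksummed.
        System.out.println(pendingBeforeFlush(1 << 20, 512, 40, 100));
    }
}
```

With a small outer buffer, the unchecksummed backlog is bounded by the outer 
buffer size, at the price of copying each byte into the inner buffer as well.  
With a large outer buffer, everything written so far can sit unchecksummed 
until the buffer fills or is flushed.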

So can you please just check whether my analysis of 
ChecksumFileSystem#create(Path, int bufferSize) above is correct?  Thanks!

> make checksums optional per FileSystem
> --------------------------------------
>
>                 Key: HADOOP-928
>                 URL: https://issues.apache.org/jira/browse/HADOOP-928
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Doug Cutting
>         Assigned To: Hairong Kuang
>         Attachments: checksum.patch, checksum1.patch, checksum2.patch
>
>
> Checksumming is currently built into the base FileSystem class.  It should 
> instead be optional, with each FileSystem implementation electing whether to 
> use the Hadoop-provided checksum system, or to disable it, or to implement 
> its own custom checksum system.
> To implement this, a ChecksumFileSystem implementation can be provided that 
> wraps another FileSystem implementation, implementing checksums as in 
> Hadoop's current mandatory implementation (i.e., as a separate crc file per 
> file that's elided from directory listings).  The 'raw' FileSystem methods 
> would be removed.  FSDataInputStream and FSDataOutputStream would be made 
> interfaces.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
