[ https://issues.apache.org/jira/browse/HADOOP-8060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235954#comment-13235954 ]
Kihwal Lee commented on HADOOP-8060:
------------------------------------
Sorry for the delay on this. I will get a set of initial patches up soon. But
here is one design decision I have to make and would appreciate any input on
this.
We need a way to specify the checksum type for create(). Currently the checksum
type used for creating a DFSOutputStream is set based on dfs.checksum.type when
a DFSClient object is created. If there is no file system cache, users can
dictate the checksum type for a new file by setting dfs.checksum.type
accordingly. This does not work when the file system cache is on, for the
following reasons:
- A DFSClient instance can be shared by many threads, so changing the shared
class variable can result in unpredictable behavior.
- The FileSystem cache is keyed only on the scheme/authority and UGI. The
DFSClient object that was created by a DFS instance in the cache will retain
the conf that was used to instantiate it. If the same UGI is used, this
DFSClient will be used by all threads that access the same HDFS cluster. In
this case the threads cannot even change the behavior of DFSClient by changing
conf settings, even if we modify DFSClient so that it reads dfs.checksum.type
dynamically during create().
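
To make the second point concrete, here is a minimal, self-contained model of
the caching behavior described above. It does not use the Hadoop classes;
FsCacheSketch, Client and get() are hypothetical stand-ins, and the conf is
modeled as a plain Map. The only things taken from the description are the
cache key (scheme/authority plus user) and the fact that the client captures
dfs.checksum.type when it is constructed.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FsCacheSketch {
    /** Hypothetical stand-in for a client that reads the checksum type once, at construction. */
    static final class Client {
        final String checksumType;

        Client(Map<String, String> conf) {
            // Assumed default for this sketch only.
            this.checksumType = conf.getOrDefault("dfs.checksum.type", "CRC32C");
        }
    }

    // The key is (scheme, authority, user); the conf is NOT part of the key.
    private static final Map<List<String>, Client> CACHE = new HashMap<>();

    static synchronized Client get(String scheme, String authority, String user,
                                   Map<String, String> conf) {
        List<String> key = Arrays.asList(scheme, authority, user);
        // The conf is consulted only when no cached client exists for this key.
        return CACHE.computeIfAbsent(key, k -> new Client(conf));
    }

    public static void main(String[] args) {
        Map<String, String> confA = new HashMap<>();
        confA.put("dfs.checksum.type", "CRC32C");
        Map<String, String> confB = new HashMap<>();
        confB.put("dfs.checksum.type", "CRC32");

        Client first = get("hdfs", "nn1:8020", "alice", confA);
        Client second = get("hdfs", "nn1:8020", "alice", confB);

        System.out.println(first == second);      // same cached client
        System.out.println(second.checksumType);  // still the type from confA
    }
}
```

In this model the second caller asks for CRC32 but gets the cached client that
was built with CRC32C, which is the situation described for threads sharing a
UGI.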
Turning the cache off is not an option because of potential resource exhaustion
issues in various parts of the system.
So far, this is the only approach I have come up with that does not involve a
FileSystem API change: add checksum types to CreateFlag. The types are already
defined in DataChecksum, so the changes are contained in common. I was
initially very reluctant about this because I was comparing the flags to POSIX
open flags. But it seems less objectionable once I realized that the CreateFlag
used for create() is nothing like the POSIX one. :)
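
A rough sketch of what this could look like from the caller's side, to help
the discussion. Everything here is illustrative: SketchCreateFlag, the
CHECKSUM_* constants, ChecksumType and resolve() are hypothetical names, not
the existing CreateFlag or DataChecksum definitions, and the eventual patch may
look quite different. The idea is that the checksum type travels with each
create() call instead of living in the shared client.

```java
import java.util.EnumSet;

public class ChecksumFlagSketch {
    /** Hypothetical stand-in for the checksum types defined in DataChecksum. */
    enum ChecksumType { CRC32, CRC32C }

    /** Hypothetical flag set: ordinary create flags plus per-call checksum flags. */
    enum SketchCreateFlag {
        CREATE, OVERWRITE, APPEND,
        CHECKSUM_CRC32, CHECKSUM_CRC32C
    }

    /** Picks the checksum type from the flags, falling back to the client default. */
    static ChecksumType resolve(EnumSet<SketchCreateFlag> flags, ChecksumType clientDefault) {
        boolean crc32 = flags.contains(SketchCreateFlag.CHECKSUM_CRC32);
        boolean crc32c = flags.contains(SketchCreateFlag.CHECKSUM_CRC32C);
        if (crc32 && crc32c) {
            throw new IllegalArgumentException("conflicting checksum flags");
        }
        if (crc32) {
            return ChecksumType.CRC32;
        }
        if (crc32c) {
            return ChecksumType.CRC32C;
        }
        return clientDefault;
    }

    public static void main(String[] args) {
        // The caller asks for CRC32 on this file even though the shared client defaults to CRC32C.
        EnumSet<SketchCreateFlag> flags =
            EnumSet.of(SketchCreateFlag.CREATE, SketchCreateFlag.CHECKSUM_CRC32);
        System.out.println(resolve(flags, ChecksumType.CRC32C));

        // With no checksum flag, the client default is used.
        System.out.println(resolve(EnumSet.of(SketchCreateFlag.CREATE), ChecksumType.CRC32C));
    }
}
```

Because the flags are passed per call, two threads sharing one cached client
can request different checksum types without touching shared state.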
If I don't hear any other suggestions, I will prepare a set of patches based on
this. There will be sub-tasks and a separate blocking jira.
> Add a capability to use of consistent checksums for append and copy
> -------------------------------------------------------------------
>
> Key: HADOOP-8060
> URL: https://issues.apache.org/jira/browse/HADOOP-8060
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs, util
> Affects Versions: 0.23.0, 0.24.0, 0.23.1
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Fix For: 0.24.0, 0.23.2
>
>
> After the improved CRC32C checksum feature became the default, some use cases
> involving data movement are no longer supported. For example, when running
> DistCp to copy a file stored with the CRC32 checksum to a new cluster with
> CRC32C set as the default checksum, the final data integrity check fails
> because of a checksum mismatch.