cozos commented on issue #24303: URL: https://github.com/apache/beam/issues/24303#issuecomment-1336700212
Something that caught my attention here is the `blobstorageio.py` implementation of `checksum`, which uses the `etag` of the file: https://github.com/apache/beam/blob/d15913b289158bea685197db7752409886217441/sdks/python/apache_beam/io/azure/blobstorageio.py#L379

The purpose of the `checksum` method, as I understand it, is: "when committing a file from the temporary path to the final path, check if a file already exists in the final path. If it already exists, compare the checksum of the temporary file to the existing file in the final path, and if they are the same then another worker has already written that data and we are good to go. If they are different then something has gone wrong (implies possible data loss) so we should abort".

However, the `etag` says nothing about the contents of the file. From https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob-properties#response-headers:

```
The ETag contains a value that you can use to perform operations conditionally.
```

(i.e. for optimistic concurrency)

It seems to me like the `Content-MD5` header is what we actually want for the `checksum()` implementation, as [it is](https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.storage.blob.blobproperties.contentmd5?view=azure-dotnet):

```
A string containing the blob's content-MD5 hash.
```

It is unclear if the ETag is actually a digest of the blob data.
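To make the concern concrete, here is a minimal sketch using an in-memory stand-in rather than the Azure SDK or Beam's actual code. `FakeBlob`, `write_blob`, and `safe_to_skip_commit` are hypothetical names, and modelling the ETag as a random per-write token is an assumption made purely for illustration (the real ETag format is opaque).

```python
import base64
import hashlib
import uuid
from dataclasses import dataclass


@dataclass
class FakeBlob:
    """Hypothetical in-memory stand-in for a stored blob (not an SDK type)."""
    data: bytes
    etag: str         # assumed: opaque token regenerated on every write
    content_md5: str  # base64-encoded MD5 digest of the blob data


def write_blob(data: bytes) -> FakeBlob:
    """Simulate a write: new ETag per write, MD5 derived from the content."""
    digest = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
    return FakeBlob(data=data, etag=uuid.uuid4().hex, content_md5=digest)


def safe_to_skip_commit(temp: FakeBlob, final: FakeBlob, checksum) -> bool:
    """Return True if the existing final file matches the temp file
    according to the given checksum function."""
    return checksum(temp) == checksum(final)


def by_etag(blob: FakeBlob) -> str:
    return blob.etag


def by_md5(blob: FakeBlob) -> str:
    return blob.content_md5


# Two workers write identical data; a third file holds different data.
temp = write_blob(b"shard-0 records")
final_same = write_blob(b"shard-0 records")
final_other = write_blob(b"different records")

print(safe_to_skip_commit(temp, final_same, by_md5))    # identical content matches
print(safe_to_skip_commit(temp, final_same, by_etag))   # identical content does not match
print(safe_to_skip_commit(temp, final_other, by_md5))   # different content does not match
```

Under that assumption, an ETag-based `checksum` would report a mismatch even when another worker wrote byte-identical data, whereas a content digest would not.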
