shunping commented on issue #31040: URL: https://github.com/apache/beam/issues/31040#issuecomment-2571382465
> I agree with all of your proposals that replace "Data Loss" with "UnicodeDecodeError"

Great! GCS decompressive transcoding is a bit unintuitive to Beam users here, and when it happens, we see data loss. I think it is more natural to expect users to specify GZIP or AUTO in those cases rather than UNCOMPRESSED, as shown in the proposal.

> Basically if the user says it is GZIP and the data really is GZIP but we know that GCS is going to decode it then we do not do a redundant decode.

Yep, that's the idea, but it is implemented differently in the proposed fix (#33384).

- Instead of trying to determine whether GCS is going to do decompressive transcoding, which is both unclear and inconvenient to verify, we can call the GCS client library and let it always return raw data without transcoding.
- At first, I was worried about the performance of this approach, since the gzip decoding would then happen on our side (Beam) rather than on the server side as described in https://cloud.google.com/storage/docs/transcoding.
- However, after closely examining the GCS client library, I discovered that it actually **always** requests raw data from GCS. It then adds an extra decoding step of its own (https://github.com/googleapis/google-resumable-media-python/blob/402feb7b38a972daad9bd3e26b80ddd0879bd53f/google/resumable_media/requests/download.py#L127) before returning the data to the caller (in this case Beam). It merely mimics the effects of server-side decompressive transcoding; the actual decoding work is done on the client.

In my fix (#33384), I let the GCS client library skip its own decoding step and rely on our own decoding mechanism in CompressedFile to process the file. It works exactly as proposed in the previous table.
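As a rough illustration of the idea (this is not the code in #33384), the sketch below uses only the standard library to show what "always take raw bytes, then decode on our side" looks like. The helper `decode_raw_object` and the magic-byte check for `"auto"` are hypothetical stand-ins; Beam's own AUTO detection and `CompressedFile` work differently. The raw bytes are simulated with `gzip.compress`; with the real google-cloud-storage client they would come from something like `blob.download_as_bytes(raw_download=True)`, which is my assumption about the relevant call rather than a statement of what the fix uses.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_raw_object(raw: bytes, compression: str = "auto") -> bytes:
    """Decode raw (untranscoded) object bytes on the client side.

    Hypothetical helper for illustration only; it is not a Beam API.
    """
    if compression == "auto":
        # Assumption: sniff the gzip magic number instead of using the file name.
        compression = "gzip" if raw.startswith(GZIP_MAGIC) else "uncompressed"
    if compression == "gzip":
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    if compression == "uncompressed":
        # Bytes are passed through untouched; nothing is decoded behind our back.
        return raw
    raise ValueError(f"unsupported compression: {compression!r}")


if __name__ == "__main__":
    # Simulate an object stored gzip-encoded and fetched without transcoding.
    stored = gzip.compress("hello, beam\n".encode("utf-8"))

    print(decode_raw_object(stored, "gzip").decode("utf-8"), end="")
    print(decode_raw_object(stored, "auto").decode("utf-8"), end="")

    # Reading the same bytes as UNCOMPRESSED text fails loudly rather than
    # silently dropping data.
    try:
        decode_raw_object(stored, "uncompressed").decode("utf-8")
    except UnicodeDecodeError as err:
        print("UnicodeDecodeError:", err.reason)
```

Because the bytes handed back are always exactly what is stored, the reader never has to guess whether transcoding already happened, and a mismatched compression setting surfaces as a decode error.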
