shunping commented on issue #31040:
URL: https://github.com/apache/beam/issues/31040#issuecomment-2571382465

   > I agree with all of your proposals that replace "Data Loss" with 
"UnicodeDecodeError"
   
   Great! GCS decompressive transcoding is a bit unintuitive to Beam users 
here, and when it happens, we see data loss. I think it is more natural to 
expect users to specify GZIP or AUTO in those cases rather than UNCOMPRESSED, 
as shown in the proposal.
   
   > Basically if the user says it is GZIP and the data really is GZIP but we 
know that GCS is going to decode it then we do not do a redundant decode.
   
   Yep, that's the idea, but it is implemented differently in the proposed 
fix (#33384). 
   
   - Instead of trying to determine whether GCS is going to do decompressive 
transcoding, which is both unclear and inconvenient to verify, we can call the 
GCS client library and have it always return raw data without transcoding. 
   - At first, I was worried about the performance of this approach, since the 
gzip decoding would then happen on our side (Beam) rather than on the server, 
as described in 
https://cloud.google.com/storage/docs/transcoding.
   - However, after closely examining the GCS client library, I discovered that 
it actually **always** requests raw data from GCS. It then performs an extra 
decoding step itself 
(https://github.com/googleapis/google-resumable-media-python/blob/402feb7b38a972daad9bd3e26b80ddd0879bd53f/google/resumable_media/requests/download.py#L127)
 before returning the data to the caller (in this case Beam). It merely mimics 
the effects of server-side decompressive transcoding, but the actual decoding 
work is done on the client.
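
To make the points above concrete, here is a minimal sketch of "fetch raw, decode locally". It assumes the `raw_download` parameter of `Blob.download_as_bytes` in google-cloud-storage as one way to ask for untranscoded bytes; it is not taken from #33384. `read_object` and `FakeBlob` are hypothetical names, and `FakeBlob` is only an offline stand-in for an object stored with `Content-Encoding: gzip`.

```python
import gzip


def read_object(blob, compressed: bool) -> bytes:
    """Fetch the stored bytes untouched, then gunzip locally if requested."""
    # raw_download=True asks the client library to skip its own decoding step.
    data = blob.download_as_bytes(raw_download=True)
    return gzip.decompress(data) if compressed else data


class FakeBlob:
    """Hypothetical offline stand-in for google.cloud.storage.Blob.

    Assumes the object was uploaded with Content-Encoding: gzip.
    """

    def __init__(self, stored: bytes):
        self._stored = stored

    def download_as_bytes(self, raw_download: bool = False) -> bytes:
        # The stored (gzip) bytes are what gets fetched either way; without
        # raw_download, the decoding happens here, on the client side.
        return self._stored if raw_download else gzip.decompress(self._stored)


if __name__ == "__main__":
    blob = FakeBlob(gzip.compress(b"hello\n"))
    print(read_object(blob, compressed=True))
```

With a real `Blob`, the same `read_object` call would need credentials and network access, so only the stand-in is exercised here.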
   
   In my fix (#33384), I have the GCS client library skip its own decoding 
step and rely on our decoding mechanism in CompressedFile to process the 
file. It behaves exactly as proposed in the previous table.
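
For readers unfamiliar with client-side gzip decoding of a byte stream, the following is a rough sketch of the general technique using the standard-library `zlib` module. It is not Beam's CompressedFile code; `iter_gunzip` is a hypothetical helper, and it assumes a single-member gzip stream.

```python
import gzip
import zlib
from typing import Iterable, Iterator


def iter_gunzip(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Incrementally decode raw gzip chunks into uncompressed bytes."""
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)
    for chunk in chunks:
        out = decoder.decompress(chunk)
        if out:
            yield out
    # Emit anything still buffered once the input is exhausted.
    tail = decoder.flush()
    if tail:
        yield tail


if __name__ == "__main__":
    raw = gzip.compress(b"line\n" * 1000)
    # Simulate raw bytes arriving in small chunks from storage.
    pieces = [raw[i:i + 64] for i in range(0, len(raw), 64)]
    print(len(b"".join(iter_gunzip(pieces))))
```

Decoding chunk by chunk keeps memory bounded for large files, which is the usual reason to prefer it over decompressing the whole object at once.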
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
