pan3793 commented on PR #51182: URL: https://github.com/apache/spark/pull/51182#issuecomment-2986934134
OK, it's fair enough to fork Hadoop's `LineRecordReader` and use try-catch logic for codec fallback, given your flexible, extensible design.

> There will be some performance penalty for examining the magic number and reopening the file input stream with appropriate codec.

Having a wrapper codec that looks ahead at the magic number and hands the InputStream over to the concrete codec should eliminate the reopen cost. Anyway, this is an implementation detail and can be discussed later.

> For compatibility check, I have added some more tests that reads files compressed with ZSTD in ubuntu (version: 1.4.4+dfsg-3ubuntu0.1).

I think you should at least test reading a zstd text file written by Hadoop.
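
To make the look-ahead idea concrete, here is a rough sketch using only the JDK. It is not code from this PR or from Hadoop: `MagicSniffingStreams`, `sniff`, and `open` are hypothetical names, and gzip stands in for the concrete codec because the JDK ships no zstd decoder. The stream is wrapped in a `PushbackInputStream`, the first bytes are read and pushed back, and the same stream is then handed to the matching decoder, so the file is never reopened.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper for illustration; not part of Hadoop or Spark. */
class MagicSniffingStreams {
    // Zstandard frame magic number 0xFD2FB528, stored little-endian on disk.
    static final byte[] ZSTD_MAGIC = {(byte) 0x28, (byte) 0xB5, (byte) 0x2F, (byte) 0xFD};
    // A gzip member starts with the two bytes 0x1f 0x8b.
    static final byte[] GZIP_MAGIC = {(byte) 0x1F, (byte) 0x8B};

    enum Format { ZSTD, GZIP, PLAIN }

    private static boolean startsWith(byte[] head, int n, byte[] magic) {
        if (n < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) return false;
        }
        return true;
    }

    /** Peeks at up to 4 bytes and pushes them back, so no input is consumed. */
    static Format sniff(PushbackInputStream in) throws IOException {
        byte[] head = new byte[4];
        int n = in.readNBytes(head, 0, head.length); // requires Java 9+
        if (n > 0) in.unread(head, 0, n);
        if (startsWith(head, n, ZSTD_MAGIC)) return Format.ZSTD;
        if (startsWith(head, n, GZIP_MAGIC)) return Format.GZIP;
        return Format.PLAIN;
    }

    /** Wraps the already-open stream with a decoder chosen by magic number. */
    static InputStream open(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 4); // room for 4 pushed-back bytes
        switch (sniff(in)) {
            case GZIP:
                return new GZIPInputStream(in);
            case ZSTD:
                // Assumption: a real implementation would pass `in` to the zstd codec here.
                throw new UnsupportedOperationException("no zstd decoder in this sketch");
            default:
                return in; // uncompressed fallback
        }
    }

    /** Test helper: gzip-compresses a byte array in memory. */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] text = "hello\n".getBytes(StandardCharsets.UTF_8);
        // Both inputs are decoded through the same entry point.
        for (byte[] input : new byte[][] {gzip(text), text}) {
            InputStream in = open(new ByteArrayInputStream(input));
            System.out.print(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

With a Hadoop codec, the `ZSTD` branch would presumably call the codec's `createInputStream` on the pushback stream instead of throwing; whether that composes cleanly with splittable input is a separate question that this sketch does not address.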
