Rafferty97 commented on PR #20626: URL: https://github.com/apache/datafusion/pull/20626#issuecomment-4494169844
> Thank you for this PR @Rafferty97 and I am sorry for the delay in reviewing > > I totally see the usecase for supporting non UTF-8 character sets. However, I worry that there will be no obvious place to stop (as in why not uuencoded or base64 encoded data, etc) > > I realize the same argument could be applied to supporting compression but that is built in directly, but I am trying to keep the DataFusion feature set from growing out of control (and help mitigate growth in binary size) > > What would you think about pulling out the "pre-parse transformation steps" into an API you could extend in an external system, rather than baking character set support into DataFusion? > > That would look something like: > > 1. Add some sort of API like `PreReadDecoder` or something > 2. Add a way to register those decoders / translators as an extension > 3. Refactor the existing compression codec's to use the new APIs > 4. Make an example showing how to use the new API to implement support for different character sets. > > That being said I am not opposed to this feature per-se but I do think the binary size / avoiding feature bloat is an important discussion to have Hey Andrew, thanks for taking a look at my PR. The `encoding_rs` dependency should definitely be optional, unless I've made a mistake somewhere in my implementation, but I do take your point about keeping the library small and focused. I'd be open to extracting out the pre-parse transformation logic into its own API in a new PR, and I can have a go at extracting out the compression codec to use it as well, probably as a follow up PR. I'll start prototyping it soon and post an update once I've learnt more. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
