Rafferty97 commented on PR #20626:
URL: https://github.com/apache/datafusion/pull/20626#issuecomment-4494169844

   > Thank you for this PR @Rafferty97 and I am sorry for the delay in reviewing
   > 
   > I totally see the usecase for supporting non UTF-8 character sets. 
However, I worry that there will be no obvious place to stop (as in why not 
uuencoded or base64 encoded data, etc)
   > 
   > I realize the same argument could be applied to supporting compression but 
that is built in directly, but I am trying to keep the DataFusion feature set 
from growing out of control (and help mitigate growth in binary size)
   > 
   > What would you think about pulling out the "pre-parse transformation 
steps" into an API you could extend in an external system, rather than baking 
character set support into DataFusion?
   > 
   > That would look something like:
   > 
   > 1. Add some sort of API like `PreReadDecoder` or something
   > 2. Add a way to register those decoders / translators as an extension
   > 3. Refactor the existing compression codec's to use the new APIs
   > 4. Make an example showing how to use the new API to implement support for 
different character sets.
   > 
   > That being said I am not opposed to this feature per-se but I do think the 
binary size / avoiding feature bloat is an important discussion to have
   
   Hey Andrew, thanks for taking a look at my PR. The `encoding_rs` dependency 
should definitely be optional, unless I've made a mistake somewhere in my 
implementation, but I do take your point about keeping the library small and 
focused.
   
   I'd be open to extracting out the pre-parse transformation logic into its 
own API in a new PR, and I can have a go at extracting out the compression 
codec to use it as well, probably as a follow up PR. I'll start prototyping it 
soon and post an update once I've learnt more.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to