joosthooz commented on PR #13709: URL: https://github.com/apache/arrow/pull/13709#issuecomment-1201019805
I got it working but it isn't pretty. I had to duplicate the Python-specific `encoding` field in both `CsvFileFormat` and `CsvFragmentScanOptions`, and modify the getter function `default_fragment_scan_options` in `FileFormat`!

From https://github.com/apache/arrow/pull/13709/commits/4d9802b60480180e8deab6dd162a72d837cccdeb:

```
It needs to be stored in both CsvFileFormat and CsvFragmentScanOptions because if the user has a reference to these separate objects, they would otherwise become inconsistent. One would report the default 'utf8' (forgetting the user's encoding choice), while the other would still properly report the requested encoding. To the user it would be unclear which of these values would be eventually used by the transcoding.
```

This also seems pretty undesirable. What if we added an option that lets the user specify a library to do the transcoding in C++, which we then dynamically link to? Then we can just add the field to the C struct, and the Python transcoding in this PR would also still work nicely.
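To make the inconsistency concrete, here is a rough sketch using toy stand-in classes. These are not the actual pyarrow classes: `NativeScanOptions`, `ScanOptionsWrapper`, `FormatNaive` and `FormatWithCopy` are hypothetical names invented for illustration, and only `encoding` and `default_fragment_scan_options` are taken from the PR. The assumption is that the native struct has no `encoding` field, so a wrapper rebuilt from it falls back to the default.

```python
from dataclasses import dataclass


@dataclass
class NativeScanOptions:
    """Hypothetical stand-in for the native struct: it has no `encoding` field."""
    block_size: int = 1 << 20


class ScanOptionsWrapper:
    """Hypothetical Python wrapper carrying the Python-only `encoding`."""

    def __init__(self, native, encoding="utf8"):
        self.native = native
        self.encoding = encoding


class FormatNaive:
    """Keeps only the native part, so the user's encoding is forgotten."""

    def __init__(self, scan_options):
        self._native = scan_options.native

    @property
    def default_fragment_scan_options(self):
        # Rebuilt from the native struct alone: encoding resets to 'utf8'.
        return ScanOptionsWrapper(self._native)


class FormatWithCopy:
    """Duplicates `encoding` on the format so both objects agree."""

    def __init__(self, scan_options):
        self._native = scan_options.native
        self.encoding = scan_options.encoding

    @property
    def default_fragment_scan_options(self):
        # The duplicated field is passed back into the rebuilt wrapper.
        return ScanOptionsWrapper(self._native, self.encoding)


user_options = ScanOptionsWrapper(NativeScanOptions(), encoding="latin-1")
naive = FormatNaive(user_options).default_fragment_scan_options
fixed = FormatWithCopy(user_options).default_fragment_scan_options
print(user_options.encoding, naive.encoding, fixed.encoding)
```

In this sketch the naive variant reports a different encoding than the object the user originally created, which is the ambiguity the commit message describes; the duplicated field removes it at the cost of storing the same value in two places.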
