[GitHub] [arrow] joosthooz commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

GitBox Mon, 01 Aug 2022 03:32:24 -0700


joosthooz commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1201019805


   I got it working, but it isn't pretty. I had to duplicate the Python-specific 
`encoding` field in both `CsvFileFormat` and `CsvFragmentScanOptions`, and 
modify the getter function `default_fragment_scan_options` in `FileFormat`!
   From 
https://github.com/apache/arrow/pull/13709/commits/4d9802b60480180e8deab6dd162a72d837cccdeb:
   ```
   It needs to be stored in both CsvFileFormat and CsvFragmentScanOptions 
because if the user has a reference to these separate objects, they would 
otherwise become inconsistent.
   One would report the default 'utf8' (forgetting the user's encoding choice), 
while the other would still properly report the requested encoding.
   To the user it would be unclear which of these values would eventually be 
used by the transcoding.
   ```
   This also seems pretty undesirable. 
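
   To make the inconsistency concrete, here is a minimal pure-Python sketch. 
All class names below (`CStruct`, `ScanOptions`, `LossyFormat`, 
`ConsistentFormat`) are hypothetical stand-ins, not the actual pyarrow 
classes; the sketch only assumes that the getter rebuilds a Python wrapper 
from a C++ struct that has no `encoding` field.

```python
class CStruct:
    """Hypothetical stand-in for the C++ options struct (no encoding field)."""

    def __init__(self):
        self.block_size = 1 << 20


class ScanOptions:
    """Hypothetical stand-in for CsvFragmentScanOptions."""

    def __init__(self, c_struct, encoding="utf8"):
        self.c_struct = c_struct
        self.encoding = encoding  # Python-only field


class LossyFormat:
    """Keeps only the C++ struct, so the encoding is forgotten."""

    def __init__(self, options):
        self._c_struct = options.c_struct

    @property
    def default_fragment_scan_options(self):
        # A fresh wrapper is built from the struct; encoding falls back to 'utf8'.
        return ScanOptions(self._c_struct)


class ConsistentFormat(LossyFormat):
    """Duplicates the encoding so the getter can restore it."""

    def __init__(self, options):
        super().__init__(options)
        self.encoding = options.encoding

    @property
    def default_fragment_scan_options(self):
        return ScanOptions(self._c_struct, encoding=self.encoding)


user_options = ScanOptions(CStruct(), encoding="latin-1")
lossy = LossyFormat(user_options).default_fragment_scan_options
consistent = ConsistentFormat(user_options).default_fragment_scan_options
```

   In this sketch `lossy.encoding` falls back to the default while 
`user_options.encoding` still holds the requested value, which is the mismatch 
described in the commit message; `consistent.encoding` matches the user's 
choice only because the field is stored twice.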
   What if we added an option that lets the user specify a library to 
do the transcoding in C++, which we then link to dynamically? Then we could just 
add the field to the C struct, and the Python transcoding in this PR would 
still work nicely. 
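
   For reference, Python-side transcoding of an input stream can be sketched 
with the standard library alone. This is only an illustration using 
`codecs.EncodedFile`, not necessarily how this PR implements it; the helper 
name `transcode_to_utf8` is hypothetical.

```python
import codecs
import io


def transcode_to_utf8(raw_stream, src_encoding):
    """Wrap a binary stream so that reads yield UTF-8 encoded bytes.

    Hypothetical helper: bytes read from raw_stream are decoded with
    src_encoding and re-encoded as UTF-8 by the standard library wrapper.
    """
    return codecs.EncodedFile(
        raw_stream, data_encoding="utf8", file_encoding=src_encoding
    )


# A small CSV payload stored as Latin-1 bytes.
text = "name,city\nZo\u00eb,Li\u00e8ge\n"
raw = io.BytesIO(text.encode("latin-1"))

utf8_bytes = transcode_to_utf8(raw, "latin-1").read()
```

   A CSV reader that expects UTF-8 could then consume `utf8_bytes` (or the 
wrapped stream) without knowing the original encoding.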


