pitrou commented on code in PR #13820:
URL: https://github.com/apache/arrow/pull/13820#discussion_r951559039


##########
python/pyarrow/_dataset.pyx:
##########
@@ -1251,6 +1266,9 @@ cdef class CsvFragmentScanOptions(FragmentScanOptions):
 
     cdef:
         CCsvFragmentScanOptions* csv_options
+        # The encoding field in ReadOptions does not exist in the C++ struct.
+        # We need to store it here and override it when reading read_options
+        ReadOptions read_options_py

Review Comment:
   Same here.



##########
python/pyarrow/_dataset.pyx:
##########
@@ -1278,11 +1297,18 @@ cdef class CsvFragmentScanOptions(FragmentScanOptions):
 
     @property
     def read_options(self):
-        return ReadOptions.wrap(self.csv_options.read_options)
+        read_options = ReadOptions.wrap(self.csv_options.read_options)
+        if self.read_options_py is not None:
+            read_options.encoding = self.read_options_py.encoding
+        return read_options
 
     @read_options.setter
     def read_options(self, ReadOptions read_options not None):
         self.csv_options.read_options = deref(read_options.options)
+        self.read_options_py = read_options
+        if read_options.encoding != 'utf8':

Review Comment:
   Note that encoding names can have aliases, so comparing against the literal `'utf8'` will not catch every spelling of UTF-8. For example:
   ```python
   >>> import codecs
   >>> codecs.lookup('utf-8').name
   'utf-8'
   >>> codecs.lookup('utf8').name
   'utf-8'
   >>> codecs.lookup('UTF8').name
   'utf-8'
   ```
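
One possible way to make the check alias-safe is to compare canonical codec names instead of raw strings. This is only a sketch: `is_utf8` is a hypothetical helper name, not something that exists in the PR or in pyarrow.

```python
import codecs


def is_utf8(encoding: str) -> bool:
    """Return True if `encoding` names UTF-8 under any registered alias.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    # codecs.lookup() resolves aliases and returns a CodecInfo whose
    # .name attribute is the canonical codec name. It raises LookupError
    # for an unknown encoding.
    return codecs.lookup(encoding).name == 'utf-8'


for name in ('utf8', 'UTF-8', 'utf_8', 'latin1'):
    print(name, is_utf8(name))
```

With something like this, the setter could test `not is_utf8(read_options.encoding)` rather than `read_options.encoding != 'utf8'`.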
   
   



##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -3130,6 +3130,58 @@ def test_csv_fragment_options(tempdir, dataset_reader):
         pa.table({'col0': pa.array(['foo', 'spam', 'MYNULL'])}))
 
 
+def test_encoding(tempdir, dataset_reader):
+    path = str(tempdir / 'test.csv')
+
+    for encoding, input_rows, expected_table in [

Review Comment:
   `expected_table` here is unused.



##########
python/pyarrow/_dataset.pyx:
##########
@@ -1171,6 +1178,10 @@ cdef class CsvFileFormat(FileFormat):
     """
     cdef:
         CCsvFileFormat* csv_format
+        # The encoding field in ReadOptions does not exist in the C++ struct.
+        # We need to store it here and override it when reading
+        # default_fragment_scan_options.read_options
+        public ReadOptions read_options_py

Review Comment:
   This does not need to be visible to the user. How about renaming it to stress that it's an internal detail?
   ```suggestion
           public ReadOptions _read_options_py
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
